computationalstylistics / stylo

R package for stylometric analyses

Proposal for a replication or documentation mode #53

Open christofs opened 2 years ago

christofs commented 2 years ago

This is just an idea or a suggestion. With the way stylo works at the moment, it is not very easy to create results that can easily be replicated or precisely documented. The reasons for this, if my probably superficial analysis is correct, have something to do with the following aspects:

  1. A typical way of using stylo is to call it from the R prompt and then set parameters in the GUI. Even when the parameters are set in the call itself, people usually work from the R prompt rather than from an R script.
  2. Some files, notably the plots, are named according to some of the GUI options (e.g. "PCA" and "1000mfw"), but not all parameters are (and can be) included in the file name. In any case, "stylo_config.txt" is a much better place for documenting these parameters.
  3. However, "stylo_config.txt" is overwritten at every fresh run, even when plots with new filenames are created, so the parameters for the earlier analysis are lost.
  4. The frequency table can be saved, of course, but again it is not easy to figure out later on exactly which table was used to produce one of the earlier plots.

In practice, this means people need to copy these files to a new folder whenever they think an analysis is good and should be kept. More often than not, by the time you realize this, "stylo_config.txt" has already been overwritten, and the table of frequencies either wasn't saved or has been overwritten in the meantime as well.

A simple solution for this could be a "replication mode" or "documentation mode" that can be activated when calling stylo. One could simply say: "documentation=TRUE". Then, the following things would happen:

  1. At every run, as long as that parameter is TRUE, a subfolder is created whose name is a simple timestamp (yyyy-mm-dd-hh-mm).
  2. Optionally, if the parameter documentation.label = "arbitrary-label-string" is also given, that label is appended to the subfolder name.
  3. In this folder, all data that is necessary to replicate the analysis is copied: the plot, the stylo_config, the frequency table, possibly other files like a "metadata.csv" if it was used for labeling.
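
The three steps above could be sketched in base R roughly as follows. This is only an illustration of the idea, not existing stylo functionality: the helper name archive_run() and its arguments (files, label, root, now) are hypothetical, and the caller has to say which files belong to the run.

```r
# Hypothetical helper, NOT part of stylo: archive the outputs of one run
# in a time-stamped subfolder.
archive_run <- function(files, label = NULL, root = ".", now = Sys.time()) {
  # Folder name: yyyy-mm-dd-hh-mm, optionally followed by a label
  stamp <- format(now, "%Y-%m-%d-%H-%M")
  folder_name <- if (is.null(label)) stamp else paste(stamp, label, sep = "_")
  target <- file.path(root, folder_name)
  dir.create(target, recursive = TRUE, showWarnings = FALSE)

  # Copy only the files that actually exist; missing ones are skipped
  present <- files[file.exists(files)]
  file.copy(present, target, overwrite = TRUE)

  # Return the folder path invisibly so it can be reused or printed
  invisible(target)
}
```

After a run, something like archive_run(c("stylo_config.txt", "metadata.csv"), label = "arbitrary-label-string") would then copy those files into a folder named after the current time. Because the folder name only goes down to the minute, two runs in the same minute with the same label would share a folder.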

Of course, this creates a lot of data. Folders that turn out not to be useful need to be deleted at some point. But at least no data is lost.

To replicate an analysis, one simply needs to set the working directory to the right time-stamped folder and run stylo again to repeat (and then possibly vary) the analysis. Maybe a parameter like "replication=TRUE" could be used to activate all parameters necessary, for instance to make sure stylo uses the frequency table from the documentation.

Maybe not quite thought out to the end, but something along these lines might be useful.

jmclawson commented 1 year ago

I took a stab at this today. https://gist.github.com/jmclawson/52252349dd100e426c2267b5de48aade

Does the code make sense as you imagine it, @christofs ? There are mainly two functions it makes available.

stylo_log

The first, stylo_log(), accepts a stylo() object that has just been created, and it logs the date and time, the stylo call, and the config file. It's used like this:

# option 1: pipe from stylo() into stylo_log()
stylo() |> stylo_log()

# option 2: enclose stylo() in stylo_log()
stylo_log(stylo())

# option 3: call stylo_log() on a stylo object immediately after creating it:
my_object <- stylo()
stylo_log(my_object)

Options exist, including log_label to redefine the label of the folder and log file, add_dir_date to add a date to the folder name (by default it doesn't do this), and log_date for appending a date to the end of the text file (with a default value of Sys.Date()). At its simplest, stylo_log() will create a folder called "stylo_log" containing a text log file for each day analyses are run.

At the same time that it appends the call and configuration to a log file, it also copies any files made at the same time as stylo_config.txt into the directory used for logging, prepending each of their file names with the date and time they were originally created.
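
For readers who want a feel for how such a copy step might work, here is a minimal base R sketch. It is not the code from the gist: the function name copy_recent_outputs(), the 60-second window, and the use of modification time (which R exposes portably, unlike creation time) are all assumptions made for illustration.

```r
# Hypothetical sketch, NOT the gist's implementation: copy files written
# around the same time as stylo_config.txt into a log folder, prefixing
# each copy with the file's modification date and time.
copy_recent_outputs <- function(work_dir = ".",
                                log_dir = file.path(work_dir, "stylo_log"),
                                config = "stylo_config.txt",
                                window_secs = 60) {
  config_path <- file.path(work_dir, config)
  if (!file.exists(config_path)) stop("No config file found in ", work_dir)
  config_time <- file.mtime(config_path)

  # Regular files only (list.files() also returns subdirectories)
  candidates <- list.files(work_dir, full.names = TRUE)
  candidates <- candidates[!dir.exists(candidates)]

  # Keep files whose modification time is within the window of the config's
  times <- file.mtime(candidates)
  gap <- abs(as.numeric(difftime(times, config_time, units = "secs")))
  keep <- gap <= window_secs

  dir.create(log_dir, recursive = TRUE, showWarnings = FALSE)
  prefix <- format(times[keep], "%Y-%m-%d_%H-%M-%S")
  targets <- file.path(log_dir, paste(prefix, basename(candidates[keep]), sep = "_"))
  file.copy(candidates[keep], targets, overwrite = TRUE)

  # Return the paths of the copies invisibly
  invisible(targets)
}
```

The time window is a heuristic: a long-running analysis may write its plot well before or after the config file, so the window would need tuning in practice.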

stylo_replicate

The second function, stylo_replicate(), is a little more complex. It will do two things:

  1. If it is not passed a date_time argument, it will run both stylo() and stylo_log(), passing along the log_label, add_dir_date, and log_date arguments to stylo_log(), while passing along ... to stylo(). It's used like this: stylo_replicate() (with the parentheses accepting anything that will work with stylo())
  2. If date_time is passed as an argument, it will parse the appropriate log (with "appropriate" defined by defaults or by the log_label, add_dir_date, and log_date arguments) created by stylo_log() to find the settings used for a previous analysis from that date and time, and it will re-run the analysis using the same settings, and add an item to the log. It is used like this: stylo_replicate("2023-01-27 13:46:26")
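
The second behaviour, looking up an earlier call by its date and time and running it again, can be illustrated with a small base R sketch. The log format here (one line per run, time stamp and call separated by a tab) and the function names log_call(), find_call(), and replicate_call() are assumptions for illustration only; the gist's actual log format may well differ.

```r
# Hypothetical sketch, NOT the gist's log format: one line per run,
# "<date time><TAB><call text>". Assumes the call fits on a single line
# and contains no tab characters.
log_call <- function(log_file, call_text, when = Sys.time()) {
  stamp <- format(when, "%Y-%m-%d %H:%M:%S")
  cat(stamp, "\t", call_text, "\n", sep = "", file = log_file, append = TRUE)
  invisible(stamp)
}

# Return the call text logged at exactly the given date and time
find_call <- function(log_file, date_time) {
  parts <- strsplit(readLines(log_file), "\t", fixed = TRUE)
  stamps <- vapply(parts, `[`, character(1), 1)
  hit <- which(stamps == date_time)
  if (length(hit) == 0) stop("No log entry for ", date_time)
  parts[[hit[1]]][2]
}

# Parse the logged call text and evaluate it again
replicate_call <- function(log_file, date_time, envir = parent.frame()) {
  eval(parse(text = find_call(log_file, date_time)), envir = envir)
}

# Usage with stand-in calls (no stylo needed to try the mechanism)
log_file <- tempfile(fileext = ".log")
log_call(log_file, "sum(1, 2, 3)", when = as.POSIXct("2023-01-27 13:46:26"))
replicate_call(log_file, "2023-01-27 13:46:26")
```

Re-evaluating text from a log file runs whatever code is in it, so this approach only makes sense for logs you created yourself.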

christofs commented 1 year ago

Sounds cool! Will report back as soon as I have been able to do a test run.