ideas for further support of dataframe-like Dictionaries

bhaller commented 3 years ago

In https://github.com/bodkan/slendr/pull/66, @bodkan and I have discussed things he does in his slendr scripts to facilitate dataframe-like usage of Dictionary. Some of those things could be carried over into Eidos. To quote:

filter(), also interesting. Something like this could also conceivably be added to Eidos.

num_rows(), ditto. Possibly this idea of a Dictionary as a dataframe-like object ought to be more formalized. Eidos could define a subclass of Dictionary named DataFrame that defined more of this sort of functionality.

print_table(), ah, interesting. Yes, the default print for Dictionary is not so readable when the Dictionary is, conceptually, a dataframe. Again, if DataFrame were a subclass this could be handled in Eidos.

These would be very useful additions indeed, even after I make the JSON switch once the slendr "format" settles down.

So, the idea is to maybe make a DataFrame (Dataframe?) subclass of Dictionary that formalizes this usage pattern and provides additional support. The fact that it's a Dictionary subclass would maybe not be something that is committed to formally; maybe that's an implementation detail that is not guaranteed to the end user, in case we want to shift to an underlying implementation more like R. In any case, things like counting the number of rows in the dataframe, printing it in a readable way – and I forget what @bodkan's filter() does, but that too. :->

Some dataframe-like functionality has already been put into Dictionary. That might get moved down to the subclass, but that would break backward compatibility for those still using it. It might therefore be left where it is – but that does clutter up the Dictionary interface, and makes for a confusing API design. In particular, the getRowValues() method might move; appendKeysAndValuesFrom() might also. IIRC, @grahamgower has an interest in this. Comments/ideas welcome.

petrelharp commented 3 years ago

I vote for DataFrame over Dataframe. =)

bodkan commented 3 years ago

Another vote for DataFrame over Dataframe!

I forget what @bodkan's filter() does, but that too. :->

The filter() function in slendr is a primitive implementation of the semantics of the filter() function from R's tidyverse dplyr package (the link has some useful examples). Briefly, given 1. an Eidos DataFrame/Dictionary called d, 2. a column name, and 3. a value (or a vector of values), this function returns only those rows of d for which the column matches those values.

For instance, if GENEFLOWS here is a DataFrame/Dictionary with all gene flow events to be scheduled at some point during the simulation, calling events = filter(GENEFLOWS, "tstart_gen", sim.generation - BURNIN_LENGTH); will return only those events that are to be activated in the scheduled generation (events from slendr are passed in "1-based time coordinates", so burnin length is subtracted during the actual runtime).

I also implemented a negation of this, where specifying negate = T flips the condition around. For example, calling cleanups = filter(POPULATIONS, "tremove_gen", -1, negate = T) returns only those populations which are scheduled for removal from the simulation at some point (those that are not scheduled for removal have the tremove_gen field in the table set to a default value of -1). In R one could do something like tremove_gen != -1 directly in filter, but that's because of non-standard evaluation. Having an explicit negate = argument works well too, I think.
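As an illustration of the filter() semantics described above, here is a minimal Python sketch (not the slendr or Eidos code). The table is modeled as a plain dict of equal-length lists; the function name filter_rows and the sample population data are hypothetical, chosen only for this example.

```python
def filter_rows(table, column, values, negate=False):
    """Keep the rows of a column-oriented table whose `column` entry is in `values`."""
    if not isinstance(values, (list, tuple, set)):
        values = [values]  # accept a single value as well as a collection
    wanted = set(values)
    # One boolean per row; `negate` flips the condition.
    keep = [(v in wanted) != negate for v in table[column]]
    # Apply the same row mask to every column.
    return {name: [x for x, k in zip(col, keep) if k] for name, col in table.items()}


# Hypothetical table: -1 marks populations that are never removed.
populations = {"pop": ["AFR", "EUR", "NEA"], "tremove_gen": [-1, -1, 500]}
cleanups = filter_rows(populations, "tremove_gen", -1, negate=True)
print(cleanups)
```

With negate=True only the row whose tremove_gen differs from -1 survives, which matches the "scheduled for removal" case in the text.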

This implementation doesn't cover all the crazy filtering types one might want to do, but I don't think extensive support for complex table-munging would be the goal here.

One thing that is slightly tricky is printing out larger tables in a readable form. R has features for guessing how many columns (and rows) of a table are reasonable to print out given the current size of the terminal, so that each row of a data frame is only printed on a single row in the terminal (omitting columns which would cause the row to break over multiple terminal rows). A print_table() function that I implemented gets around that by printing a column name for each cell and also clearly separating (and numbering) rows from each other. This takes up more space, but it does make it clear which number belongs to which column and which row:

> df = Dictionary("x", 1:3, "y", 4:6, "z", 6:8)
>
> print_table(df)

row #0
-------
| x: 1 | y: 4 | z: 6 | 

row #1
-------
| x: 2 | y: 5 | z: 7 | 

row #2
-------
| x: 3 | y: 6 | z: 8 |
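
The layout above can be reproduced with a short Python sketch. This is only an illustration of the format, not the slendr print_table() code; the table is assumed to be a dict of equal-length lists, and format_table is a hypothetical name.

```python
def format_table(table):
    """Return the row-by-row layout shown above as a single string."""
    n_rows = len(next(iter(table.values()))) if table else 0
    blocks = []
    for i in range(n_rows):
        # Label every cell with its column name so wide tables stay readable.
        cells = " | ".join(f"{name}: {col[i]}" for name, col in table.items())
        blocks.append(f"row #{i}\n-------\n| {cells} |")
    return "\n\n".join(blocks)


df = {"x": [1, 2, 3], "y": [4, 5, 6], "z": [6, 7, 8]}
print(format_table(df))
```

Because each cell carries its own column label, the output never depends on the terminal width, at the cost of repeating the column names on every row.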

Just some thoughts. Regardless of which functions would potentially make it to Eidos, having some basic DataFrame support would be great. And I definitely agree that the Dictionary subclassing would not need to be exposed and could remain an implementation detail.

bhaller commented 3 years ago

OK, here's a sketch of what I plan to do:

Add a new class, DataFrame, that is a subclass of Dictionary (I think, after reflection, this subclass relationship will be public/supported)
DataFrame will conceptually make the Dictionary keys into columns, and the Dictionary values into the rows of each column
It will enforce that every column must be the same length, and that values that are matrices/arrays are not allowed; values must be simple vectors of uniform length
It will provide new methods, subsetRows() and subsetColumns(), that will return DataFrames with rows/columns selected by index, by logical vector, or (for columns) by name
The existing Dictionary method getRowValues(), which does much the same thing as the proposed subsetRows(), will be deprecated and will warn once the first time it is used in a session, but will not be removed for now (preserving backward compatibility)
New methods nrow() and ncol() and dim() will be added, gesturing towards R (note that since in Eidos it would be possible to make a vector/matrix/array of DataFrame objects, the Eidos nrow() / ncol() / dim() functions will not and should not do what these methods will do)
Similarly, new methods rbind() and cbind() will be added (again, the Eidos rbind() / cbind() functions will not and should not do what these methods do); the existing Dictionary method appendKeysAndValuesFrom() will be kept and not deprecated since it might be useful for non-dataframe-ish uses
DataFrame will print in the Eidos console similarly to how R prints dataframes; thanks for the info on that @bodkan, I'll contemplate my options on the right format. Perhaps the standard print will just assume an infinitely wide console, but some kind of print() method could be provided with other options for printing (supply a width, or request a style like you describe, or who knows what). TBD.
I'm not sure about filter(). You can just make a logical vector and then subset the dataframe with that, right? I'm not really sure this deserves to be its own separate function. Like df.subsetRows(df.getValue(key) == value) or df.subsetRows(df.getValue(key) != value); that seems quite concise, so why is a separate function needed? Doing it this way is also more general/flexible since then you can easily do things like df.subsetRows((df.getValue(key1) == value1) & (df.getValue(key2) == value2)). I dunno. I'm leaning towards not adding it, at least for now, but if there's a case to be made for it I'm receptive.
DataFrame will be constructable in the same ways as Dictionary, more or less.
We want to support reading in of CSV/TSV files. I think the best way is to add a new readCSV() function to Eidos that returns a DataFrame, rather than trying to wedge it into the already-overloaded DataFrame constructor. It would probably look much like R's read.csv(): something like readCSV(file, [l$ header=T], [s$ sep=","], [s$ quote="\""], [s$ dec="."], [s$ comment=""]). I don't think I'll support R's fill parameter, since Eidos doesn't have NA, and in any case I tend to dislike this sort of automatic compensation for malformed data. I don't think I'll provide equivalents to R's read.csv2(), read.delim(), and read.delim2(); just pass the delimiters you want to readCSV(). I also won't provide an equivalent to read.table() at this time, although that's a possibility for the future if greater flexibility than readCSV() provides is needed.
We also want to support writing out of CSV/TSV files. For maximum flexibility, I think this will be provided as new "csv" and "tsv" options to the existing serialize() method on Dictionary. This lets you get a string representation in CSV/TSV that you can then pass to writeFile(), to writeTempFile(), to cat(), or whatever else you might want to do to it (prepending comment lines to it, etc.). The Image class has its own write() method, but that's because it writes out binary data rather than text. These new serialize options would actually be supported on Dictionary too, writing out nothing for a given field in a given row if the dictionary is ragged and has no value at that position; i.e., ",," or "\t\t" with no value between the delimiters (I think that's the right behavior?).
Dictionary and DataFrame will be fairly interoperable; you'll be able to create a Dictionary from a DataFrame, or create a DataFrame from a Dictionary.
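
To make the "logical vector instead of filter()" point from the list above concrete, here is a small Python sketch of the idea. It is not Eidos: subset_rows is a hypothetical stand-in for the proposed subsetRows() method, the table is a dict of lists, and the equal-length check mirrors the invariant the sketch says DataFrame will enforce.

```python
def subset_rows(table, mask):
    """Return a new column-oriented table keeping the rows where mask is True."""
    lengths = {len(col) for col in table.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    if lengths and len(mask) != next(iter(lengths)):
        raise ValueError("mask length must equal the number of rows")
    return {name: [x for x, keep in zip(col, mask) if keep] for name, col in table.items()}


# Hypothetical data; the mask combines two per-column conditions, like
# (key1 == value1) & (key2 == value2) in the text.
df = {"key1": ["a", "b", "a"], "key2": [1, 1, 2]}
mask = [k1 == "a" and k2 == 1 for k1, k2 in zip(df["key1"], df["key2"])]
print(subset_rows(df, mask))
```

Building the mask separately is what makes this approach more general than a single-column filter: any combination of conditions reduces to one boolean per row.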

That's everything I can think of for now. I plan to go forward with this work immediately, so feedback is better if it is given ASAP. :->

bodkan commented 3 years ago

This sounds perfect!

I'm not sure about filter(). You can just make a logical vector and then subset the dataframe with that, right? I'm not really sure this deserves to be its own separate function. Like df.subsetRows(df.getValue(key) == value) or df.subsetRows(df.getValue(key) != value); that seems quite concise, so why is a separate function needed? Doing it this way is also more general/flexible since then you can easily do things like df.subsetRows((df.getValue(key1) == value1) & (df.getValue(key2) == value2)). I dunno. I'm leaning towards not adding it, at least for now, but if there's a case to be made for it I'm receptive.

That makes sense. I guess I’m too indoctrinated by the tidyverse. :) What you propose is exactly how “base R” does it, which is why it’s probably better to keep this way of interacting with data frames in Eidos as well.

These new serialize options would actually be supported on Dictionary too, writing out nothing for a given field in a given row if the dictionary is ragged and has no value at that position; i.e., ",," or "\t\t" with no value between the delimiters (I think that's the right behavior?).

👍

petrelharp commented 3 years ago

Sounds perfect. My only suggestion is that you not worry about doing sensible CSV output for non-DataFrame dictionaries.

bhaller commented 3 years ago

Sounds perfect. My only suggestion is that you not worry about doing sensible CSV output for non-DataFrame dictionaries.

It'll come out for free. :->
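
For reference, the ragged-dictionary behavior discussed above (an empty field between delimiters where a column has no value in that row) can be sketched in Python as follows. This is an illustration only, not the Eidos serialize() implementation; the function name to_csv, the sample data, and the presence of a header row are assumptions.

```python
import csv
import io


def to_csv(table, sep=","):
    """Serialize a possibly ragged dict of lists, leaving missing cells empty."""
    names = list(table)
    n_rows = max((len(col) for col in table.values()), default=0)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=sep, lineterminator="\n")
    writer.writerow(names)  # assumption: column names are written as a header
    for i in range(n_rows):
        # A column shorter than the current row index contributes an empty field.
        writer.writerow([table[n][i] if i < len(table[n]) else "" for n in names])
    return out.getvalue()


ragged = {"a": [1, 2, 3], "b": [4], "c": [7, 8, 9]}
print(to_csv(ragged), end="")
```

Passing sep="\t" gives the TSV variant, with two adjacent tabs where the short column has run out.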

bhaller commented 3 years ago

OK, CSV is a pain in the butt. Who comes up with these crappy data exchange formats?? Anyway, if anyone cares to review this, here's my proposal for readCSV() in Eidos. Don't feel you have to read this, it's probably fine. :->

(object<DataFrame>$)readCSV(string$ filePath, [logical$ header = T], [string$ sep = ","], [string$ quote = "\""], [string$ dec = "."], [string$ comment = ""])

Reads data from a CSV or other delimited file specified by filePath and returns a DataFrame object containing the data in a tabular form. CSV (comma-separated value) files use a somewhat standard file format in which a table of data is provided, with values within a row separated by commas, while rows in the table are separated by newlines. Software from R to Excel (and Eidos; see the serialize() method of Dictionary) can export data in CSV format. This function can actually also read files that use a delimiter other than commas; TSV (tab-separated value) files are a popular alternative. Since there is substantial variation in the exact file format for CSV files, this documentation will try to specify the precise format expected by this function. Note that CSV files represent values quite differently than Eidos usually does, and some of the format options allowed in CSV files (such as decimal commas) are not otherwise available in Eidos.

If header is T (the default), the first row of data is taken to be a header, containing the string names of the columns in the data table; those names will be used by the resulting DataFrame. If header is F, a header row is not expected and column names are auto-generated as V1, V2, etc.

The separator between values is supplied by sep; it is a comma by default, but a tab can be used instead by supplying tab ("\t" in Eidos), or another character may also be used.

Similarly, the character used to quote string values is a double quote ("\"" in Eidos), by default, but another character may be supplied in quote. When the string delimiter is encountered, all following characters are considered to be part of the string until another string delimiter is encountered, terminating the string; this includes spaces, comment characters, newlines, and everything else. Within a string value, the string delimiter itself is used twice in a row to indicate that the delimiter itself is present within the string; for example, if the string value (shown without the usual surrounding quotes to try to avoid confusion) is she said "hello", and the string delimiter is the double quote as it is by default, then in the CSV file the value would be given as "she said ""hello""". The usual Eidos style of escaping characters using a backslash is not part of the CSV standard followed here. (When a string value is provided without using the string delimiter, all following characters are considered part of the string except a newline, the value separator sep, the quote separator quote, and the comment separator comment; if none of those characters are present in the string value, the quote separator may be omitted.)
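
The doubled-quote convention described above is the same one Python's standard csv module uses by default, so it can be demonstrated there. This sketch shows the convention only; it says nothing about how readCSV() is implemented, and the sample line is made up for the example.

```python
import csv
import io

# One CSV line: an unquoted number, a quoted string containing doubled quotes,
# and a quoted string containing the separator itself.
line = '1997,"she said ""hello""","a, b"\n'

reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"', doublequote=True)
row = next(reader)
print(row)
```

The second field comes back as she said "hello" with single quote characters, and the comma inside the third field is kept as data because it appears between string delimiters.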

The character used to indicate a decimal delimiter in numbers may be supplied with dec; by default this is "." (and so 10.0 would be ten, written with a decimal point), but "," is common in European data files (and so 10,0 would be ten, written with a decimal comma). Note that dec and sep may not be the same, so that it is unambiguous whether 10,0 is two numbers (10 and 0) or one number (10.0). For this reason, European CSV files that use a decimal comma typically use a semicolon as the value separator, which may be supplied with sep=";" to readCSV().
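
The ambiguity between dec and sep can be seen in a few lines of Python. This is a simplified sketch under stated assumptions: it ignores quoting and comments, and parse_line is a hypothetical helper, not part of Eidos.

```python
def parse_line(line, sep=",", dec="."):
    """Split one unquoted line into floats, honoring the decimal delimiter."""
    if sep == dec:
        # With identical characters, 10,0 could be one number or two.
        raise ValueError("sep and dec must differ")
    return [float(field.replace(dec, ".")) for field in line.split(sep)]


print(parse_line("10,0;2,5", sep=";", dec=","))  # decimal comma, semicolon separator
print(parse_line("10,0"))                        # defaults: comma separates two values
```

The same text 10,0 yields one value (ten) in the first call and two values (ten and zero) in the second, which is why the two delimiters are required to differ.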

Finally, the remainder of a line following a comment character will be ignored when the file is read; by default comment is the empty string, "", indicating that comments do not exist at all, but "#" is a popular comment prefix.

To translate the CSV data into a DataFrame, it is necessary for Eidos to guess what value type each column is. Quotes surrounding a value are irrelevant to this guess; for example, 1997 and "1997" are both candidates to be integer values (because some programs generate CSV output in which every value is quoted regardless of type). If every value in a column is either true, false, TRUE, FALSE, T, or F, the column will be taken to be logical. Otherwise, if every value in a column is an integer (here defined as an optional + or -, followed by nothing but decimal digits 0123456789), the column will be taken to be integer. Otherwise, if every value in a column is a floating-point number (here defined as an optional + or -, followed by decimal digits 0123456789 and ending with an optional exponent like e7, E+05, or e-2), the column will be taken to be float; the special values NAN, INF, INFINITY, -INF, and -INFINITY (not case-sensitive) are also candidates to be float (if the rest of the column is also convertible to float), representing the corresponding float constants. Otherwise, the column will be taken to be string. NULL and NA are not recognized in CSV files and will be read as strings. Every line in a CSV file must contain the same number of values (forming a rectangular data table); missing values are not allowed by readCSV() since there is no way to represent them in DataFrame (since Eidos has no equivalent of R’s NA value). Spaces are considered part of a data field and are not trimmed, following the RFC 4180 standard. These choices are an attempt to provide optimal behavior for most clients, but given the lack of any universal standard for CSV files, and the lack of any type information in the CSV format, they will not always work as desired; in such cases, it should be reasonably straightforward to preprocess input files using standard Unix text-processing tools like sed and awk.
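
The guessing order described above (logical, then integer, then float, then string) can be sketched in Python. This is an illustration of the stated rules, not the Eidos implementation. Two assumptions are made: the decimal delimiter is ".", and a float may contain a decimal point with optional fraction digits, which the prose implies but does not spell out. The name infer_column is hypothetical.

```python
import re

LOGICALS = {"true": True, "false": False, "TRUE": True, "FALSE": False, "T": True, "F": False}
INT_RE = re.compile(r"[+-]?[0-9]+")
# Assumption: optional decimal point and fraction digits, then an optional exponent.
FLOAT_RE = re.compile(r"[+-]?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?")
SPECIAL_FLOATS = {"NAN", "INF", "INFINITY", "-INF", "-INFINITY"}


def is_float(value):
    return bool(FLOAT_RE.fullmatch(value)) or value.upper() in SPECIAL_FLOATS


def infer_column(values):
    """Return (type name, converted values) for one column of raw strings."""
    if all(v in LOGICALS for v in values):
        return "logical", [LOGICALS[v] for v in values]
    if all(INT_RE.fullmatch(v) for v in values):
        return "integer", [int(v) for v in values]
    if all(is_float(v) for v in values):
        return "float", [float(v) for v in values]
    return "string", list(values)  # NULL, NA and anything else stay as strings


for column in (["T", "F", "true"], ["1997", "-3", "+7"], ["1", "2.5e3", "INF"], ["1", "NA"]):
    print(infer_column(column)[0])
```

Each rule must hold for every value in the column, so a single NA among numbers is enough to turn the whole column into strings.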

bodkan commented 3 years ago

This looks great, Ben!

One idea: functions such as read_csv and friends in the tidyverse in R have a col_types argument which can be used to help the parser determine column types (or to override what the parser would normally guess to be the column type). For instance, specifying read_tsv(..., col_types = "ciicl", ...) would parse a table with columns of types (in order) character, integer, integer, character, logical. For a more detailed description, see col_types under the "Arguments" section of the manual.

Is this something that could be added to the interface? I don't think this is always needed, but I'm thinking about situations which motivated our "type inference" discussion w.r.t. R <--> slendr <--> SLiM in the first place. I.e. cases where R saves a number in one format, which is supposed to be an integer, but it's actually loaded as float, etc. There could perhaps be other similar cases. Giving the user the power to determine which exact types they want to get out of the csv/tsv might be useful.

petrelharp commented 3 years ago

Sounds great! And yeah, sounds like a pain (too much flexibility?).

bhaller commented 3 years ago

One idea: functions such as read_csv and friends in tidyverse in R have a col_types argument...Is this something that could be added to the interface?

Yes, I've been pondering that already, because the automatic type inference is not perfect. Seems like a good idea, I'll add it to the spec.

And yeah, sounds like a pain (too much flexibility?).

And the lack of a standard spec for CSV – and to the extent that it is standardized that standard is poorly designed. It's just a can of worms, there's no getting around it. Stuff like choosing what the different delimiters are, etc., is a pain but is pretty necessary because the format just isn't standardized.

bhaller commented 3 years ago

OK, here's a new spec. Adds colTypes as suggested; also changes header to colNames, following the tidyverse, since that seems better. Posted here for posterity, no need to read through it all again unless you're feeling masochistic. :->

(object<DataFrame>$)readCSV(string$ filePath, [ls colNames = T], [Ns$ colTypes = NULL], [string$ sep = ","], [string$ quote = "\""], [string$ dec = "."], [string$ comment = ""])

Reads data from a CSV or other delimited file specified by filePath and returns a DataFrame object containing the data in a tabular form. CSV (comma-separated value) files use a somewhat standard file format in which a table of data is provided, with values within a row separated by commas, while rows in the table are separated by newlines. Software from R to Excel (and Eidos; see the serialize() method of Dictionary) can export data in CSV format. This function can actually also read files that use a delimiter other than commas; TSV (tab-separated value) files are a popular alternative. Since there is substantial variation in the exact file format for CSV files, this documentation will try to specify the precise format expected by this function. Note that CSV files represent values differently than Eidos usually does, and some of the format options allowed by readCSV(), such as decimal commas, are not otherwise available in Eidos.

If colNames is T (the default), the first row of data is taken to be a header, containing the string names of the columns in the data table; those names will be used by the resulting DataFrame. If colNames is F, a header row is not expected and column names are auto-generated as X1, X2, etc. If colNames is a string vector, a header row is not expected and colNames will be used as the column names; if additional columns exist beyond the length of colNames their names will be auto-generated. Duplicate column names will generate a warning and be made unique.

If colTypes is NULL (the default), the value type for each column will be guessed from the values it contains, as described below. If colTypes is a singleton string, it should contain single-letter codes indicating the desired type for each column, from left to right. The letters lifs have the same meaning as in Eidos signatures (logical, integer, float, and string); in addition, ? may be used to indicate that the type for that column should be guessed as by default, and _ or - may be used to indicate that that column should be skipped – omitted from the returned DataFrame. Other characters in colTypes will result in an error. If additional columns exist beyond the end of the colTypes string their types will be guessed as by default.
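
The colTypes string described above can be read as a per-column plan. The following Python sketch expands such a string; it illustrates the stated rules only and is not the Eidos code. The function name expand_col_types and the plan labels ("guess", "skip") are hypothetical, and the case where the string is longer than the number of columns is left unhandled because the text does not specify it.

```python
TYPE_CODES = {
    "l": "logical", "i": "integer", "f": "float", "s": "string",
    "?": "guess",             # guess the type, as by default
    "_": "skip", "-": "skip", # omit the column from the result
}


def expand_col_types(col_types, n_columns):
    """Turn a colTypes string (or None) into one plan entry per column."""
    if col_types is None:
        return ["guess"] * n_columns
    plan = []
    for ch in col_types:
        if ch not in TYPE_CODES:
            raise ValueError(f"unrecognized colTypes character: {ch!r}")
        plan.append(TYPE_CODES[ch])
    # Columns beyond the end of the string are guessed, as by default.
    plan += ["guess"] * (n_columns - len(plan))
    return plan


print(expand_col_types("i_s?", 6))
print(expand_col_types(None, 2))
```

A reader following this plan would convert the first column to integer, drop the second, keep the third as string, and fall back to type guessing for the rest.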

The separator between values is supplied by sep; it is a comma by default, but a tab can be used instead by supplying tab ("\t" in Eidos), or another character may also be used.

Similarly, the character used to quote string values is a double quote ("\"" in Eidos), by default, but another character may be supplied in quote. When the string delimiter is encountered, all following characters are considered to be part of the string until another string delimiter is encountered, terminating the string; this includes spaces, comment characters, newlines, and everything else. Within a string value, the string delimiter itself is used twice in a row to indicate that the delimiter itself is present within the string; for example, if the string value (shown without the usual surrounding quotes to try to avoid confusion) is she said "hello", and the string delimiter is the double quote as it is by default, then in the CSV file the value would be given as "she said ""hello""". The usual Eidos style of escaping characters using a backslash is not part of the CSV standard followed here. (When a string value is provided without using the string delimiter, all following characters are considered part of the string except a newline, the value separator sep, the quote separator quote, and the comment separator comment; if none of those characters are present in the string value, the quote delimiter may be omitted.)

The character used to indicate a decimal delimiter in numbers may be supplied with dec; by default this is "." (and so 10.0 would be ten, written with a decimal point), but "," is common in European data files (and so 10,0 would be ten, written with a decimal comma). Note that dec and sep may not be the same, so that it is unambiguous whether 10,0 is two numbers (10 and 0) or one number (10.0). For this reason, European CSV files that use a decimal comma typically use a semicolon as the value separator, which may be supplied with sep=";" to readCSV().

Finally, the remainder of a line following a comment character will be ignored when the file is read; by default comment is the empty string, "", indicating that comments do not exist at all, but "#" is a popular comment prefix.

To translate the CSV data into a DataFrame, it is necessary for Eidos to guess what value type each column is unless a column type is specified by colTypes. Quotes surrounding a value are irrelevant to this guess; for example, 1997 and "1997" are both candidates to be integer values (because some programs generate CSV output in which every value is quoted regardless of type). If every value in a column is either true, false, TRUE, FALSE, T, or F, the column will be taken to be logical. Otherwise, if every value in a column is an integer (here defined as an optional + or -, followed by nothing but decimal digits 0123456789), the column will be taken to be integer. Otherwise, if every value in a column is a floating-point number (here defined as an optional + or -, followed by decimal digits 0123456789 and ending with an optional exponent like e7, E+05, or e-2), the column will be taken to be float; the special values NAN, INF, INFINITY, -INF, and -INFINITY (not case-sensitive) are also candidates to be float (if the rest of the column is also convertible to float), representing the corresponding float constants. Otherwise, the column will be taken to be string. NULL and NA are not recognized by readCSV() in CSV files and will be read as strings. Every line in a CSV file must contain the same number of values (forming a rectangular data table); missing values are not allowed by readCSV() since there is no way to represent them in DataFrame (since Eidos has no equivalent of R’s NA value). Spaces are considered part of a data field and are not trimmed, following the RFC 4180 standard. These choices are an attempt to provide optimal behavior for most clients, but given the lack of any universal standard for CSV files, and the lack of any type information in the CSV format, they will not always work as desired; in such cases, it should be reasonably straightforward to preprocess input files using standard Unix text-processing tools like sed and awk.

bhaller commented 3 years ago

OK, DataFrame is added, readCSV() is added, serialize("csv") is added. Doc for everything is live in SLiMgui if you build that from GitHub, otherwise I can email a new manual PDF to anybody who wants it. Fallout should be minimal. @bodkan it would be great if you could try porting slendr over to this and let me know if there are any additional features you need to make it work smoothly for you; you are the real-world test case for this, at the moment. I didn't do anything equivalent to your filter() since that's trivial, and printing of DataFrame assumes an infinite-width console at the moment; let me know if that presents a problem for your usage of it. Other things are more or less as discussed above. nrow / ncol / dim ended up as properties, no reason for them to be methods. Subsetting of DataFrame is less pretty than in R, since Eidos doesn't presently support operator overloading by classes, so you can't use [] or $, but the subset() / subsetRows() / subsetColumns() methods should work reasonably well (plus getValue() inherited from Dictionary). I added a whole bunch of unit tests for the new stuff, but quite a bit of code got added here, so there might be bugs to be found; let me know if you see anything that seems wrong. Thanks!

bodkan commented 3 years ago

Great! Thanks for the summary.

I'll get the SLiM GitHub dev version compiled later this week and start porting things over to the new DataFrame. I will open a PR in the slendr repo once I get something presentable.

MesserLab / SLiM

ideas for further support of dataframe-like Dictionaries #228