Consider a balance between R-like NA values and -1 values in slendr tables

bodkan commented 2 years ago

Pointed out here by @bhaller.

There are two sets of tables in slendr:

1. Heavily annotated outputs of ts_nodes() and ts_edges() -- tables which contain lots of spatial data, custom slendr time units, symbolic names of individuals, etc. These are intended for R data analysis, so using NAs to indicate missing data (e.g. missing ancestors) makes sense. The user doing data analysis can use the same analysis patterns that apply to any other missing-data problem in R: is.na(), na.omit(), complete.cases(), etc.
2. R-friendly access functions to "raw" unannotated tree-sequence tables (node/individual/edge/mutation/site tables). These are read-only copies of the tables contained in the tskit Python object. To be honest, I think these are quite useless because there's very little reason to use them for meaningful data analysis -- the outputs of ts_nodes() and ts_edges() contain the same information, and much more on top of that. Still, I allow access to them via ts_table(ts, "nodes|edges|individuals|...") just in case, because this can sometimes be useful for debugging.
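To make the NA-based patterns mentioned for the first group concrete, here is a minimal sketch on a toy data frame. The data frame and its column names (node_id, name, ind_id, time) are assumptions made up for illustration, not the exact schema returned by ts_nodes().

```r
# Toy stand-in for an annotated nodes table; the column names are
# illustrative assumptions, not the real ts_nodes() schema.
nodes <- data.frame(
  node_id = 0:4,
  name    = c("ind_1", "ind_1", "ind_2", NA, NA),
  ind_id  = c(0L, 0L, 1L, NA, NA),
  time    = c(0, 0, 0, 120, 350)
)

# Rows for nodes that are not attached to a named individual
unsampled <- nodes[is.na(nodes$ind_id), ]

# Keep only rows without any missing value
sampled <- na.omit(nodes)

# The same filter expressed as a logical mask
mask <- complete.cases(nodes)
```

With NA as the missing-value marker, none of these calls needs to know which columns can be missing or what sentinel they would otherwise use.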

Given that the tables in the second group are supposed to give an idea about the low-level information in the tree-sequence, it does make sense to report missing data as -1 values, despite the inconvenience (or potentially even nasty surprise -- which is how I discovered this when some NA-based join operations failed) for data analysis.
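As a rough illustration of how mixing the two conventions can bite, the sketch below builds two toy tables by hand. Both tables and their column names are hypothetical, and this is only one way such a mismatch could show up, not necessarily the failure I originally ran into.

```r
# Hypothetical "raw" table: a missing individual is encoded as -1
raw <- data.frame(node_id = 0:3, individual = c(0L, 0L, 1L, -1L))

# Hypothetical annotated table: the same missing value is encoded as NA
annotated <- data.frame(node_id = 0:3, ind_id = c(0L, 0L, 1L, NA))

# An NA-based check finds nothing missing in the raw table
n_na        <- sum(is.na(raw$individual))   # 0
n_minus_one <- sum(raw$individual == -1L)   # 1

# Joining on the ID column silently drops the row where NA meets -1
joined <- merge(
  annotated, raw,
  by.x = c("node_id", "ind_id"),
  by.y = c("node_id", "individual")
)
```

Here joined contains only the rows whose IDs agree in both tables, so the node with the missing individual disappears without any warning.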

NA-based missingness is used heavily throughout the slendr codebase, though, so a reasonable compromise is to keep NA everywhere else and convert NA to -1 values only on access via ts_table(ts, "nodes|edges|individuals|...").
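A conversion at the access boundary could look roughly like the sketch below. The helpers na_to_minus_one() and minus_one_to_na() are hypothetical names invented here, not existing slendr functions, and the edges table is a toy example.

```r
# Hypothetical helper (not part of slendr): replace NA with -1 in the
# given ID columns, e.g. just before returning a "raw" table.
na_to_minus_one <- function(tbl, cols) {
  for (col in cols) {
    tbl[[col]][is.na(tbl[[col]])] <- -1L
  }
  tbl
}

# Hypothetical inverse: turn -1 back into NA for R-style analysis.
# %in% is used so that existing NA values do not end up in the index.
minus_one_to_na <- function(tbl, cols) {
  for (col in cols) {
    tbl[[col]][tbl[[col]] %in% -1L] <- NA
  }
  tbl
}

# Toy table with two missing parents
edges <- data.frame(child = 0:2, parent = c(3L, NA, NA))
raw_like <- na_to_minus_one(edges, "parent")
r_like   <- minus_one_to_na(raw_like, "parent")
```

Restricting the conversion to explicitly named columns avoids touching columns where -1 could be a legitimate value.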

bodkan commented 2 years ago

Also, while I'm at it, I should clarify the purpose of ts_nodes(ts) & ts_edges(ts) vs ts_table(ts, "nodes|edges|individuals|..."). The former is intended for data analysis and visualization in R, the latter is really only useful for poking into low-level tree-sequence details, probably only for debugging purposes (definitely not useful for data analysis).

ts_nodes() and ts_edges() are the only functions prominently described in the paper and tutorials (I don't think the existence of ts_table(...) is acknowledged anywhere but the reference list of functions). Still, the way things are now, people who are familiar with tskit might be confused by the distinction between them.

bodkan commented 6 months ago

This turned out not to be a real problem in practice. Plus, the man pages of the metadata-annotated and "raw" table-extraction functions have been referring to each other for a while now, so tskit users can learn how they differ.

bodkan / slendr #115