bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr

Consider a balance between R-like NA values and -1 values in slendr tables #115

Closed bodkan closed 6 months ago

bodkan commented 2 years ago

Pointed out here by @bhaller.

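There are two sets of tables in slendr: those returned by ts_nodes(ts) and ts_edges(ts), and those returned by ts_table(ts, "nodes|edges|individuals|...").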

Given that the tables in the second group are supposed to give an idea about the low-level information in the tree sequence, it does make sense to report missing data as -1 values, despite the inconvenience for data analysis (or potentially even a nasty surprise, which is how I discovered this when some NA-based join operations failed).
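To illustrate why -1 values can surprise NA-based R code, here is a minimal base-R sketch. The data frame and its column names (`node_id`, `ind_id`) are made up for illustration and are not actual slendr output; the point is only that -1 is an ordinary integer, so NA-aware checks do not see it.

```r
# Toy stand-in for a low-level table; hypothetical columns, not slendr output.
nodes <- data.frame(
  node_id = 0:3,
  ind_id  = c(0L, 0L, -1L, -1L)  # -1 marks "no individual", tskit-style
)

# An NA-based check finds nothing, because -1 is a regular integer value.
n_missing_na <- sum(is.na(nodes$ind_id))

# The sentinel has to be tested for explicitly.
n_missing_sentinel <- sum(nodes$ind_id == -1L)
```

Any code that filters or joins on the assumption that missing values are NA would treat the last two rows as if they carried a real identifier.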

The NA-missingness is used heavily throughout the slendr codebase though -- a reasonable compromise is to convert NA to -1 values on access to ts_table(ts, "nodes|edges|individuals|...").
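A rough sketch of what such a conversion on access could look like, using base R only. The helper `na_to_sentinel()` and the example columns are hypothetical and not part of slendr; this only shows the idea of keeping NA internally and swapping in -1 at the boundary.

```r
# Hypothetical helper (not a slendr function): replace NA with a sentinel
# value in the named columns, leaving every other column untouched.
na_to_sentinel <- function(df, cols, sentinel = -1L) {
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- sentinel
  }
  df
}

# Toy table that uses NA for missing values, as R-side code would.
annotated <- data.frame(
  node_id = 0:3,
  ind_id  = c(0L, 0L, NA, NA)
)

# Convert only when handing out the "raw" view.
raw_like <- na_to_sentinel(annotated, "ind_id")
```

The original `annotated` table is left unchanged, so NA-based logic elsewhere would keep working while the raw view follows the -1 convention.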

bodkan commented 2 years ago

Also, while I'm at it, I should clarify the purpose of ts_nodes(ts) & ts_edges(ts) vs ts_table(ts, "nodes|edges|individuals|..."). The former is intended for data analysis and visualization in R, the latter is really only useful for poking into low-level tree-sequence details, probably only for debugging purposes (definitely not useful for data analysis).

ts_nodes() and ts_edges() are the only functions prominently described in the paper and tutorials (I don't think the existence of ts_table(...) is acknowledged anywhere but the reference list of functions). Still, the way things are now, people who are familiar with tskit might be confused by the distinction between them.

bodkan commented 6 months ago

It turned out that this isn't really a problem in practice. Plus, the man pages of the metadata-annotated and “raw” table-extraction functions have been referring to each other for a while now, so tskit users can learn how they differ.