alife-data-standards / alife-data-standards

Repository to host data standards for the ALIFE community.
https://alife-data-standards.github.io/alife-data-standards/
MIT License

Empty ancestor list convention? #19

Open mmore500 opened 1 year ago

mmore500 commented 1 year ago

There is currently no direct guidance on how to represent "origin of life" (i.e., no ancestors) entries in the standard. The toy example represents these as a list with a single none entry: [none]. Should this be adopted as the convention?

Other possibilities would be:

I think the empty list would be the easiest to justify and perhaps the easiest to support from an implementation perspective.

Not sure what existing tools and datasets are assuming. Perhaps we could have a list of allowed possibilities, maybe with one "preferred" and others not preferred or maybe even "deprecated."

This question is somewhat representational on an implementation-to-implementation basis, so may be veering towards out of scope for the standard.

mmore500 commented 1 year ago

On closer inspection, as far as I can tell, denoting a list with brackets ([ and ]) and comma separators isn't explicitly specified on the website.

Adopting or allowing a convention without the brackets where list entries are interspersed with a non-comma separator (perhaps ;) may be advantageous. This would allow asexual phylogenies where each individual has at most one parent to appear directly as a simple integer column (instead of a string column) in data table software, which would greatly speed up working with such files.

Allowing a convention where an integer placeholder (i.e., own id or -1) is used to denote "no ancestor" would also be required to have this show up as an integer column. Otherwise, the empty string entry for lineage-originating organisms would be interpreted as nan and the column would show up as a float column.
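To make the point about column types concrete, here is a rough, stdlib-only sketch. It is a simplified stand-in for the type inference that data table software performs, not the behavior of any particular library, and the helper name infer_column_kind is hypothetical. A column only counts as an integer column if every cell parses as an integer.

```python
def infer_column_kind(cells):
    """Return 'int' if every cell parses as an integer, else 'other'."""
    try:
        for cell in cells:
            int(cell)
    except ValueError:
        return "other"
    return "int"


# Bracketless ancestor column where the root (id 0) lists its own id.
with_placeholder = ["0", "0", "1", "1"]
# Same phylogeny, but the root's ancestor is left blank.
with_blank = ["", "0", "1", "1"]
# Same phylogeny in the bracketed style used by the toy example.
with_brackets = ["[none]", "[0]", "[1]", "[1]"]

for name, column in [
    ("placeholder", with_placeholder),
    ("blank", with_blank),
    ("brackets", with_brackets),
]:
    print(name, infer_column_kind(column))
```

Only the placeholder variant stays a pure integer column under this simplified rule; the blank and bracketed variants both fall back to something else.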

A final thought, tangentially related to these representation issues: it may be worthwhile to cap integer ids at 2^64 - 1 (the largest unsigned 64-bit value) so that C/C++ programs etc. can guarantee they will fit inside standard data type widths.

mmore500 commented 1 year ago

Any resolution of this issue will also have to account for representation in non-tabular formats, e.g., json.

FergusonAJ commented 1 year ago

I definitely think specifying a convention for organisms with no ancestors is within the scope of the standard! This is something I've run into as well. As @mmore500 mentions, the toy example in the phylogeny standard uses [none] while the phylogeny visualization tool expects [-1].

@mmore500 brings up some interesting points on what value should be used. Personally, -1/[-1] or an empty list are the most intuitive to me.

mmore500 commented 1 year ago

For asexual phylogenies, I have found it useful to create a separate ancestor_id column that holds the ancestor’s id (or the taxon's own id if it has no ancestor). It is not very expensive to generate, so getting ancestor_list to boil down to integer values for asexual phylogenies is definitely a less important consideration when weighing trade-offs.
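A minimal sketch of deriving such a column, assuming a bracketed ancestor_list cell like the toy example's and treating an empty list or any spelling of "none" as "no ancestor". The function name to_ancestor_id and the sample rows are hypothetical, not part of the standard.

```python
def to_ancestor_id(own_id, ancestor_list):
    """Map a bracketed ancestor_list cell to a single integer ancestor_id.

    Taxa with no ancestor get their own id, forming a self-loop.
    """
    inner = ancestor_list.strip().strip("[]")
    entries = [e.strip().strip("\"'") for e in inner.split(",") if e.strip()]
    if not entries or entries[0].lower() == "none":
        return own_id
    if len(entries) > 1:
        raise ValueError("more than one ancestor; not an asexual phylogeny")
    return int(entries[0])


rows = [
    {"id": 0, "ancestor_list": "[none]"},
    {"id": 1, "ancestor_list": "[0]"},
    {"id": 2, "ancestor_list": "[0]"},
]
for row in rows:
    row["ancestor_id"] = to_ancestor_id(row["id"], row["ancestor_list"])

# With the self-id convention, roots are the rows where the two columns match.
roots = [row["id"] for row in rows if row["ancestor_id"] == row["id"]]
print(roots)
```

Finding the roots this way takes a comparison of the id and ancestor_id columns rather than a check against a single sentinel value.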

emilydolson commented 1 year ago

We've gone back and forth on this a few times, but I actually really like the idea of a taxon's own id as its ancestor being the indicator that it's the origin of life. The reason we had veered away from using -1 is that it would be possible for someone to legitimately be using -1 as an ID. The challenge with using an empty list is that it could also just mean missing data (as opposed to knowing for sure that this taxon has no ancestor). Self ID is the only ID that we know for sure isn't in use for anything else.

That said, I could see this being unintuitive for new users.

Re: list vs. delimited string - this does seem like we're either going to inconvenience asexual phylogenies or sexual phylogenies, so I think the workaround of creating an additional ancestor_id column might make sense. We could even officially suggest it.

mmore500 commented 1 year ago

The one downside of the self id loop is that to find origin of life organisms then requires comparison of two columns. I don’t think this is too onerous from a

At this point though, does adding more options just make everything more complicated? Maybe just establishing an explicit convention for the null value (i.e., “None” or “none”, JSON uses null) and specifying representational issues a little more tightly could get most of the benefit without adding complexity. Not sure.

As constructed, the standard aims to be agnostic to the file type (csv, json, etc.). In practice, though, most tools and users will probably only build the capability to work with one of them, and they will have to take into account some of the representational questions, especially with respect to CSV: what gets quoted in the ancestor list column? For empty values: [none], [“none”], or “[none]”; for singleton values: [1] or “[1]”; for multiple values, only “[1,2]” will work due to the comma.

The really hard part of this problem is that “”s are added opaquely in most csv output systems (for example, only added if a string column value contains a ‘,’). This means that fine-grained user control to comply with a standard could be difficult if we choose a standard that cuts against the grain of these systems’ assumptions.
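As one concrete case, here is a small sketch of what Python's csv.writer does with its default settings (minimal quoting). The helper csv_cell is hypothetical and only exists to pull out how a single ancestor_list field is rendered.

```python
import csv
import io


def csv_cell(value):
    """Return how csv.writer renders `value` as the second field of a row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([0, value])
    # The first field is the integer 0, so everything after the first comma
    # is the rendered form of `value`.
    return buffer.getvalue().rstrip("\n").split(",", 1)[1]


for cell in ["[none]", '["none"]', "[1]", "[1,2]"]:
    print(cell, "->", csv_cell(cell))
```

With these defaults, [none] and [1] are written bare, [1,2] is wrapped in quotes because of the comma, and ["none"] is wrapped in quotes with its inner quotation marks doubled. The writer makes those choices on its own, which is the opacity described above.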

If we want to lean o

emilydolson commented 1 year ago

I think what we were originally discussing was having the standard be whatever the Python CSV writer would spit out in terms of quotation marks.

I'm pretty sure the standard for original ancestors that we had basically landed on last time we discussed this was ["none"], so maybe it's best if we just stick to that.

Any objection to codifying this in the standards @amlalejini @cliff-bohm ? I think we probably meant to a while ago.

mmore500 commented 1 year ago

In agreement. Would it be helpful for me to open a pull request that drafts some language to add these points?

Matthew

emilydolson commented 1 year ago

It would! Thank you!

The one thing I'll say is that in messing with different standards-compliant files over the last few weeks, I've been realizing that the subtle differences between quotation marks vs. none and the capitalization of None make a pretty big difference. The most important thing to do at this point is probably to just pick one, though.
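One way to see how much the spelling matters, assuming a consumer that reads the ancestor list with a strict JSON parser (an assumption for illustration; the helper name parses_as_json is hypothetical):

```python
import json


def parses_as_json(cell):
    """Return True if `cell` is valid JSON, False otherwise."""
    try:
        json.loads(cell)
    except json.JSONDecodeError:
        return False
    return True


for cell in ['["none"]', "[none]", "[None]", "[null]", "[1,2]"]:
    print(cell, parses_as_json(cell))
```

Under that assumption, ["none"], [null], and [1,2] load cleanly while [none] and [None] are rejected, so two files that look nearly identical to a person can behave very differently in a tool.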