Add `name` and `description` to population metadata

petrelharp commented 3 years ago

msprime v1.0 by default tries to use "name" and "description" properties pulled from population metadata. This is very convenient, as it lets one specify demography using population names instead of IDs. And, it's a good idea. I think we should add these to the population metadata (and am happy to do it sometime before the next release).

bhaller commented 3 years ago

Sure. I guess the "name" for p1 would be "p1", etc.? Are you proposing to add a "description" to SLiM subpops? That is not unreasonable, but would require new Eidos API.

petrelharp commented 3 years ago

Right - I'm just proposing using p1, etcetera, and not using the "description" field.

And: we'd avoid some future headaches if we switched population metadata to using JSON instead of a struct, and set additionalProperties=True in the schema. (I'm not suggesting this for other metadata, for performance, though.)

bhaller commented 3 years ago

Right - I'm just proposing using p1, etcetera, and not using the "description" field.

OK, that seems like a good idea.

And: we'd avoid some future headaches if we switched population metadata to using JSON instead of a struct, and set additionalProperties=True in the schema. (I'm not suggesting this for other metadata, for performance, though.)

Interesting. I guess if we increment the overall file version at the same time, then we know whether we expect the metadata to be JSON or struct, so we can safely read old files. Sounds like a bit of a pain, but do-able. Sure. I think we have kind of an existing example of how to handle this, with MutationMetadataRec versus MutationMetadataRec_PRENUC, although it's not quite the same since JSON is not involved. I guess for this proposed change we would change the name of SubpopulationMetadataRec to SubpopulationMetadataRec_PREJSON or some such, and use that for the backward-compatibility code, and we wouldn't have a SubpopulationMetadataRec at all; we'd have JSON encoding/decoding stuff instead, to/from the bytes in the metadata field. Should be straightforward enough, let me know if you get stuck?

grahamgower commented 3 years ago

Maybe it would be reasonable to have name and description fields in the Subpopulation class, and write those out to the tree sequence metadata? As you say, the name could just default to "p1", etc. if not set by the user.

grahamgower commented 3 years ago

Oh, and if you end up pulling in a JSON parser, I might have future JSON-related feature requests. :)

bhaller commented 3 years ago

Maybe it would be reasonable to have name and description fields in the Subpopulation class, and write those out to the tree sequence metadata? As you say, the name could just default to "p1", etc. if not set by the user.

This is certainly possible; I guess the question is whether the value-add justifies the API complexification. Do you feel that it does?

bhaller commented 3 years ago

Oh, and if you end up pulling in a JSON parser, I might have future JSON-related feature requests. :)

SLiM already has a JSON parser, so JSON-related feature requests are certainly possible. :->

grahamgower commented 3 years ago

This is certainly possible; I guess the question is whether the value-add justifies the API complexification. Do you feel that it does?

Hmm. Ok, the goal is for users to be able to specify the 'name' metadata field for each population that is output in the tree sequence file. Having to specify a numerical id to identify a population in a tree sequence file is a fairly big source of user error (at least, I do it often enough). It also means that downstream tools, like stdpopsim, require users to consider numerical population ids too. And the tskit metadata improvements will completely overcome this problem---assuming that populations are acutally given human-readable names when the tree sequence is initially created.

What the API looks like to do this is less important than being able to do it at all. And if you can do it for the name field, then it makes sense to enable the description field too.

bhaller commented 3 years ago

This is certainly possible; I guess the question is whether the value-add justifies the API complexification. Do you feel that it does?

Hmm. Ok, the goal is for users to be able to specify the 'name' metadata field for each population that is output in the tree sequence file. Having to specify a numerical id to identify a population in a tree sequence file is a fairly big source of user error (at least, I do it often enough). It also means that downstream tools, like stdpopsim, require users to consider numerical population ids too. And the tskit metadata improvements will completely overcome this problem---assuming that populations are acutally given human-readable names when the tree sequence is initially created.

OK, if this is a common source of error/difficulty then it does seem to justify the additional complexification.

What the API looks like to do this is less important than being able to do it at all. And if you can do it for the name field, then it makes sense to enable the description field too.

I'm unconvinced here. Unlike the "name" case, the value-add of this description field is not clear to me; this is not a source of bugs/confusion, right? I don't think SLiM/Eidos necessarily needs to support every optional metadata field that tskit supports.

petrelharp commented 3 years ago

I think we have kind of an existing example of how to handle this, with MutationMetadataRec versus MutationMetadataRec_PRENUC

We'd have to do this sort of thing even if we didn't switch to JSON, and switching to JSON with additional properties allowed would reduce the pain for anything else like this in the future.

grahamgower commented 3 years ago

I'm unconvinced here. Unlike the "name" case, the value-add of this description field is not clear to me; this is not a source of bugs/confusion, right? I don't think SLiM/Eidos necessarily needs to support every optional metadata field that tskit supports.

Sure, description is far less important. But there are only two "blessed" fields as far as I remember: name and description.

bhaller commented 3 years ago

It occurs to me that there is an ambiguity here that might make this difficult to fix comprehensively, and that probably also is an issue for #168. That is that SLiM allows the user to reuse subpopulation IDs. You can create a subpop p1, use it for a while, remove it, and then later create another subpop p1. The two are distinct, but have the same name in SLiM. Probably when SLiM writes them out to a .trees file they get conflated into being the same subpopulation, since both will just get mapped to position 1 in the population table. @petrelharp do you think we ever realized this and did anything about it? I don't think we did. Maybe SLiM just shouldn't allow subpop ids to be reused in this way (it's a little fishy), or maybe we need to somehow break the correspondence between "id of the subpop in SLiM" and "index of the subpop in the population table", such that reusing the name "p1" would get you two different entries in the population table (both, I would note, with the name "p1" in the metadata, if we were to fix this issue).

petrelharp commented 3 years ago

do you think we ever realized this and did anything about it?

We did realize it! And decided it was too big a can of worms to deal with at the time - users would have to interpret "population p1 at time 1000" appropriately.

grahamgower commented 3 years ago

Maybe SLiM just shouldn't allow subpop ids to be reused in this way (it's a little fishy)

I would vote for this. My guess is that someone is more likely to do this by accident than do it deliberately. Unless there's a good reason to allow it?

bhaller commented 3 years ago

With the new multispecies design proposal, we will now have both a name ("p1") and a description (i.e., "catalina") for each subpopulation. At present, the multispecies proposal is to call "catalina" the "name" (while "p1" is the symbol/constant for the subpop, and 1 is the "id" for it), so there's potential for confusion there. Subpopulations will also have a "species", which is also a string. So, what to do with this mess? Hmm.

I propose: put SLiM's subpop "name" (i.e., "catalina") into the "name" metadata if a name has been defined; if no name has been defined, then the "name" metadata would get the symbol ("p1") as a fallback. Either way, the id (1) is already available in the metadata, so if the user wants to reconstruct the "p1" symbol rather than the custom name they assigned, that is trivial. We are then left with the "description" field, and I would propose to stuff the species name into that ("fox", "mouse", or by defaultin a single-species model, "sim"). Does this sound reasonable? Note that this proposal would mean that more than one subpop could have the same name and description; SLiM will not guarantee uniqueness for those. The id is a unique identifier, at a single timepoint, but can be reused as discussed above, unless we decide to outlaw that (which is orthogonal to this issue).

Thoughts? @grahamgower happy to send you the multispecies design proposal if you want to be in the loop on that; my assumption has been that you are not interested in it for now. The plan is to completely preserve backward compatibility, so I don't think it will affect you unless/until stdpopsim takes on multispecies modelling. :->

grahamgower commented 3 years ago

I guess I'm missing some context here about the multispecies thing, but it probably doesn't matter so much. How will a user set the name and description fields? If you're going to put the species name in the description field, I'd suggest just not setting the description if there isn't a species name set. Having an empty description should be fine, right?

bhaller commented 3 years ago

The name field would be a new property on Subpopulation which the user could set to whatever they want. SLiM itself would not use it. Methods would be provided to search for subpopulations by name (potentially returning more than one, if the user has chosen to duplicate names). The default name would be the subpop symbol (e.g., "p1") according to my proposal above.

The description field in the metadata would, I propose, come from the species name in SLiM. This is chosen by the user, and used to refer to the species. SLiM would require it to be unique, and to follow standard variable-name syntax rules, because a global constant would be created to refer to each species, just as sim is created to refer to the one species in a single-species mode; in effect, the multi-species model would just generalize outward from the existing single-species model, and sim would become, say, fox and mouse. So there would always be a species name set, and thus there would always be a description – sim would, in fact, be the species name in a single-species model (unless the user defines it to be something else, which they could now do if they wish; sim is just the default).

petrelharp commented 3 years ago

Reminder: here's what's currently in population metdata:

slim_id, selfing_fraction, female_cloning_fraction,
male_cloning_fraction, sex_ratio,
bounds_x0, bounds_x1, bounds_y0, bounds_y1, bounds_z0, bounds_z1,
migration_records.

I am not actually sure what happens if the slim_id does not match the 'tskit id", i.e., which row it is in the table. I think that the first population in the table (with tskit id 0) becomes p0 even if that does not match the slim_id entry, but I'm not sure. This is strange, but I'd argue it's good, as it leaves room for us to later decouple the SLiM and tskit IDs.

I'd vote to add a new entry, which is species to the metadata, leaving description to carry whatever information the user wants to put in (even if that's unused by SLiM; since description is used by other programs for something else, SLiM shouldn't use it for the species name.

I also vote that SLiM requires the population name to be unique, matching what happens in msprime/demes. The fewer interoperability headaches I have to deal with in pyslim, the better.

grahamgower commented 3 years ago

I'd vote to add a new entry, which is species to the metadata, leaving description to carry whatever information the user wants to put in (even if that's unused by SLiM; since description is used by other programs for something else, SLiM shouldn't use it for the species name.

I also vote that SLiM requires the population name to be unique, matching what happens in msprime/demes. The fewer interoperability headaches I have to deal with in pyslim, the better.

I second both of these suggestions. I can't think of any reasonable use case for using the same name more than once, so probably a user would be morely likely to do this by accident (copy/paste without modifying).

bhaller commented 3 years ago

I also vote that SLiM requires the population name to be unique, matching what happens in msprime/demes. The fewer interoperability headaches I have to deal with in pyslim, the better.

I can certainly do that, but I will point out that the idea of allowing names not to be unique was your idea, @petrelharp! You said that you would want to do, for example, a fox-mouse model with multiple subpopulations of each, and that you would want to say that "these foxes and these mice both live on Catalina Island by giving them both the name 'catalina', so that they can find each other easily in order to run interactions." And that seems like a good idea to me, although certainly it could be done in other ways instead. I guess I feel like the "name" is the user's field, and we shouldn't assume how they want to use it or impose restrictions upon their usage. Is that not the intent of the "name" field? Why is msprime imposing restrictions on it? Why would pyslim care about non-unique names?

grahamgower commented 3 years ago

Enforcing uniqueness of names is important to be able to request a population using the name (instead of an integer id). Right now, if you give me a tree sequence made in SLiM, I can't tell what the integer population ids refer to without reading the code of the SLiM script. With names, this becomes very obvious. When names are unique, we can also use the names in the APIs relating to demography and sampling. E.g. change population size in population XYZ instead of saying population 1; or to ask for 10 haploid samples from population XYZ. Names are both more convenient and less prone to error.

petrelharp commented 3 years ago

Hah, well, how about that. I guess I'm still with peter-of-one-hour-ago rather than peter-of-last week: nonunique names would be useful sometimes, but they'd be less useful for other things, and the deciding factor is that msprime/demes requires them, for good reasons as Graham says.

bhaller commented 3 years ago

Enforcing uniqueness of names is important to be able to request a population using the name (instead of an integer id). Right now, if you give me a tree sequence made in SLiM, I can't tell what the integer population ids refer to without reading the code of the SLiM script. With names, this becomes very obvious. When names are unique, we can also use the names in the APIs relating to demography and sampling. E.g. change population size in population XYZ instead of saying population 1; or to ask for 10 haploid samples from population XYZ. Names are both more convenient and less prone to error.

Fair enough; but getting that increasing utility of uniqueness does mean a sacrifice of other types of utility, like being able to easily find the fox and mouse subpops that both live in 'catalina'. I have spoken a bit imprecisely here; the exact intent I had was to require names to be unique within each species, but to allow names to be the same for subpops in different species. That was precisely to support Peter's proposed use case – indeed, I wasn't even planning to put "names" in SLiM at all until he brought up that use case! :->

So. I'm not sure what the right thing to do here is. I think tskit really ought to clarify this. If the intent is for names to be unique, then tskit should declare that as policy and raise an error if it isn't true, right? Whereas if tskit says that "name" can be whatever people want it to be and there is no guarantee of uniqueness, then SLiM, stdpopsim, etc., should follow that policy, right? This feels like it should be decided on the tskit side.

bhaller commented 3 years ago

Hah, well, how about that. I guess I'm still with peter-of-one-hour-ago rather than peter-of-last week: nonunique names would be useful sometimes, but they'd be less useful for other things, and the deciding factor is that msprime/demes requires them, for good reasons as Graham says.

So is that official tskit policy, then? If so, SLiM should certainly follow it. If not, then it's less clear to me. If tskit policy is that the names don't have to be unique, and msprime/demes just happens to follow that policy for its own purposes, then it feels much less clear to me.

grahamgower commented 3 years ago

I don't think tskit says anything about name or description metadata fields for populations. Although maybe it should? These fields are a convention that have been introduced with Msprime 1.0, and we have the same convention in Demes. Actually, there are additional constraints on the name field in Demes and msprime---they also need to be python identifiers. This more or less means that names should start with a letter or an undersore and be followed by letters, numbers, or underscores. Unicode is also allowed, so you can name things π or whatever, but things like spaces and hyphens are explicitly disallowed. There are some good reasons to resitrict to identifiers, because then the names can be used as keywords args in API functions, which provides a very clean API in Python. Python identifiers are identifiers in almost all other programming languages so this isn't overly restrictive (although it does exclude whacky things like R's variables.with.periods)

petrelharp commented 3 years ago

the exact intent I had was to require names to be unique within each species, but to allow names to be the same for subpops in different species.

Ah, right - that I think is OK, and even a good idea, since different species will live in their own tree sequences.

bhaller commented 3 years ago

the exact intent I had was to require names to be unique within each species, but to allow names to be the same for subpops in different species.

Ah, right - that I think is OK, and even a good idea, since different species will live in their own tree sequences.

Aha, sorry – I guess I created the confusion in this thread by speaking imprecisely, then. OK, I think we are agreed? "name" is unique within species, defaults to "p1" etc. but can be set as the user wishes, gets persisted as the name metadata in the tree sequence. "species" is a SLiM concept that does not get persisted in the tree sequence according to the current design. "description" is a tskit-side property that SLiM could add a property for, if we want to, but it sounded above like you guys don't care about it as much. If "description" did get added it would be completely up to the user, no guarantee of uniqueness there at all.

petrelharp commented 3 years ago

I think I'm agreeing with you - species and description should be added as fields to the population metadata.

bhaller commented 3 years ago

Sure, if we're allowed to add our own arbitrary things in there then we can/should put species name there.

petrelharp commented 3 years ago

Sure, if we're allowed to add our own arbitrary things in there then we can/should put species name there.

We sure are, although if we get in the way of what msprime uses (name) then we'll be causing trouble. msprime does not require a specific metadata schema, it only uses name if it is a property of the population metdata. And when making new tables instead of modifying existing ones, it uses a metadata schema that allows other properties to be set besides name and description, something you can only do with the JSON encoder rather than the struct encoder that we use. So, as mentioned above:

we'd avoid some future headaches if we switched population metadata to using JSON instead of a struct, and set additionalProperties=True in the schema. (I'm not suggesting this for other metadata, for performance, though.)

petrelharp commented 3 years ago

And, if SLiM won't be using description for anything, if we switched to JSON then we could not bother adding it to the metadata schema and people could use it anyways.

bhaller commented 3 years ago

And, if SLiM won't be using description for anything, if we switched to JSON then we could not bother adding it to the metadata schema and people could use it anyways.

Hmm, can you expand on this?

petrelharp commented 3 years ago

Part of the metadata schema is telling tskit how to decode the metadata, which is called the codec. The two supported codecs are struct (which we are using) and json, which means just that the metadata is a string of JSON text. When using the json codec, we can say "here are some properties of metadata entries, but entries could have other properties that we haven't specified here". The upshot for SLiM is that the population metadata would be JSON text instead of this strict binary format.

bhaller commented 3 years ago

OK, got that part. The part I didn't get was:

if we switched to JSON then we could not bother adding it to the metadata schema and people could use it anyways

Are you saying that we wouldn't need to add a description property to Subpopulation, then? That people would just set a description on the Python side if they wanted one?

petrelharp commented 3 years ago

Are you saying that we wouldn't need to add a description property to Subpopulation, then? That people would just set a description on the Python side if they wanted one?

Exactly.

bhaller commented 2 years ago

OK, I just reread all of the above. As I understand it, here's what we settled on:

convert subpop metadata to JSON and increment the version number to match
add a "name" property to Subpopulation in SLiM, which will default to "p1" etc., but which is modifiable
require "name" to be unique within a species (i.e., model-wide, for now), including back in time
write the "name" property into the JSON metadata under key "name"
do NOT add "description", as either a property or metadata, for now since it sounds like that's not needed (?)
"species" does not get added to the metadata until I implement multispecies (if then)
I'm inclined not to enforce the "Python identifier" constraint on "name"; if people want SLiM to play nice with msprime/Demes they can follow that constraint, but it doesn't seem like SLiM needs to require it (?)

The only tricky bit is the possibility of p1 getting used more than once. Since name is generated from p1 etc. by default, that would mean than one would have the potential for non-unique names. I think I can finesse that, though. I can keep a record of all names that have been used in the past, and if the user makes a second p1 after the original p1 has ceased to exist, I can notice the name collision and make the default name for the second p1 be something like p1_2 and count upwards from there as needed. A little ugly, but (a) it doesn't break backward compatibility, and (b) the user can always explicitly set a name if they want to – or change their code to not reuse p1. So I think that policy should solve the problem.

So, there are two items above with (?); if either of you disagrees on one of those, please reply here.

The relevant code is in SLiMSim::WritePopulationTable(). I have lots of questions about how that would get changed:

We generate empty entries for unused subpop ids; but we seem to get the metadata for them from tsk_population_table_get_row(). What if that existing metadata is not in JSON? Do we need to get into translating from struct to JSON somehow?? And what if that existing metadata doesn't have a name in it; do we have to generate one? And what if it has a name, but the name there conflicts with one of the name values that we plan to write out ourselves; do we have to modify it to be unique? This seems like a big mess, potentially.
For the "real" rows, we make a SubpopulationMetadataRec and pass a pointer to it, and a length, to tsk_population_table_add_row(). What does that look like if we want to write out JSON instead? Do we pass a pointer to a C string contianing the JSON, and the length of that C string? Is the C string NULL-terminated, or is the NULL omitted because the length suffices?
Can you write the schema string that ought to be used for this? And where does that schema string get declared/defined/set?
And then WritePopulationTable() writes out empty entries for the rest of the table. These presently get metadata of NULL; but a comment in the code notes that "there could be ancestral nodes that reference them", so it seems like they need to get a proper name value instead – which seems like it kinda means that we need to write out a full metadata record for them, no? And that's a big can of worms, because those subpops don't exist any more, so we don't have the information about them that we would normally have – migration, spatial bounds, selfing/cloning rates, etc. Can we skip writing out metadata for these subpops – i.e., stick to NULL? Or does every subpop in the pop table now need to have a valid/unique name in its metadata, for this to work?

So, this is very far from simple. If all the above questions/issues can be clarified, I can give it a try; but I'll definitely need substantial attention from you on it @petrelharp. Better, if you understand all of the issues involved here, would be if you were to tackle implementing this. If you like, I can do the more SLiM-side work of adding the name property to Subpopulation, auto-generating the values for that property from p1 etc., and guaranteeing that the names that SLiM uses are unique. I can probably do that today if you like. :-> But the rest of it is pretty opaque and feels like a big can of worms.

bhaller commented 2 years ago

See #173 for a related issue; both should be fixed at the same time.

grahamgower commented 2 years ago

OK, I just reread all of the above. As I understand it, here's what we settled on:

convert subpop metadata to JSON and increment the version number to match

add a "name" property to Subpopulation in SLiM, which will default to "p1" etc., but which is modifiable

require "name" to be unique within a species (i.e., model-wide, for now), including back in time

write the "name" property into the JSON metadata under key "name"

do NOT add "description", as either a property or metadata, for now since it sounds like that's not needed (?)

Presumably this could be added at a later date if desired?

"species" does not get added to the metadata until I implement multispecies (if then)

I'm inclined not to enforce the "Python identifier" constraint on "name"; if people want SLiM to play nice with msprime/Demes they can follow that constraint, but it doesn't seem like SLiM needs to require it (?)

Sure, this isn't essential nor will it be enforced in tskit. Maybe the docs could suggest avoiding spaces and hyphens though? E.g. for maximum interoperability names should start with a letter and be followed by zero or more letters numbers or underscores.

The only tricky bit is the possibility of p1 getting used more than once. Since name is generated from p1 etc. by default, that would mean than one would have the potential for non-unique names. I think I can finesse that, though. I can keep a record of all names that have been used in the past, and if the user makes a second p1 after the original p1 has ceased to exist, I can notice the name collision and make the default name for the second p1 be something like p1_2 and count upwards from there as needed. A little ugly, but (a) it doesn't break backward compatibility, and (b) the user can always explicitly set a name if they want to – or change their code to not reuse p1. So I think that policy should solve the problem.

Sounds reasonable.

So, there are two items above with (?); if either of you disagrees on one of those, please reply here.

The relevant code is in SLiMSim::WritePopulationTable(). I have lots of questions about how that would get changed:

We generate empty entries for unused subpop ids; but we seem to get the metadata for them from tsk_population_table_get_row(). What if that existing metadata is not in JSON? Do we need to get into translating from struct to JSON somehow?? And what if that existing metadata doesn't have a name in it; do we have to generate one? And what if it has a name, but the name there conflicts with one of the name values that we plan to write out ourselves; do we have to modify it to be unique? This seems like a big mess, potentially.

Probably just raise an error for non-json metadata for now. I guess support could be added later if someone really wants it?

Msprime is using names of the form pop_<j> for population with index j (for when the user doesn't specify names). I think generally there will be a name for all populations, or no names at all (e.g. from old msprime versions). So having name conflicts in SLiM after reading in a tree sequence will most likely arise with tree sequences previously generated by SLiM. I can see this arising in valid code, but also arising when the user misspecified something, and I'm not sure which will be more common. Maybe just raising errors in these corner cases makes the most sense?

For the "real" rows, we make a SubpopulationMetadataRec and pass a pointer to it, and a length, to tsk_population_table_add_row(). What does that look like if we want to write out JSON instead? Do we pass a pointer to a C string contianing the JSON, and the length of that C string? Is the C string NULL-terminated, or is the NULL omitted because the length suffices?

I think NULL-terminating a string is always a good idea, even when there is a length field.

Can you write the schema string that ought to be used for this? And where does that schema string get declared/defined/set?

The msprime schema is here (you could just delete the "description"): https://github.com/tskit-dev/msprime/blob/36139f25f0bb9c399b3b162210483354a4e0afd9/msprime/demography.py#L1041-L1055

Sorry, I haven't really used the tskit C API, so can't be of further help.

bhaller commented 2 years ago

Hi @grahamgower,

Generally all sounds good. A few specific replies; agreed on all points not replied to:

We generate empty entries for unused subpop ids; but we seem to get the metadata for them from tsk_population_table_get_row(). What if that existing metadata is not in JSON? Do we need to get into translating from struct to JSON somehow?? And what if that existing metadata doesn't have a name in it; do we have to generate one? And what if it has a name, but the name there conflicts with one of the name values that we plan to write out ourselves; do we have to modify it to be unique? This seems like a big mess, potentially.

Probably just raise an error for non-json metadata for now. I guess support could be added later if someone really wants it?

Yes; but how does that code even tell whether the existing metadata is JSON or not? All it has is a pointer to metadata and a length. Does it need to consult the schema for the metadata for that table, or something? This is more @petrelharp's area, I don't really grok this metadata stuff very well. If we're going to raise an error when the incoming metadata is not JSON, we shouldn't do it in that spot in the code, we should do it when we first load in the .trees file. And maybe we should be checking that metadata on other tables is in the format we expect, too? @petrelharp would be good to have guidance on this.

...Maybe just raising errors in these corner cases makes the most sense?

Yeah, that's a good point; I should be thinking less in terms of trying to correct conflicts in the names, etc., and more in terms of just raising errors. If the user is creating conflicting names, that's something they can and should fix.

I think NULL-terminating a string is always a good idea, even when there is a length field.

This is one where I bet tskit has a policy, and we ought to follow it to avoid surprises/inconsistencies. :->

The msprime schema is here...

Aha, that is helpful, thanks!

bhaller commented 2 years ago

Note to self: when this gets fixed, revise the metadata description in the SLiM manual.

petrelharp commented 2 years ago

I know there's more questions up there, but this is the one I've currently got the brainpower for:

Yes; but how does that code even tell whether the existing metadata is JSON or not? All it has is a pointer to metadata and a length. Does it need to consult the schema for the metadata for that table, or something? This is more @petrelharp's area, I don't really grok this metadata stuff very well. If we're going to raise an error when the incoming metadata is not JSON, we shouldn't do it in that spot in the code, we should do it when we first load in the .trees file. And maybe we should be checking that metadata on other tables is in the format we expect, too? @petrelharp would be good to have guidance on this.

When reading a file we can ask whether to expect JSON by parsing the individual table's metadata schema (which is itself JSON) and seeing if the "codec" entry says "json".

Also, in case it's helpful, about writing out the metadata - the C API doesn't mess around with decoding/verifying metadata - the metadata column in the population table is a "ragged" column, so you can shove whatever stuff of varying lengths in there you want. When the tree sequence is read by python you'll get errors when accessing metadata if (a) you said to expect json in the metadata schema and (b) whatever is in there isn't valid json. But tsk_population_table_add_row( ) does not. So, tsk_population_table_add_row( ) will only read the first metadata_length bytes of metadata; if your string is null-terminated then I think the length you want is the length that excludes that null at the end.

bhaller commented 2 years ago

@hyanwong notes:

Also, when recapitating the resulting TS, I then get

python3.9/site-packages/msprime/demography.py:1097: IncompletePopulationMetadataWarning: The metadata schema does not have a 'name' property; population names will not be recorded in the output tree sequence

So the lack of a name is now logging on the Python side. One more reason to fix this!

bhaller commented 2 years ago

Been pondering this today. Sorry for the long post; this issue has gotten really long and complicated.

Most of my puzzlement at this point surrounds the metadata stuff. Right now, in SLiMSim::_InitializePopulationFromTskitBinaryFile() we call tsk_table_collection_load() to read in the .trees file, and then we get busy creating SLiM objects to correspond to the tree sequence we loaded. As far as I can tell, we just inherit whatever metadata schemas happen to be in the tree sequence we load; I don't see any checks that they are what we expect them to be. When we try to create SLiM objects we parse the metadata in some of the tables to get info like mutation selection coefficients, etc., but the only check we do on that metadata is that the length makes sense to us – that it's the right number of bytes. Then we start running the model forward, which means that we're writing new entries into the tables with associated metadata, and we write that in our own SLiM metadata formats of course – again without ever checking that the metadata schema for the table actually matches our own. Finally, in SLiMSim::WriteTreeSequenceMetadata(), we simply set the metadata schema on each table in the table collection to our own schema, replacing whatever might be there previously – even though there might be some metadata, in the tree sequence we originally loaded, that is actually in a different schema, which we never actually checked for or did anything about.

I guess we can get away with this kind of fast-and-loose behavior because we require that the .trees file is in SLiM format; we check for a SLiM file version number in the top-level metadata, and if it's not there we raise an error, and if it is there we take it as a contract that things are the way they're supposed to be for that file version. So we're not in the business of reading arbitrary .trees files, from msprime or whatever; we only read ones in our format, and that comes with a guarantee that metadata is in the format we expect, and if it isn't, that's Somebody Else's Problem.

So I've been wondering: if the population metadata schema says the metadata is JSON, rather than binary, do we need to validate the schema, etc.? What do we do if it has extra keys we don't know? Those keys won't be described by our schema, so when we replace the schema with our own at write time, those keys will then be non-conformant with the schema as written, as I understand it. What do we do if the schema read in is missing keys we expect it to have? Do we need to go through the schema and confirm that it lists all the keys that we consider "required"? What about keys we consider to be "optional"? What if it lists a key as "required" that we consider "optional", or vice versa? Etc.

Maybe this is all Somebody Else's Problem – i.e., pyslim's problem. Maybe, just as with binary metadata, the SLiM file version is a contract, and we can just assume that that contract is fulfilled and go on our merry way. But that defeats the purpose of JSON metadata, doesn't it? The whole point is that it should be flexible – different software writes different keys and uses different schemas, etc., right? But if we're going to play nice in that way, then the SLiM file version is no longer a contract – anything goes, and we have to worry about working with schemas other than our own, validating that they have what we need, carrying over extra stuff faithfully, creating a new schema that merges the keys from the loaded file with the keys that we provide, ensuring that there are no conflicts in that merge process, thinking about required vs. optional keys and how they get merged, and so on. It seems very complicated. Does tskit provide any help with this, or is this logic all on the client? What does other software, like msprime, do about these schema issues?

But OK, maybe I'm overcomplicating things, and for now the proposal is that the SLiM file version is still a contract, and the metadata schema for all the tables we're interested in still has to be an exact match. If that's true, then I think I understand how to proceed. I think I'd start by adding explicit schema checks, at .trees load time, for exact matches to our schema strings, so that we guarantee conformance instead of just assuming it; assuming it makes me nervous, feels like bugs just waiting to happen. With checks for exact schema matches, everything downstream can just assume exact conformance, which makes everything simple.

Even if that's the plan, though, it seems like this will create some difficult problems for pyslim, and I wonder whether there's a clear plan for resolving them. If somebody wants to convert an msprime tree sequence to SLiM format, the input ts from msprime has name and description, but according to the plan above, at least for now we may not add description on the SLiM side. So... description has to be removed, because the schema for the SLiM tree sequence has to be exactly conformant? Even optional keys are supposed to be declared in the schema, right? I mean, you're not supposed to just leave description in the JSON metadata but remove the description key from the schema, right? What do you plan to do about that sort of thing?

All these kinds of considerations just have my head spinning and I don't know where to start because I don't have a clear vision of what I'm implementing.

grahamgower commented 2 years ago

this issue has gotten really long and complicated

:(

I think your plan of being very restrictive is probably a sensible first step. This of course assumes that it's easy enough to relax a given restriction in the future as use cases arise.

hyanwong commented 2 years ago

I suspect @benjeffery would be best placed to comment here, as he implemented most of the metadata stuff. But I think that for JSON schemas (but not for struct ones), you can have optional keys that are not declared in the schema.

It sounds like tskit should provide a function that checks a schema for compatibility. That is, if there is a SLiM json schema for each table, it should be possible to check that the loaded tree sequence has schemas for each table which are compatible with the relevant SLiM schema[*], in that the loaded schema will allow SLiM structured data to be written to it. Perhaps the allOf functionality of JSON schema would be useful here? If the loaded schema is not compatible then, SLiM could fail immediately, or perhaps issue a warning that the metadata will be ignored. Then SLiM could output a tree sequence with the same schema as the one that was read in, but SLiM stuff could still be added to the metadata, although it might not be documented in the schema.

The alternative is for there to be a tskit function to merge compatible schemas, although this sounds complicated to me. As I said, @benjeffery 's the expert here though, and I might be missing something vital.

[*] and perhaps allow loading tree sequences with no schema and no metadata

bhaller commented 2 years ago

I think your plan of being very restrictive is probably a sensible first step. This of course assumes that it's easy enough to relax a given restriction in the future as use cases arise.

Yes, I'm thinking that for now that I ought to implement rigid conformance to the expected schema based upon the file version, and if anybody has a problem with that they can squawk and then we will think about how to relax that requirement appropriately. It seems like it pretty much removes all the intended benefits of JSON schemas, to do this, but the issues involved with allowing flexibility in the JSON schemas feel rather daunting and unknown right now, and I just want to get this (deceptively simple!) issue fixed for 3.7 without getting bogged down in minutiae that, probably, nobody actually even cares about right now. :-> So I'll do that, and produce a PR for it, and then if changes are needed to that PR we'll tackle that next. :->

petrelharp commented 2 years ago

Small note:

make the default name for the second p1 be something like p1_2 and count upwards from there

How about p1_<generation>. That could always be the default, in fact, no need to check for uniqueness, if you don't want to make names unique within SLiM. (However: if you are decoupling the index in "p1" from the index in the population table, there's a can of worms related to reading files back in to SLiM.)

petrelharp commented 2 years ago

Re: metadata schemas. Good point! Hm. Thoughts:

All this applies only to population-level metadata; we're keeping everything else the same. We could check if the existing metadata schema in the tree sequence is compatible with SLiM's (with a metaschema, for instance? or just check for equality-of-dicts?). Changing schema in different versions makes this a bit tricky, so I'm inclined to not do this check.

For population metadata, we could do a very simple metadata schema - it could say

"this is json; anything goes" or
"this is json; it must (be null or) have X Y and Z but then anything else goes".

Note that (1) is how top-level metadata works.

In SLiM we're not using the metadata schema for anything, so it kinda doesn't matter. We just try to read X Y and Z from population metadata, and throw an error if we can't. The downside of (1) is that maybe it's easier for people to set up a tree sequence that SLiM can't read by forgetting required things. I can't really think of any downsides of (2), as long as we don't require that the metadata schema says that X Y and Z are required, only that they're actually present when we need them.

So - my proposal is that SLiM would output population metadata as json; the metadata schema would be sensible for what we output (option (2) above), but we don't worry about validating metadata schemas, we just check whether the stuff we're looking for is there and error if it isn't.

petrelharp commented 2 years ago

Ah, right - and what about merging schemas? It wouldn't be too hard to union schemas (you've just got to union the properties properties), but we could also just overwrite with our own, erroring if existing metadata is not encoded as json, knowing that any other json set-up would be legal under our metadata schema.

bhaller commented 2 years ago

Small note: ...

Actually: in the course of fixing another of these tree-seq issues recently, we now force subpops not to be re-used when tree-seq is enabled. You can't use p1, kill p1, and then make a new p1, if you are using tree-seq. So this is moot, I think.

I do wonder whether I should just broaden that to be a general policy, for consistency.

MesserLab / SLiM

Add `name` and `description` to population metadata #169