bodkan / slendr

Population genetic simulations in R 🌍
https://bodkan.net/slendr
Other
54 stars 5 forks source link

Population names need to be valid python identifiers, but `population()` does not require this #133

Closed IsabelMarleen closed 1 year ago

IsabelMarleen commented 1 year ago

The msprime backend requires that all population names are valid python identifiers (only letters, digits and underscores), but slendr does not check for this. It also does not seem to be included in the documentation. Only at the msprime call an error is thrown. I suggest adding this restraint to the documentation and including an error in the population function for names that do not meet the requirements.

Minimal example:

library(slendr)
init_env()
anc <- population("L.africana", time = 9e6, remove=8e6, N=100)
model <- compile_model(populations = list(anc), generation_time=1)
msprime(model, sequence_length = 10, recombination_rate = 0)

The last line produces the error:

ValueError: A population name must be a valid Python identifier

bodkan commented 1 year ago

Wow, I had no idea about this. I thought the names are simply normal strings? I wonder why is that?

In any case, thank you for the report! I'll get this sorted right away. I guess I just have to figure out the correct regex for this...

bodkan commented 1 year ago

It's a bit of a shame though. I want proper names like "L. africana" be valid.

We could internally mangle the names just so that msprime doesn't complain, keeping the real names as metadata stored with the tree sequence. That's how it works with SLiM (no problem there).

That would require tracking another column in the population/gene flow/resize/interaction/etc. tables. I'd have to think about this more before I start making those changes.

@IsabelMarleen: For the time being, and for your project, please name the elephant species as "L_africana", "Lafr", or anything that clearly identifies them.

bhaller commented 1 year ago

Hi @bodkan. For what it's worth, SLiM's policy at present is expressed in this comment, in Subpopulation::SetName():

    // we suggest in the doc that subpopulation names ought to be Python identifiers, for maximum interoperability, but
    // we do not enforce that here since tskit itself does not require this; this is more of an msprime/Demes thing

But since you have an msprime back end, you're more tightly integrated with it than we are. I think they do have good reason for this restriction on their end, since (IIRC) it allows users to refer to subpops directly in Python code using the symbolic name that has been set. Somebody, probably @petrelharp, explained it to me back when I encountered this issue. But it does seem like you could just suggest this restriction, not enforce it (for users that only use the SLiM back end anyway); or warn, but not error, if the user does it; or mangle non-compliant names if necessary. It is indeed nice to be able to have names like "L. africana"; but then again, it'd be nice sometimes to be able to have variable names like that when programming, sometimes, and we've all been conditioned to accept that we can't! :->

bodkan commented 1 year ago

Oh yeah, 100% understand the situation with Python + msprime.

After all, even slendr populations are stored in normal R variables (which follow almost the same restrictions as Python, except that R allows the weird oh.hello.I.am.a.proper.R.variable.how.cool.is.that.eh :)). And, in every slendr code I've written so far, the variable names do fit the "symbolic name" anyway.

Thanks for your thoughts on this! I'll try to think up a reasonably practical solution.

I like the idea of issuing a warning! Although something tells me that "just write a regex for what is a valid Python identifier" might be a hard problem, if one wants to really cover every possible scenario. Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. and all that. :)

bhaller commented 1 year ago

Oh, I'm sure you can just look up the correct regex in Python's parser code. (Now you have three problems! :->)

bodkan commented 1 year ago

To avoid similar issues, any slendr population name now must also pass as a valid Python identifier.

Not entirely sure if the implemented solution captures 100% of valid (and invalid) cases but it seems to cover the most obvious cases.

As an example, all of these are still valid slendr population names because they are also valid Python identifiers:

list(
  "valid_identifier",
  "ValidIdentifier",
  "_another_valid1",
  "identifierWithÜmlaut",
  "αλφαβητικός",
  "متغير_عربي",
  "변수_한글"
)

These are now considered invalid (caught already by a population() call, so the msprime() call doesn't crash anymore):

list(
  "1invalid_identifier",
  "identifier-with-hyphen",
  "αλφα-βητικός",
  "3متغير_عربي",
  "123변수_한글"
)
bhaller commented 1 year ago

I decided to follow SLiM's example and require that a slendr population name must also be a valid Python identifier.

Hi @bodkan. This decision seems fine to me, but you're not following SLiM's example, just to be clear. See what I wrote above; in SLiM it is just a suggestion, and is not enforced (I think :->).

bodkan commented 1 year ago

Apologies for the sloppy update message! I edited it now, removing the implication that SLiM does something similar.