Request: Explain populations!(::Vector) better #92

BioJulia / PopGen.jl

Population Genetics in Julia

https://biojulia.github.io/PopGen.jl/

MIT License

Closed jakobnissen closed 3 years ago

jakobnissen commented 3 years ago

It's quite unclear what it actually does. Currently, it says "Vector of new unique population names in the order that they appear in the PopData.meta", which all of my students took to mean that the input should be a vector of the same length as the dataframe, such that the n'th entry in the vector became the name of the n'th sample in the dataframe.

Two suggestions for making it better:

Do not rename missing values implicitly
Be more explicit about precisely what it does

Thanks for otherwise great docs!

pdimens commented 3 years ago

Thanks for asking for clarification, I'm always interested in improving the docs! It's too bad 0.7.0 isn't ready in time for your students 😕.

The method in question can be thought of as replacing the pool of pop names. This isn't the best method because it's contingent on meta being sorted, which isn't always the case. So, if unique(pdata.meta.population) gives you, say, 3 elements, this method has you input a vector of length 3 to replace those pops. I'm on my phone, so I can't check it fully, but I believe internally the vector is used to create a dictionary of old => new and run the dictionary method, which is the preferred one. The docs will be amended for 0.7.0 to reflect this. I'll keep this issue open until the release.
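To make the "pool of pop names" idea concrete, here is a minimal sketch in plain Julia. It does not use PopGen.jl: the `populations` vector stands in for the population column of PopData.meta, and `rename_unique` is a hypothetical helper written only for illustration. It assumes the behaviour described above, where the unique names are paired with the new names in order of first appearance.

```julia
# Hypothetical stand-in for the population column of PopData.meta (not PopGen.jl code)
populations = ["north", "north", "south", "east", "south"]

# Hypothetical helper: pair each unique old name with a new name, then rename every sample
function rename_unique(pops::AbstractVector, newnames::AbstractVector)
    old = unique(pops)  # unique names, in order of first appearance
    length(old) == length(newnames) ||
        throw(ArgumentError("expected $(length(old)) names, got $(length(newnames))"))
    mapping = Dict(zip(old, newnames))  # old => new
    return [mapping[p] for p in pops]
end

renamed = rename_unique(populations, ["A", "B", "C"])
# renamed == ["A", "A", "B", "C", "B"]
```

Note that the input has 3 entries (one per unique population), not 5 (one per sample), and that the result depends on the order in which populations first appear.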

If your students find anything else, please open more issues!

pdimens commented 3 years ago

Not that it's helpful now, but 0.7.0 checks the length of the input vector for that method and decides whether to replace the unique values (like I explain above) or the values per sample (as you've described). I'll make sure to exhaustively document that.
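The length check described here could look roughly like the sketch below. The thread does not show how 0.7.0 actually implements it, so the function name `rename_by_length`, the order of the checks, and the error message are all assumptions, and the code works on a plain vector rather than a PopData object.

```julia
# Hypothetical helper (not PopGen.jl code): choose the renaming mode from the input length
function rename_by_length(pops::AbstractVector, newnames::AbstractVector)
    old = unique(pops)
    if length(newnames) == length(pops)
        # one name per sample: the n'th entry becomes the population of the n'th sample
        return collect(newnames)
    elseif length(newnames) == length(old)
        # one name per unique population: build old => new and rename through it
        mapping = Dict(zip(old, newnames))
        return [mapping[p] for p in pops]
    else
        throw(ArgumentError("input length must match the number of samples or of unique populations"))
    end
end

pops = ["north", "north", "south"]
per_unique = rename_by_length(pops, ["A", "B"])       # ["A", "A", "B"]
per_sample = rename_by_length(pops, ["X", "Y", "Z"])  # ["X", "Y", "Z"]
```

When every sample is its own population, both lengths coincide, but the two modes then give the same result, so the check stays unambiguous in this sketch.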

To replace or ignore missing values, the recommended way is populations!(PopData, samplenames, samplepops). This method takes a vector of sample names and a vector of their new population IDs.
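As an illustration of why the name-based approach leaves missing values under your control, here is a sketch in plain Julia. It is not the PopGen.jl implementation: `samples` and `pops` stand in for the name and population columns of the metadata, and `assign_pops!` is a hypothetical helper. Only the samples you list are touched, so any missing entry you do not name stays missing.

```julia
# Hypothetical stand-ins for the sample-name and population columns (not PopGen.jl code)
samples = ["s1", "s2", "s3", "s4"]
pops = Union{Missing,String}["north", missing, "south", missing]

# Hypothetical helper: set the population of each named sample, leaving all others as they are
function assign_pops!(pops, samples, samplenames, samplepops)
    length(samplenames) == length(samplepops) ||
        throw(ArgumentError("samplenames and samplepops must have the same length"))
    for (name, pop) in zip(samplenames, samplepops)
        idx = findfirst(==(name), samples)
        idx === nothing && throw(ArgumentError("sample $name not found"))
        pops[idx] = pop
    end
    return pops
end

assign_pops!(pops, samples, ["s2"], ["west"])
# pops is now ["north", "west", "south", missing]: s2 was filled in, s4 is still missing
```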

pdimens commented 3 years ago

0.7.0 is pending release, and as soon as it's merged into the General Registry, I will rebuild the docs with the updates we've discussed. Thanks 😄