MesserLab / SLiM

SLiM is a genetically explicit forward simulation software package for population genetics and evolutionary biology. It is highly flexible, with a built-in scripting language, and has a cross-platform graphical modeling environment called SLiMgui.
https://messerlab.org/slim/
GNU General Public License v3.0

encode which individuals are being output in a VCF somehow #43

Closed: petrelharp closed this issue 5 years ago

petrelharp commented 5 years ago

As discussed in #42, it ought to be possible to tell from the VCF which individuals have been output. This only makes sense when individuals actually have unique identifiers, which I think is the case only if they have pedigreeIDs.

There, Ben said:

So assuming that works, that gets you identifiers on the .trees side. On the VCF side, it's a good idea for each genome in the VCF file to be annotated with the same identifiers, but that presently is not done. Where would such per-sample information typically be put, in a VCF file? I'm looking at the VCF 4 spec and not seeing an obvious spot for per-sample annotations, which perhaps is why I didn't already do this (but probably I'm just missing the obvious). :-> If you can suggest a good way for it to be encoded in the VCF, I could check that improvement in on GitHub within a day or two.

The last line of the header holds the sample IDs, which you currently fill in as i0 i1 i2 .... These could be replaced with something like i... if pedigree IDs are defined. This would be a good idea, I think.

That would still provide only the individual IDs, whereas it is likely the genome IDs that people want to get at.

Hm, I'm not sure about that. Most properties are individual-based (e.g., phenotype), and in both SLiM and the tree sequence you can get the genome IDs from the individual IDs.
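
For reference, here is a minimal Python sketch of reading the sample IDs from that last header line. It assumes the standard VCF layout of nine fixed columns (CHROM through FORMAT) followed by one column per sample. The function name `sample_ids` and the example text are made up for illustration and are not actual SLiM output.

```python
# Fixed columns that precede the per-sample columns in a VCF with genotypes:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
N_FIXED_COLUMNS = 9

def sample_ids(vcf_text):
    """Return the sample IDs from the #CHROM header line of VCF text."""
    for line in vcf_text.splitlines():
        if line.startswith("#CHROM"):
            return line.split("\t")[N_FIXED_COLUMNS:]
        if not line.startswith("#"):
            break  # reached data records without seeing the header line
    raise ValueError("no #CHROM header line found")

# Hypothetical example input, not real SLiM output
example = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
               "FILTER", "INFO", "FORMAT", "i0", "i1", "i2"]),
    "\t".join(["1", "100", ".", "A", "T", "1000",
               "PASS", ".", "GT", "0|1", "0|0", "1|1"]),
])

print(sample_ids(example))
```

Whatever identifier scheme ends up in those columns, downstream tools would pick it up from this one line, which is why it seems like the natural place for per-sample IDs.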

bhaller commented 5 years ago

I've just committed a fix for this to the master branch. Note that VCF output now contains the genome pedigree IDs for each output genome, but does not contain the individual pedigree IDs. This is because in the general case in SLiM there is no guarantee that the two genomes assembled into a diploid "sample" for purposes of VCF output actually come from the same individual, so in the general case the individual IDs for the "samples" are not in fact well-defined. In many cases they do happen to be consistent, but there is no guarantee that that is the case, and it would be confusing if the individual pedigree IDs were sometimes provided but sometimes missing. Better to just provide the genome pedigree IDs, which are always well-defined; the user can easily transform them into individual pedigree IDs if desired.
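
As a rough sketch of that last transformation: the SLiM manual describes the two genomes of an individual with pedigree ID p as having genome pedigree IDs 2p and 2p+1. Assuming that convention holds, the individual ID can be recovered by integer division. The function names below are hypothetical, and how the genome IDs are laid out in the VCF file is not shown here.

```python
def individual_id(genome_id):
    """Individual pedigree ID for a genome pedigree ID.

    Assumes the convention that an individual with pedigree ID p
    owns genomes 2*p and 2*p + 1.
    """
    return genome_id // 2

def sample_individual(genome_id_1, genome_id_2):
    """Return the shared individual ID of a diploid VCF sample, or None
    if the two genomes do not form one individual's pair."""
    if genome_id_1 == genome_id_2:
        return None  # the same genome twice is not a valid pair
    a = individual_id(genome_id_1)
    b = individual_id(genome_id_2)
    return a if a == b else None

print(sample_individual(10, 11))  # genomes 10 and 11 both map to individual 5
print(sample_individual(10, 13))  # individuals 5 and 6 differ, so None
```

Returning None for mismatched genomes mirrors the point above: the individual ID of a sample is only well-defined when both genomes come from the same individual.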