HighlanderLab / tree_seq_pipeline

Pipeline to infer tree sequences with different datasets
MIT License

How to handle chromosome names? #41

Open janaobsteter opened 10 months ago

janaobsteter commented 10 months ago

Currently, we read the number of chromosomes from the config file and then loop over range(1, nChromosomes + 1). But what if we have non-numeric chromosome names in there, such as other contigs or the mitochondrial genome?
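One possible direction, sketched under assumptions: let the config hold either a count (the current behaviour) or an explicit list of names, and always iterate over strings. The config key `chromosomes` and the helper `chromosome_names` are hypothetical and do not reflect the pipeline's actual config layout.

```python
# Hypothetical config: the key name "chromosomes" is an assumption,
# not the pipeline's actual config layout.
config = {"chromosomes": ["1", "2", "X", "MT", "contig_17"]}


def chromosome_names(config):
    """Return chromosome names as strings.

    Accepts either an explicit list of names or a plain count,
    so existing numeric configs would keep working.
    """
    value = config["chromosomes"]
    if isinstance(value, int):
        # Backwards-compatible path: a count n expands to "1".."n".
        return [str(i) for i in range(1, value + 1)]
    return [str(name) for name in value]


names = chromosome_names(config)
for name in names:
    # Downstream steps would use the name as an opaque label,
    # never as a number.
    print(f"processing chromosome {name}")
```

Treating names as opaque strings means nothing downstream needs to know whether a chromosome is numeric.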

gregorgorjanc commented 10 months ago

Maybe follow what stdpopsim does?

hannesbecher commented 9 months ago

I think it would be useful to have a text file with chromosome names and lengths. See the genome file format used by bedtools. This has one chromosome per line: the chromosome's name, a tab, and the chromosome's length:

$ cat my.genome
chr1  1000
chr2  500
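Reading such a file is straightforward. A minimal sketch, assuming tab-separated name and length columns as described above; the function name `read_genome_file` is made up for illustration:

```python
import io


def read_genome_file(handle):
    """Parse a bedtools-style genome file (name<TAB>length per line).

    Returns a dict mapping chromosome name to length; dicts keep
    insertion order, so the file order is preserved.
    """
    lengths = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected name<TAB>length")
        lengths[fields[0]] = int(fields[1])
    return lengths


# In-memory stand-in for the my.genome file shown above.
example = io.StringIO("chr1\t1000\nchr2\t500\n")
genome = read_genome_file(example)
```

The keys of the resulting dict could then serve as the list of chromosomes to loop over, whatever their names are.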

Should a genome file be generated as part of this pipeline?

This would be easy if the entry point was one multi-chromosome VCF file. The file could be parsed and each chromosome's highest variant position could be used as the chromosome length. It would also be easy if a genome FASTA file was available.
But it could be tricky if the entry point is multiple VCF files.
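A rough sketch of the VCF idea, assuming uncompressed VCF text and only the standard CHROM and POS columns. It takes any number of files and keeps the highest position seen per chromosome, which might also cover the multi-file case. Note that the highest variant position is only a lower bound on the true chromosome length. The function name `lengths_from_vcfs` is hypothetical.

```python
import io


def lengths_from_vcfs(handles):
    """Approximate chromosome lengths as the highest variant position
    seen per chromosome across one or more VCF text streams."""
    lengths = {}
    for handle in handles:
        for line in handle:
            # Skip meta-information lines, the header line, and blanks.
            if line.startswith("#") or not line.strip():
                continue
            # The first two VCF columns are CHROM and POS.
            chrom, pos = line.split("\t", 2)[:2]
            lengths[chrom] = max(lengths.get(chrom, 0), int(pos))
    return lengths


# Two tiny in-memory VCFs standing in for per-chromosome input files.
vcf_a = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr1\t100\t.\tA\tG\n"
    "chr1\t950\t.\tC\tT\n"
)
vcf_b = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "chr2\t480\t.\tG\tA\n"
    "MT\t16000\t.\tT\tC\n"
)
approx_lengths = lengths_from_vcfs([vcf_a, vcf_b])
```

For compressed or indexed VCFs a dedicated library would be the better choice; this only illustrates the logic.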

Alternatively, we might require the genome file as an additional input, and we could supply a script to generate such a file from VCF/genome FASTA.
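Such a helper script could look roughly like this for the FASTA case. This is a sketch only, assuming plain (uncompressed) FASTA and taking the sequence name to be the first whitespace-delimited word of each header line; `fasta_lengths` and `write_genome_file` are made-up names.

```python
import io


def fasta_lengths(handle):
    """Return {sequence name: length} for a plain-text FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Name = first word after ">", the rest is description.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def write_genome_file(lengths, handle):
    """Write name<TAB>length lines in bedtools genome file format."""
    for name, length in lengths.items():
        handle.write(f"{name}\t{length}\n")


# Tiny in-memory FASTA standing in for a real genome file.
fasta = io.StringIO(">chr1 test sequence\nACGT\nAC\n>MT\nACG\n")
lengths = fasta_lengths(fasta)

out = io.StringIO()
write_genome_file(lengths, out)
genome_text = out.getvalue()
```

If a samtools .fai index already exists for the FASTA, its first two columns carry the same name and length information.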

Opinions? @gregorgorjanc @gmafrafortuna @janaobsteter

Generally, stdpopsim sounds good, but we may also want to run this pipeline on small test datasets and on organisms that are not in stdpopsim at the moment.