ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Don't name-munge underscores in maf/taf processing #1425

Open glennhickey opened 5 days ago

glennhickey commented 5 days ago

Because some MAF tools use . as a special delimiter to separate species and contig names, care must be taken to make sure that species names themselves don't contain .s (which they often do, now that we're using accessions) when they go through some MAF and TAF tools.
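To make the delimiter problem concrete, here is a minimal sketch, assuming a tool that treats everything before the first . as the species name. The `split_maf_name` helper is hypothetical and is not taken from cactus or any particular MAF tool.

```python
def split_maf_name(name):
    """Split a MAF sequence name into (species, contig).

    Assumption for illustration: the tool treats everything before
    the first '.' as the species and the rest as the contig.
    """
    species, _, contig = name.partition(".")
    return species, contig


# A classic assembly name splits the way you would expect.
plain = split_maf_name("hg38.chr1")

# An accession with a version suffix carries its own '.', so the
# species name is cut short and the version leaks into the contig.
accession = split_maf_name("GCA_000001405.15.chr1")

print(plain)
print(accession)
```

Under that assumption, the accession is split in the wrong place, which is why names with dots get munged before going through these tools.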

I've got a case now that seems weirdly slow, though:

2283678 hickey    20   0   19256  17680   1896 R  76.3   0.0  38:34.11 sed -f alnum_to_genome.sed                                       
2281905 hickey    20   0   19256  17668   1896 R  74.6   0.0  47:37.07 sed -f alnum_to_genome.sed                                       
2286745 hickey    20   0   44860  43308   1944 R  71.2   0.0  11:46.78 sed -f genome_to_alnum.sed 

No idea why sed is so slow, but it's only being run here because the names have underscores, which really don't need munging. So this PR just lets underscores go through.
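As a rough illustration of the idea (not the actual cactus code), the sketch below builds a rename map only for names containing characters outside letters, digits, and underscores. The `build_name_map` helper, the `UNSAFE` pattern, and the collision suffix scheme are all assumptions made for this example.

```python
import re

# Assumption: only characters outside [A-Za-z0-9_] need to be hidden
# from the MAF/TAF tools; underscores are treated as safe.
UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def build_name_map(genome_names):
    """Map each genome name that needs munging to a safe stand-in.

    Names made up only of letters, digits, and underscores are left
    out of the map, so no substitution pass is needed for them.
    """
    # Names that pass through untouched are reserved up front.
    used = {n for n in genome_names if not UNSAFE.search(n)}
    name_map = {}
    for name in genome_names:
        if not UNSAFE.search(name):
            continue  # nothing to munge, including names with underscores
        safe = UNSAFE.sub("", name)
        candidate = safe
        suffix = 0
        # Hypothetical collision handling: append a counter until unique.
        while candidate in used:
            suffix += 1
            candidate = f"{safe}x{suffix}"
        used.add(candidate)
        name_map[name] = candidate
    return name_map


names = ["hg38", "mm_10", "GCA_000001405.15"]
name_map = build_name_map(names)
print(name_map)
```

With an empty map, the substitution step can be skipped entirely, which is the case this change is meant to reach when underscores are the only unusual character.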

This logic probably needs revising in the future, since it seems like a real dumb potential bottleneck.