ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Preprocess :start-end samtools-style path ranges in pangenome input #1319

Closed glennhickey closed 7 months ago

glennhickey commented 8 months ago

cactus-pangenome will crash if any input FASTA contig name has a suffix like :10-100 denoting a subpath range (a fairly standard annotation that, e.g., samtools faidx uses; note that it is 1-based with an inclusive end). This seems to be because subranges of this form are already used in the GAF path steps coming out of minigraph, which confuses a parsing step in Cactus.

Unrelated to the crash, there is already logic to convert subranges to _sub_9_100 (0-based, exclusive end) going into pangenome HALs, because the genome browser can't (couldn't?) handle : and/or -.

This PR just moves that existing logic earlier: paths of the form chr1:10-100 are converted right away to chr1_sub_9_100 in cactus_sanitizeFastaHeaders. The end result is that 1) the crash is fixed and 2) the subrange information is correctly preserved through to the output.
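For illustration only, the coordinate conversion described above could be sketched in Python roughly as follows. This is not the code in cactus_sanitizeFastaHeaders; the function name `convert_subrange` and the regular expression are assumptions made for this example. It only shows the arithmetic: a 1-based, end-inclusive `:start-end` suffix becomes a 0-based, end-exclusive `_sub_start_end` suffix, so the start drops by one and the end is unchanged.

```python
import re

# Hypothetical helper, not part of Cactus: matches a trailing ":start-end"
# suffix in the samtools style (1-based, end-inclusive).
_SUBRANGE = re.compile(r"^(.*):(\d+)-(\d+)$")


def convert_subrange(name):
    """Rewrite 'contig:start-end' as 'contig_sub_<start-1>_<end>'.

    Names without a trailing numeric range are returned unchanged.
    """
    match = _SUBRANGE.match(name)
    if match is None:
        return name
    contig = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    # 1-based inclusive [start, end] covers the same bases as
    # 0-based exclusive [start - 1, end).
    return f"{contig}_sub_{start - 1}_{end}"


print(convert_subrange("chr1:10-100"))  # chr1_sub_9_100
```

Both forms describe the same 91 bases: 100 - 10 + 1 in the inclusive form and 100 - 9 in the exclusive form.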

resolves #1287

glennhickey commented 8 months ago

In light of this comment, which shows that the crash happens even when the colon isn't denoting a subrange, I added a global conversion of : to _ for contigs coming into cactus-pangenome. The subrange logic mentioned above is still applied, but any stray colons beyond that are now turned into _.
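As a rough sketch of the combined behaviour (an assumption-laden illustration in Python, not the actual implementation), the two steps could be ordered like this: first rewrite a trailing `:start-end` range, then replace any colons that remain. The name `sanitize_contig_name` is hypothetical.

```python
import re


def sanitize_contig_name(name):
    """Hypothetical sketch: subrange conversion, then global ':' -> '_'."""
    # Step 1: a trailing ":start-end" (1-based, inclusive) becomes
    # "_sub_<start-1>_<end>" (0-based, exclusive end).
    match = re.match(r"^(.*):(\d+)-(\d+)$", name)
    if match:
        start = int(match.group(2)) - 1
        end = int(match.group(3))
        name = f"{match.group(1)}_sub_{start}_{end}"
    # Step 2: any colon that was not part of a subrange suffix is
    # replaced with an underscore.
    return name.replace(":", "_")


print(sanitize_contig_name("chr1:10-100"))  # chr1_sub_9_100
print(sanitize_contig_name("HLA:DRB1"))     # HLA_DRB1
```

Running the subrange step first matters in this sketch: if the colons were replaced first, the range suffix could no longer be recognised and the coordinate shift would be lost.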