ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

Add checks for invalid sequence characters when writing temporary FASTA files #1529

Closed glennhickey closed 3 days ago

glennhickey commented 3 days ago

There seem to be gremlins in Cactus causing invalid FASTA files. In cactus-pangenome, this has manifested as slightly truncated files being imported into chromosome alignment jobs. And in progressive Cactus, there are a couple of issues, #1466 and #1525, reporting corrupt FASTA chunks (missing newline?) going into lastz.

I still don't know what the underlying problem is, though on the pangenome side it really looks like the corruption is happening at the filesystem level.

This PR just adds some asserts to try to catch these errors a little earlier, to (hopefully) make debugging easier if / when they come up again.
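
For illustration only, here is a rough sketch of the kind of check described above. It is not the code in this PR: `check_fasta_text`, the `source` label, and the character whitelist (IUPAC nucleotide codes plus `*` and `-`) are all hypothetical, and treating an empty record as an error is an assumption as well.

```python
import re

# Assumed whitelist: IUPAC nucleotide codes (either case) plus '*' and '-'.
# This is an illustrative choice, not necessarily what Cactus accepts.
_VALID_SEQ_LINE = re.compile(r"[ACGTUNRYSWKMBDHVacgtunryswkmbdhv*-]*")


def check_fasta_text(text, source="<fasta>"):
    """Raise ValueError if text does not look like intact FASTA.

    Returns the number of records found. Hypothetical helper.
    """
    # A truncated write often shows up as a missing final newline.
    if not text.endswith("\n"):
        raise ValueError(f"{source}: missing trailing newline (truncated?)")

    records = 0
    seq_len = None  # None until the first header has been seen
    for lineno, line in enumerate(text.split("\n")[:-1], start=1):
        if line.startswith(">"):
            if len(line) == 1:
                raise ValueError(f"{source}:{lineno}: empty header")
            if seq_len == 0:
                raise ValueError(f"{source}:{lineno}: previous record is empty")
            records += 1
            seq_len = 0
        else:
            if seq_len is None:
                raise ValueError(f"{source}:{lineno}: sequence before header")
            # A lost newline fuses a header into a sequence line, so a '>'
            # (or any other stray byte) in the middle of a line is caught here.
            if not _VALID_SEQ_LINE.fullmatch(line):
                raise ValueError(f"{source}:{lineno}: invalid sequence character")
            seq_len += len(line)

    if seq_len == 0:
        raise ValueError(f"{source}: last record is empty")
    return records
```

Running something like this right after a temporary FASTA file is written, and again just before it is handed to a downstream tool, would help narrow down where the corruption is introduced.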