There seem to be gremlins in Cactus causing invalid FASTA files. In cactus-pangenome, this has manifested as slightly truncated files being imported into chromosome alignment jobs. In progressive Cactus, a couple of issues (#1466, #1525) report corrupt FASTA chunks (missing newline?) going into lastz.
I still don't know what the underlying problem is, though on the pangenome side it really looks like the corruption is happening at the filesystem level.
This PR just adds some asserts to try to catch these errors a little earlier, to (hopefully) make debugging easier if / when they come up again.
cactus_sanitizeFastaHeaders now checks the sequence in addition to headers and reports an error if a non-ACGTN character is found.
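For illustration only, here is a minimal Python sketch of the kind of sequence check described above. It is not the code in cactus_sanitizeFastaHeaders; the function name first_invalid_base is hypothetical, and accepting lowercase (soft-masked) bases is an assumption on my part.

```python
from typing import Iterable, Optional, Tuple

# Assumption: soft-masked (lowercase) bases are accepted alongside uppercase.
VALID_BASES = frozenset("ACGTNacgtn")


def first_invalid_base(lines: Iterable[str]) -> Optional[Tuple[int, int, str]]:
    """Return (line_number, column, char) for the first invalid sequence
    character, or None if every sequence line is clean. Both are 1-based."""
    for line_number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text or text.startswith(">"):
            continue  # blank lines and header lines are not sequence
        for column, char in enumerate(text, start=1):
            if char not in VALID_BASES:
                return line_number, column, char
    return None
```

A missing newline between records would show up here as a ">" in the middle of a sequence line, which is the sort of corruption suspected in the lastz reports.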
faffy chunk and faffy extract now check (with an assertion) that they only write valid sequence characters, because in #1466 it looks like the invalid sequence is coming out of faffy chunk.
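As a rough sketch of the write-side idea (not faffy's actual code, which is not Python), the check can sit directly in front of the write so a bad chunk fails before it reaches disk. The name write_chunk, the 60-column wrapping, and the lowercase-allowed alphabet are all assumptions for illustration.

```python
import io
from typing import TextIO

# Assumption: same alphabet as the input check, lowercase allowed.
VALID_BASES = frozenset("ACGTNacgtn")


def write_chunk(out: TextIO, header: str, sequence: str, width: int = 60) -> None:
    """Write one FASTA record, asserting the sequence is valid first."""
    bad = [(i, c) for i, c in enumerate(sequence) if c not in VALID_BASES]
    # Fail before writing anything, so no partial record is left behind.
    assert not bad, (
        f"invalid character {bad[0][1]!r} at offset {bad[0][0]} in {header!r}"
        if bad else ""
    )
    out.write(f">{header}\n")
    for start in range(0, len(sequence), width):
        out.write(sequence[start:start + width] + "\n")


if __name__ == "__main__":
    buf = io.StringIO()
    write_chunk(buf, "chunk_0", "ACGTNNacgt", width=4)
    print(buf.getvalue(), end="")
```

Note that Python drops assert statements under -O, so a real Python tool would likely raise an explicit exception instead; assert is used here only to mirror the approach in this PR.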