ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

stderr=FAILURE: bad fasta character #1466

Open macmanes opened 2 months ago

macmanes commented 2 months ago

Hi all, I'm having many jobs fail with this type of error, which seems to indicate a poorly formatted fasta file. One issue is that I can't find the offending file, GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa. I'm not sure whether this is related to some upstream error that was not handled properly.

=========>                                                                                                                                                                                                                                              
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---                                                                                                                                                          
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host node146.rcchpc.                                                                                                  
        [2024-08-20T08:17:36-0400] [MainThread] [I] [toil.worker] Working on job 'run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10                                                                                                                     
        [2024-08-20T08:17:37-0400] [MainThread] [I] [toil.worker] Loaded body Job('run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10) from description 'run_lastz' kind-run_lastz/a/instance-f8hw5ee6 v10                                               
        [2024-08-20T08:17:38-0400] [MainThread] [I] [toil.statsAndLogging] For distance 0.022243274 for genomes files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-6e5a082d6ee9443e87f430a098014ee4/5.fa, files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-18a0b457ba4041949eb618d520e4aebf/5.fa using --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition lastz parameters
        [2024-08-20T08:17:38-0400] [MainThread] [I] [cactus.shared.common] Running the command ['lastz', 'GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace]', 'sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=2', '--ambiguous=iupac,100,100', '--ydrop=3000', '--notransition']
        [2024-08-20T08:17:38-0400] [MainThread] [I] [toil-rt] 2024-08-20 08:17:38.327880: Running the command: "lastz GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace] sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace] --format=paf:minimap2 --step=2 --ambiguous=iupac,100,100 --ydrop=3000 --notransition"
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:                                                                                                                                      
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-6e5a082d6ee9443e87f430a098014ee4/5.fa' to path '/tmp/toilwf-fffa0892ca3c520588bda60e7418db98/93ae/job/tmphkrn84lv/GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa'
        [2024-08-20T08:17:38-0400] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-2auibmvz/cleanup/file-18a0b457ba4041949eb618d520e4aebf/5.fa' to path '/tmp/toilwf-fffa0892ca3c520588bda60e7418db98/93ae/job/tmphkrn84lv/sandy.combined.contigs.arrow.purged_5.fa'
        [2024-08-20T08:17:38-0400] [MainThread] [C] [toil.worker] Worker crashed with traceback:                                                                                                                                                        
        Traceback (most recent call last):                                                                                                                                                                                                              
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/worker.py", line 438, in workerScript                                                                                                            
            job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)                                                                                                                                                            
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 2984, in _runner                                                                                                                   
            returnValues = self._run(jobGraph=None, fileStore=fileStore)                                                                                                                                                                                
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 2895, in _run                                                                                                                      
            return self.run(fileStore)                                                                                                                                                                                                                  
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/toil/job.py", line 3158, in run                                                                                                                       
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)                                                                                                                                                                       
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/cactus/paf/local_alignment.py", line 67, in run_lastz                                                                                                 
            segalign_messages = cactus_call(parameters=lastz_cmd, outfile=alignment_file, work_dir=work_dir, returnStdErr=gpu>0, gpus=gpu,                                                                                                              
          File "/mnt/gpfs01/software/anaconda/colsa/envs/cactus-2.9.0/lib/python3.8/site-packages/cactus/shared/common.py", line 910, in cactus_call                                                                                                    
            raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))                                                                                                                                          
        RuntimeError: Command ['lastz', 'GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa[multiple][nameparse=darkspace]', 'sandy.combined.contigs.arrow.purged_5.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=2', '--ambiguous=iupac,100,100', '--ydrop=3000', '--notransition'] exited 1: stderr=FAILURE: bad fasta character in GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa, >id=GCF_011100685.1_UU_Cfam_GSD_1.0|NC_049228.1|81081596|0 (greater than sign ">")
        remove or replace non-ACGTN characters or consider using --ambiguous=iupac                                                                                                                                                                      

        [2024-08-20T08:17:38-0400] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host node146.rcchpc                                                                                                                     
<=========                                                                                           

Any help here greatly appreciated!

macmanes commented 2 months ago

Note that searching the original file for illicit characters does not return anything:

grep -v '^>' GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.fna.masked | grep -o '[^ATCGactgNn]'

glennhickey commented 2 months ago

That's a new one. Given the error says (greater than sign ">"), perhaps it's an empty sequence name that's triggering it? Cactus does its own check for non-agctn characters well upstream of this, so it's likely something less simple.

You should be able to confirm by creating a work directory (I'm using ./work below) and rerunning the failing command with these flags added:

--restart --caching false --cleanWorkDir never --workDir ./work

then use find ./work to locate the fasta files being input into lastz, and inspect them yourself.
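As a rough illustration of that last step, here is a minimal Python sketch that walks a preserved work directory and prints the paths of the lastz inputs. The function name `find_lastz_inputs` is hypothetical, and the file names are simply the ones shown in the log above; the layout of the directories under ./work is not assumed, so the search is recursive.

```python
from pathlib import Path


def find_lastz_inputs(work_dir, basenames):
    """Return sorted paths under work_dir whose file name is in basenames.

    Hypothetical helper: it only matches on file names and makes no
    assumption about how Toil lays out its temporary directories.
    """
    wanted = set(basenames)
    return sorted(
        p for p in Path(work_dir).rglob("*")
        if p.is_file() and p.name in wanted
    )


if __name__ == "__main__":
    # File names copied from the failing lastz command in the log.
    names = [
        "GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa",
        "sandy.combined.contigs.arrow.purged_5.fa",
    ]
    for path in find_lastz_inputs("./work", names):
        print(path)
```

If several jobs were rerun, the same file name may appear more than once, so each printed path is worth inspecting separately.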

macmanes commented 2 months ago

Yup, so there are issues with some of the fasta files, but not empty headers.

The first lines of one of the offending fasta files (which I wrapped to make parsing easier) look good:

Line 1: >id=GCF_009873245.2_mBalMus1.pri.v3|NC_045787.1|171266408|90000000
Line 2: ctttaatccattttgagtttatttttgtgtgtggtgttaggaagtgttctaatttcattcttttacatgtagctgtccagttttcccagcaccacttatt
Line 3: gaagaggctgtcttttctccactgtatattcttgcctcctttgtcaaagataaggtgaccatatctgcgtgggtttatctctgggctttctatcctgttc

But somewhere in the middle of the file, a newline was missed before a second fasta entry, so its header ended up inside the sequence:

Line 783808: attgacacgtggcactgaacccagagtggcaagtcttccccgtttcccagagaacccacaattccccgtcctatgtgaaatcccccaagttttaaatacc
Line 783810: GACATTATAGATACATTTGATAATTAAAAGGAATAGTACGTATTCCAGCTAGGAGGAGGAGCCCTCCTTTTCGACTGGTTTTAGTCGATTAAGAAGGTTG
Line 783811: TGGGGTTTTGTATGTATGTTAAGATGATACCAGTTTTTGTCTTCATCACGGCTCTGAGCTGTTCAGATAGCTTATTCATCTAAGGTGAG>id=GCF_009
Line 783812: 873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0acaaggagtagcccccactagccacaactagaggaagtccacatgcagcaat
Line 783813: gaagacacaacgcagccaaaaataaataatgaataaataaataagttaattaattaattaattaaaaaaataagagtagagtggaaattcaggaagttga

I certainly hope this is not a fatal flaw requiring a total restart, but I suspect it is. Wondering about the cause here. Any ideas?
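A quick way to find every spot like this is to look for a ">" that is not the first character of its line. The following is a minimal sketch, not part of Cactus or lastz; the function name `find_embedded_headers` is hypothetical, and it assumes that ">" never legitimately appears inside a header's description text.

```python
import sys


def find_embedded_headers(path):
    """Yield (line_number, column) for each '>' that is not at column 0.

    A '>' at the start of a line is a normal fasta header; one that
    appears later in a line suggests a missing newline before a record.
    Line numbers are 1-based, columns are 0-based.
    """
    with open(path, "r") as handle:
        for line_number, line in enumerate(handle, start=1):
            column = line.find(">", 1)  # skip a legitimate leading '>'
            while column != -1:
                yield line_number, column
                column = line.find(">", column + 1)


if __name__ == "__main__":
    # Usage: python find_embedded_headers.py some_chunk.fa
    for line_number, column in find_embedded_headers(sys.argv[1]):
        print(f"line {line_number}, column {column}: embedded '>'")
```

Running it on both the original input and the chunk pulled from the work directory would show whether the damage is already present in the input or only appears in the intermediate file.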

macmanes commented 2 months ago

@glennhickey any chance it's the pipes in the fasta headers that are causing this issue?

glennhickey commented 2 months ago

Could be. When I try to change some names in the test data to look like yours, I get an error right away:

RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'id=simMouse_chr6|873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0' in 'simMouse_chr6'
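To pre-screen input files against the rule stated in that message, a sketch along the following lines may help. It is an approximation written from the wording of the error only (alphanumeric characters plus '_', '-', ':' and '.' in the first word of each header), not Cactus's actual check; the name `invalid_headers` is hypothetical.

```python
import re

# Characters named in the error message: alphanumerics plus '_', '-', ':', '.'.
ALLOWED_FIRST_WORD = re.compile(r"[A-Za-z0-9_\-:.]+")


def invalid_headers(path):
    """Return the first word of every fasta header that breaks the rule.

    An empty header (a bare '>') is reported as an empty string.
    """
    bad = []
    with open(path, "r") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            words = line[1:].split()
            first_word = words[0] if words else ""
            if not ALLOWED_FIRST_WORD.fullmatch(first_word):
                bad.append(first_word)
    return bad
```

For example, a header such as `>NC_045788.1 some description` passes, while one whose first word contains a `|` is reported.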
macmanes commented 2 months ago

It's funny that I don't get that error until much later. Does "sanitize_fasta" deal with |'s? They are sadly common in NCBI-downloaded genomes.

Anyway, I went for a clean, full restart to see whether the error is reproducible or whether it could have been related to some transient(?) read/write issue.

glennhickey commented 2 months ago

That check is on by default because the UCSC Genome Browser doesn't (or didn't) support these characters in assembly hubs. If I disable the check by setting checkAssemblyHub="0" in the config, then my test runs through fine.

halStats em.hal --sequenceStats simMouse_chr6
SequenceName, Length, NumTopSegments, NumBottomSegments
873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0, 636262, 46692, 0
873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0, 850, 104, 0
873245.2_mBalMus1.pri.v4|NC_045788.1|144968589|0, 1250, 129, 0

But since the check is on by default, I don't know why it didn't complain for you. It's not in cactus_sanitize_fasta_headers but slightly upstream when running cactus.

In any case, I do not know what caused your error, and suggest double checking your input file. But if you are sure it's cactus causing the problem, please send me the input so I can try to reproduce.