Open macmanes opened 3 months ago
Note: searching the original file for illicit characters does not return anything.
grep -v '^>' GCF_011100685.1_UU_Cfam_GSD_1.0_genomic.fna.masked | grep -o '[^ATCGactgNn]'
That's a new one. Given that the error mentions the greater-than sign (">"), perhaps it's an empty sequence name that's triggering it? Cactus does its own check for non-agctn characters well upstream of this, so it's likely something less simple.
You should be able to confirm by creating a work directory (I'm using ./work below) and rerunning the failing command with these flags added:
--restart --caching false --cleanWorkDir never --workDir ./work
Then use find ./work to pull out the relevant fasta files being input into lastz, and inspect them yourself.
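For anyone who prefers a script to a manual find, here is a minimal sketch of one way to locate candidate files. It assumes (this is not something the thread confirms) that the lastz inputs are ordinary files somewhere under the work directory and that they start with a ">" header. The function name find_fasta_like_files is hypothetical and not part of cactus.

```python
import os

def find_fasta_like_files(work_dir):
    """Return sorted paths under work_dir whose first byte is '>' (a FASTA header start)."""
    matches = []
    for root, _dirs, files in os.walk(work_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    # Only the first byte is read, so large genomes are cheap to scan.
                    if handle.read(1) == b">":
                        matches.append(path)
            except OSError:
                continue  # unreadable file (permissions, broken symlink): skip it
    return sorted(matches)

if __name__ == "__main__":
    # "./work" matches the --workDir value used above.
    for path in find_fasta_like_files("./work"):
        print(path)
```

This only narrows down which files look like fasta; it does not say which of them lastz actually read.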
Yup, so there are some issues with some of the fasta files, but not empty headers.
1st lines of one of the offending fasta files (which I wrapped to make parsing easier) - looks good here.
Line 1: >id=GCF_009873245.2_mBalMus1.pri.v3|NC_045787.1|171266408|90000000
Line 2: ctttaatccattttgagtttatttttgtgtgtggtgttaggaagtgttctaatttcattcttttacatgtagctgtccagttttcccagcaccacttatt
Line 3: gaagaggctgtcttttctccactgtatattcttgcctcctttgtcaaagataaggtgaccatatctgcgtgggtttatctctgggctttctatcctgttc
Somewhere in the middle of the file, the newline before a 2nd fasta header is missing, so the header is fused onto the end of a sequence line (see lines 783811-783812).
Line 783808: attgacacgtggcactgaacccagagtggcaagtcttccccgtttcccagagaacccacaattccccgtcctatgtgaaatcccccaagttttaaatacc
Line 783810: GACATTATAGATACATTTGATAATTAAAAGGAATAGTACGTATTCCAGCTAGGAGGAGGAGCCCTCCTTTTCGACTGGTTTTAGTCGATTAAGAAGGTTG
Line 783811: TGGGGTTTTGTATGTATGTTAAGATGATACCAGTTTTTGTCTTCATCACGGCTCTGAGCTGTTCAGATAGCTTATTCATCTAAGGTGAG>id=GCF_009
Line 783812: 873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0acaaggagtagcccccactagccacaactagaggaagtccacatgcagcaat
Line 783813: gaagacacaacgcagccaaaaataaataatgaataaataaataagttaattaattaattaattaaaaaaataagagtagagtggaaattcaggaagttga
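The corruption shown above (a ">" header fused onto the end of a sequence line) is not caught by a grep that first drops lines starting with ">" only if the class allows ">", so a dedicated check can help. The following is a rough sketch, not anything cactus itself runs; the function name find_embedded_headers is made up for illustration, and it assumes headers never legitimately contain a second ">".

```python
import io

def find_embedded_headers(handle):
    """Yield (line_number, column) for every '>' that is not at the start of a line."""
    for line_number, line in enumerate(handle, start=1):
        # A '>' at column 0 is a normal header start; anything later is suspicious.
        column = line.find(">", 1)
        while column != -1:
            yield line_number, column
            column = line.find(">", column + 1)

# Tiny made-up example mimicking the fused header described above.
example = io.StringIO(
    ">seq1\n"
    "ACGTACGT\n"
    "GGTGAG>seq2\n"
    "acaagg\n"
)
hits = list(find_embedded_headers(example))
print(hits)  # [(3, 6)]
```

On a real file, pass an open file handle instead of the StringIO; line numbers are 1-based and columns are 0-based.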
I certainly hope this is not a fatal flaw requiring a total restart, but I suspect it is. Wondering about the cause here. Any ideas?
@glennhickey any chance it's the pipes in the fasta headers that are causing this issue?
Could be. When I try to change some names in the test data to look like yours, I get an error right away
RuntimeError: An invalid character was found in the first word of a fasta header. Acceptable characters for headers in an assembly hub include alphanumeric characters plus '_', '-', ':', and '.'. Please modify your headers to eliminate other characters. The offending header: 'id=simMouse_chr6|873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0' in 'simMouse_chr6'
it's funny that I don't get that error until much later - does "sanitize_fasta" deal with |'s? They are sadly common in NCBI downloaded genomes.
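To pre-screen headers before a long run, something like the following sketch could be used. It simply encodes the character set quoted in the error message (alphanumerics plus '_', '-', ':' and '.') and applies it to the first word of a header. It is an assumption-laden illustration, not the check cactus actually performs, and first_word_is_hub_safe is a hypothetical name.

```python
import re

# Allowed characters as quoted in the error message:
# alphanumerics plus '_', '-', ':' and '.'.
ALLOWED_FIRST_WORD = re.compile(r"[A-Za-z0-9_\-:.]+")

def first_word_is_hub_safe(header):
    """Return True if the first whitespace-delimited word of a FASTA header is hub-safe."""
    words = header.lstrip(">").split()
    if not words:
        return False  # an empty header has no usable name
    return ALLOWED_FIRST_WORD.fullmatch(words[0]) is not None

print(first_word_is_hub_safe(">NC_045788.1 some description"))
print(first_word_is_hub_safe(">873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0"))
```

The first call passes because only the first word is examined; the second fails because of the pipes.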
Anyway, I went for the clean/full restart to see if the error is reproducible or if it could have been related to some transient(?) read/write issue.
That check is on by default because the UCSC genome browser doesn't (or didn't) support these characters in assembly hubs. If I disable the check by setting checkAssemblyHub="0" in the config, then my test runs through fine.
halStats em.hal --sequenceStats simMouse_chr6
SequenceName, Length, NumTopSegments, NumBottomSegments
873245.2_mBalMus1.pri.v3|NC_045788.1|144968589|0, 636262, 46692, 0
873245.2_mBalMus1.pri.v1|NC_045788.1|144968589|0, 850, 104, 0
873245.2_mBalMus1.pri.v4|NC_045788.1|144968589|0, 1250, 129, 0
But since the check is on by default, I don't know why it didn't complain for you. It's not in cactus_santiize_fasta_headers but slightly upstream when running cactus.
In any case, I do not know what caused your error, and suggest double checking your input file. But if you are sure it's cactus causing the problem, please send me the input so I can try to reproduce.
@macmanes did you figure this out? I'm also running into this issue aligning some ncbi genomes...
Hi All - having many jobs fail with this type of error, which seems to indicate something about a poorly formatted fasta file? One issue is that I can't seem to find the offending file GCF_011100685.1_UU_Cfam_GSD_1.0_5.fa. Not sure if this is related to some upstream error that was not handled properly. Any help here greatly appreciated!