cpockrandt / PhyloCSFpp

PhyloCSF++ computes PhyloCSF tracks for whole-genome multiple sequence alignments, scores single MSA, annotates CDS features in GFF/GTF files with PhyloCSF and confidence scores.

Error when running mmseqs createsubdb: sh: 1: Syntax error: ")" unexpected #10

Open marcasriv opened 3 years ago

marcasriv commented 3 years ago

Hi,

I'm interested in running PhyloCSF++ with annotate-with-mmseqs on Chinese hamster, but I am getting an error when it reaches the mmseqs createsubdb step:

./phylocsf++ annotate-with-mmseqs --threads 35 --output conservation species.txt 58mammals criGri1.refGene.gtf

Checking whether MMseqs2 is installed ...
Processing GFF /mnt/HDD2/conservation/criGri1.refGene.gtf
Created the genomesDB directory.
Created the cds directory.
Reading reference genome of GFF file /mnt/HDD2/conservation/fastas/criGri1.fa ...
Reading GFF file and extracting CDS coordinates ...
MMseqs2: Indexing genomes ...
MMseqs Version: 42bf6438fec1e1b987f46d8f6d4b09926ecfc019
Database type 0
Shuffle input database true
Createdb mode 0
Write lookup file 1
Offset of numeric ids 0
Compressed 0
Verbosity 3

Converting sequences [410465] 1m 2s 307ms
Time for merging to genbankseqs_h: 0h 0m 0s 74ms
Time for merging to genbankseqs: 0h 0m 43s 532ms
Database type: Nucleotide
Time for processing: 0h 1m 46s 799ms
bash -c $'mmseqs createsubdb <(awk \'$3 == 0\' /mnt/HDD2/conservation//genomesDB/genbankseqs.lookup) conservation//genomesDB/genbankseqs /mnt/HDD2/conservation//genomesDB/genbankseqs_0'
sh: 1: Syntax error: ")" unexpected

This is what the input species.txt file looks like:

chinese_hamster conservation/fastas/criGri1.fa
mouse conservation/fastas/Mus_musculus.GRCm39.dna.primary_assembly.fa
rat conservation/fastas/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa
human conservation/fastas/Homo_sapiens.GRCh38.dna.primary_assembly.fa
naked_mole_rat conservation/fastas/Heterocephalus_glaber_female.HetGla_female_1.0.dna.toplevel.fa
guinea_pig conservation/fastas/Cavia_porcellus.Cavpor3.0.dna.toplevel.fa
squirrel conservation/fastas/Ictidomys_tridecemlineatus.SpeTri2.0.dna.toplevel.fa
rabbit conservation/fastas/Oryctolagus_cuniculus.OryCun2.0.dna.toplevel.fa
pika conservation/fastas/Ochotona_princeps.OchPri2.0-Ens.dna.toplevel.fa

I downloaded the reference GTF and FASTA files from https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/genes/criGri1.refGene.gtf.gz and https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/criGri1.fa.gz, respectively.

Thanks so much,

Marina

cpockrandt commented 2 years ago

Hi Marina,

Thank you for trying out PhyloCSF++ and opening an issue! I made a fix and pushed it to the master branch. Can you try running it again with the latest commit? Let me know if you need help building PhyloCSF++ from source; I can also upload a statically linked binary here.

If the fix works for you, we will make a new release, update it on bioconda and distribute new binaries.

Christopher
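
For readers who hit the same message: the thread does not say what the fix changed, so the following is only a plausible reading. The logged command uses `<( ... )` process substitution and `$'...'` quoting, both of which are bash features. On Debian and Ubuntu, `sh` is typically dash, which supports neither, and that would be consistent with the `sh: 1: Syntax error` line. A portable way to get the same effect is to write the filtered rows to a regular file first. The sketch below uses a made-up three-column lookup file as a stand-in for genbankseqs.lookup; the file names and contents are hypothetical, and mmseqs itself is not invoked.

```shell
#!/bin/sh
# Hypothetical stand-in for genbankseqs.lookup: the third column is the value being filtered on.
tmpdir=$(mktemp -d)
printf '0\tseqA\t0\n1\tseqB\t1\n2\tseqC\t0\n' > "$tmpdir/demo.lookup"

# Portable replacement for <(awk ...): write the filtered rows to a regular file.
awk '$3 == 0' "$tmpdir/demo.lookup" > "$tmpdir/subset_0.txt"

# The filtered file can now be passed to another tool as an ordinary path argument.
subset_rows=$(wc -l < "$tmpdir/subset_0.txt" | tr -d ' ')
echo "rows with third column equal to 0: $subset_rows"
```

This runs unchanged under dash and bash because it relies only on POSIX shell syntax.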

marcasriv commented 2 years ago

Hi Christopher,

Thanks so much for your help and the fix! I rebuilt PhyloCSF++ with the latest commit and it now runs past that error. Unfortunately, I've run into a new problem. The program now crashes at (I believe) line 422 of _phylocsf++annotate_withmmseqs.hpp (same parameters and files as in my previous post):

mmseqs result2dnamsa conservation//cds/cds.index conservation//genomesDB/genbankseqs /conservation//aln/aln_all_tophit conservation//aln/msa --threads 40

MMseqs Version: 42bf6438fec1e1b987f46d8f6d4b09926ecfc019
Skip query false
Threads 40
Compressed 0
Verbosity 3
Query database size: 99405 type: Nucleotide
Target database size: 410501 type: Nucleotide
[=================================================================] 100.00% 99.40K 7m 13s 889ms
Time for merging to msa: 0h 0m 0s 216ms
Time for processing: 0h 7m 15s 116ms
MMseqs2: Score aligned CDS ...

terminate called after throwing an instance of 'std::length_error'
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
what(): terminate called recursively
terminate called recursively
terminate called recursively
Aborted (core dumped)

Thanks again,

Marina

cpockrandt commented 2 years ago

Can you give me the list of assemblies you used, so that we can try to reproduce this error?

marcasriv commented 2 years ago

Hi Christopher,

Sorry for the late reply. This is the list of fasta files I use:

https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/criGri1.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/mm39/bigZips/mm39.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/rn6/bigZips/rn6.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/hetGla2/bigZips/hetGla2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/cavPor3/bigZips/cavPor3.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/speTri2/bigZips/speTri2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/oryCun2/bigZips/oryCun2.fa.gz
https://hgdownload.soe.ucsc.edu/goldenPath/ochPri3/bigZips/ochPri3.fa.gz

and reference GTF:

https://hgdownload.soe.ucsc.edu/goldenPath/criGri1/bigZips/genes/criGri1.refGene.gtf.gz

Thanks,

Marina

cpockrandt commented 2 years ago

Hi Marina,

Thank you. We were able to reproduce the error and added a fix to the master branch. Before you run it again, please make sure to delete any temporary files left in the output directory by previous runs.

Christopher
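
One cautious way to clear out results from earlier runs is to move the old output directory aside instead of deleting it, so nothing is lost if it is still needed. The sketch below is illustrative only: `stash_old_output` is a hypothetical helper, not part of PhyloCSF++, and the demo directory layout merely reuses directory names (genomesDB, cds) that appear in the log output earlier in this thread.

```shell
#!/bin/sh
# Move a previous output directory aside so a rerun starts from a clean state.
# Prints the backup location, or nothing if there was no directory to move.
stash_old_output() {
    out_dir=${1%/}                  # drop one trailing slash, if present
    [ -d "$out_dir" ] || return 0   # nothing to do
    backup="$out_dir.old.$$"        # $$ is the shell's PID, used to keep the name distinct
    mv "$out_dir" "$backup" && echo "$backup"
}

# Demonstration on a throwaway directory (hypothetical layout).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/conservation/genomesDB" "$demo_root/conservation/cds"
stashed=$(stash_old_output "$demo_root/conservation/")
echo "old results moved to: $stashed"
```

Once the rerun succeeds, the stashed directory can be deleted by hand.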

marcasriv commented 2 years ago

Hi Christopher,

Thanks so much for your reply. I removed the previous installation of PhyloCSF++, cloned the latest version, reinstalled it, and deleted all files from the previous runs, but I'm still getting the same error at the same line of code. I also tried changing the location of the output directory, but no luck so far. Could something on my system be overriding the new install?

Marina

cpockrandt commented 2 years ago

Hi Marina,

I tried it on another system, and it works for me with the latest commit and the data set you listed above. You don't have to "install" PhyloCSF++ on your system. After running make, you can call the binary directly in the build directory with ./phylocsf++. That way you can be sure you are really using the latest build and not an outdated binary that might still be in the PATH.
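
A quick way to see whether an older copy is being picked up from the PATH is to ask the shell what it resolves the command name to. This is a minimal sketch; `resolve_binary` is a hypothetical helper name, and it only reports what `command -v` finds.

```shell
#!/bin/sh
# Print the path the shell resolves for a command name, or a note if there is none.
resolve_binary() {
    command -v "$1" 2>/dev/null || echo "not on PATH"
}

echo "PATH resolves phylocsf++ to: $(resolve_binary phylocsf++)"
# Calling ./phylocsf++ from inside the build directory bypasses PATH lookup entirely.
```

If the printed path points somewhere other than the fresh build directory, that copy is the one a bare `phylocsf++` call would run.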