ParkinsonLab / MetaPro


no taxid_trees file #20

Open hughit32 opened 8 months ago

hughit32 commented 8 months ago

MetaPro is missing the taxid_trees file, or whatever file is supposed to be in that folder. The lib_downloader.py script seems to have completed successfully, but no file or folder with that name was created. Can you tell me where this file can be found? Thank you!

billytaj commented 8 months ago

I'll look into it. Meanwhile: https://compsysbio.org/metapro_libs/taxid_trees/
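If you want to grab them by hand in the meantime, something like this should work (DB_PATH is just a placeholder for wherever your Config.ini database_path points; family_tree.tsv and class_tree.tsv are two of the trees in that folder):

```bash
# Placeholder path -- set this to the database folder your Config.ini uses.
DB_PATH=/path/to/databases
mkdir -p "$DB_PATH/taxid_trees"

# family_tree.tsv is the pipeline default; class_tree.tsv is what the sample Config.ini references.
wget -P "$DB_PATH/taxid_trees" \
    https://compsysbio.org/metapro_libs/taxid_trees/family_tree.tsv \
    https://compsysbio.org/metapro_libs/taxid_trees/class_tree.tsv
```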

hughit32 commented 8 months ago

Thanks for sending the link to the taxid_trees files. It's still not clear which of these should be used, though. The value in the supplied Config.ini file is class_tree.tsv, and when the taxid_tree line is removed from Config.ini, the default is family_tree.tsv. When I used family_tree.tsv, the error about the missing file no longer occurs, but another error occurs shortly after, which kills the whole pipeline:

```
2024-01-27 02:38:40.774855 Running GA lib check
2024-01-27 02:38:40.774946 BWA DB check: /temp/smallMetaproOutput/GA_pre_scan/final_results
2024-01-27 02:38:40.775267 Error: no fasta files found. BWA only accepts .fasta extensions
empty BWA database
```

This was happening before I supplied MetaPro with the taxid_tree file, and I assumed it was caused by the missing taxid_tree error. But with that error resolved, I'm still getting this missing fasta file error. It appears to come from a failed previous step, since the expected file is supposed to be in the output folder rather than a designated database file. But there are no errors displayed from previous steps now. Am I using the wrong taxid_tree file, or must all 5 of them be specified somehow? Thanks!

billytaj commented 8 months ago

You'll have to redo the GA_pre_scan step.

That phase is used to pre-fetch taxa to gather a curated database.

To redo the GA_pre_scan step:
1) edit bypass_log.txt and remove the entry that says "ga_pre_scan"
2) delete the folder named GA_pre_scan from your output directory.
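In shell terms, that's roughly the following (assuming bypass_log.txt sits at the top of your output folder; adjust OUTPUT to your own path):

```bash
OUTPUT=/temp/smallMetaproOutput   # your MetaPro output folder

# 1) drop the ga_pre_scan entry from the bypass log so the stage is no longer skipped
#    (assumes bypass_log.txt lives directly in the output folder)
sed -i '/ga_pre_scan/d' "$OUTPUT/bypass_log.txt"

# 2) remove the stale stage folder so it gets rebuilt on the next run
rm -rf "$OUTPUT/GA_pre_scan"
```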

Let me know if this works.
PS: I'm keeping this issue open as a reminder; there will be a permanent fix for it in V4, coming soon.

hughit32 commented 8 months ago

OK, thanks. I tried this again and unfortunately got the same result. Here's the console output, minus the initialization checks:

```
MetaPro operating in auto-mode
Reads: /temp/small.fastq
Output filepath: /temp/smallMetaproOutput
job path: /temp/smallMetaproOutput/quality_filter
2024-01-29 20:51:28.842356 bypassing: quality_filter
2024-01-29 20:51:28.842378 skipping job: quality_filter quality filter: 0.0 s quality filter cleanup: 0.0 s
2024-01-29 20:51:28.842413 continuing from: quality_filter
job path: /temp/smallMetaproOutput/vector_filter
2024-01-29 20:51:28.842696 bypassing: vector_filter
2024-01-29 20:51:28.842716 skipping job: vector_filter vector filter: 0.0 s vector filter cleanup: 0.0 s
2024-01-29 20:51:28.842748 continuing from: vector_filter
2024-01-29 20:51:28.842857 bypassing: rRNA_filter rRNA filter: 0.0 s rRNA filter cleanup: 0.0 s
2024-01-29 20:51:28.842897 continuing from: rRNA_filter
2024-01-29 20:51:28.843001 bypassing: duplicate_repopulation repop: 0.0 s repop cleanup: 0.0 s
2024-01-29 20:51:28.843040 continuing from: duplicate_repopulation
2024-01-29 20:51:28.843155 bypassing: assemble_contigs
2024-01-29 20:51:28.843204 MGM OK. contigs present assemble contigs: 0.0 s assemble contigs cleanup: 0.0 s
2024-01-29 20:51:28.843245 continuing from: assemble_contigs
2024-01-29 20:51:28.843347 running: GA_pre_scan
GA_pre_scan/data/jobs/mp_ta_centrifuge_readstigs job submitted. mem: 41.43948046875 GB GB
Kraken2 on singletons
GA_pre_scan/data/jobs/mp_ta_centrifuge_contigsads job submitted. mem: 41.4370546875 GB
2024-01-29 20:51:28.857016 closing down processes: 4ob submitted. mem: 41.435875 GB
Kraken2 on contigs8.857069 closed down: 0/4
centrifuge on reads
Loading database information...Loading database information...centrifuge on contigs
report file /temp/smallMetaproOutput/GA_pre_scan/data/2_centrifuge/raw_contigs.txt
Number of iterations in EM algorithm: 0
Probability diff. (P - P_prev) in the last iteration: 0
Calculating abundance: 00:00:00
done. done.
8611 sequences (1.68 Mbp) processed in 0.119s (4339.4 Kseq/m, 845.14 Mbp/m).
574 sequences classified (6.67%)
8037 sequences unclassified (93.33%)
2764739 sequences (396.53 Mbp) processed in 2.031s (81667.0 Kseq/m, 11713.15 Mbp/m).
85678 sequences classified (3.10%)
2679061 sequences unclassified (96.90%)
report file /temp/smallMetaproOutput/GA_pre_scan/data/2_centrifuge/reads.txt
Number of iterations in EM algorithm: 5
Probability diff. (P - P_prev) in the last iteration: 6.61514e-11
Calculating abundance: 00:00:00
2024-01-29 20:52:03.990365 closing down processes: 1d. mem: 41.262828125 GB
merging kraken2 reports406 closed down: 0/1
2024-01-29 20:52:04.096440 closing down processes: 1tted. mem: 41.263078125 GB
combining all centrifuge resultsd down: 0/1
2024-01-29 20:52:04.189804 running: TA_wevote_combineitted. mem: 41.24258984375 GB
2024-01-29 20:52:04.189855 closing down processes: 1
combining classification outputs for wevote
Running WEVOTE
gathering WEVOTE results
2024-01-29 20:53:01.325260 running: ga_collect_dbitted. mem: 41.0637734375 GB
2024-01-29 20:53:01.325291 closing down processes: 1
GA pre-scan get libs325325 closed down: 0/1
2024-01-29 20:53:08.987157 running: ga_assemble_dbitted. mem: 41.08355859375 GB
2024-01-29 20:53:08.987203 closing down processes: 1
GA assemble libs:08.987232 closed down: 0/1
2024-01-29 20:53:09.041072 continuing from: GA_pre_scan
2024-01-29 20:53:09.041410 running: GA_split
2024-01-29 20:53:09.041446 splitting contigs
2024-01-29 20:53:09.048105 closing down processes: 2
splitting fasta for contigsclosed down: 0/2
splitting fastq for singletons GA
2024-01-29 20:53:13.661690 continuing from: GA_split
2024-01-29 20:53:13.661788 Running GA lib check
2024-01-29 20:53:13.661870 BWA DB check: /temp/smallMetaproOutput/GA_pre_scan/final_results
2024-01-29 20:53:13.662173 Error: no fasta files found. BWA only accepts .fasta extensions
empty BWA database
```

Thanks.

billytaj commented 8 months ago

Your issue is something else: there's a special version of the ChocoPhlAn database that MetaPro draws from in this version: http://compsysbio.org/metapro_libs/

At that URL, there are choco_h3_family/genus/order folders.

Those are the same ChocoPhlAn DB, just clustered at different taxa levels.
The GA_pre_scan step draws on that cluster to assemble your GA database.

There are bypasses: you could put a gene database inside /temp/smallMetaproOutput/GA_pre_scan/final_results, bwa-index it, and move on, or you could make sure the config points to one of the clusters.
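For the first option, roughly (my_gene_db.fasta is just a placeholder for whatever gene database you drop in; BWA only picks up .fasta files in that folder):

```bash
FINAL=/temp/smallMetaproOutput/GA_pre_scan/final_results
mkdir -p "$FINAL"

# placeholder name -- copy in whichever gene database you want the GA step to align against
cp /path/to/my_gene_db.fasta "$FINAL/"

# build the BWA index next to the fasta so the GA lib check finds a ready database
bwa index "$FINAL/my_gene_db.fasta"
```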

hughit32 commented 8 months ago

Thanks for your message, and sorry for the delay in circling back to this. I made sure the config file points to path/to/outputFile/choco_h3_family for DNA_DB and to path/to/outputFile/taxid_trees/family_tree.tsv for taxid_tree. I still get the same error as above.

The problem is that the pipeline finds nothing in path/to/outputFile/GA_pre_scan/final_results. When I looked at the path/to/outputFile/GA_pre_scan/ga_collect_db.sh script built by the pipeline, it has /project/j/jparkin/Lab_Databases/family_llbs in the sixth argument position. When I looked at the code for create_GA_pre_scan_command() in MetaPro_commands.py, it gets that value from self.tool_path_obj.source_taxa_DB. There is no entry for source_taxa_DB in the Config.ini file, so I believe it is defaulting to a folder on your personal system. Since it doesn't find that folder on my system, it silently skips the step and then throws an error downstream when it finds the final_results folder empty.

If all this is correct, it seems like I need something to point source_taxa_DB to. Apologies if I have this wrong or I'm overlooking something. Thanks!
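My guess, in case it helps anyone else, is that the fix is a [Databases] entry along these lines, with the path pointing at one of the downloaded ChocoPhlAn clusters (the key name is taken from the code reference above; the path is just illustrative):

```ini
[Databases]
; guessed entry -- source_taxa_db pointing at the downloaded choco_h3_family cluster
source_taxa_db = /path/to/outputFile/choco_h3_family
```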

tkcaccia commented 6 months ago

I have met the same issue: the taxid_tree files are missing. I also cannot reach the website https://compsysbio.org/metapro_libs on 14 April.

An additional note: the database download pipeline does not download the Kaiju database.

Thank you in advance.

billytaj commented 6 months ago


The server was down due to a power outage on April 14th. It's back up and running now.

tmattes8 commented 6 months ago

I am writing because the pipeline is stuck in the same place hughit32 described in their Feb 19 comment. I can confirm that the ga_collect_db.sh script built by the pipeline has /project/j/jparkin/Lab_Databases/family_llbs in the sixth argument position.

Any suggestions on how to alter the Config.ini file to help us get past this point in the pipeline?

Edit: upon running again, I see in my output:

`source_taxa_db no inner section found. using default /project/j/jparkin/Lab_Databases/family_llbs`

How/where do I point to the source_taxa_db inner section?

Edit2: Once again, as soon as I give up and post on GitHub, I figure something out. For those interested, I edited Config.ini to give source_taxa_db a path to the ChocoPhlAn family group folder in my libraries. That got me past the roadblock, but I'm not sure whether all I get is family classifications from here on out. The ChocoPhlAn database is broken into three folders. Do we need to run separately to get class and genus?

Thanks,

billytaj commented 6 months ago


Can you please paste your config?

tmattes8 commented 6 months ago


Here is the relevant portion of the Config.ini:

```ini
[Databases]
database_path = /Shared/lss_tmattes/Metapro
UniVec_Core = %(database_path)s/univec_core/UniVec_Core.fasta
Adapter = %(database_path)s/Trimmomatic_adapters/TruSeq3-PE-2.fa
Host = %(database_path)s/human_genome/human_genome.fasta
Rfam = %(database_path)s/Rfam/Rfam.cm
DNA_DB = %(database_path)s/family_group
source_taxa_db = %(database_path)s/family_group
Prot_DB = %(database_path)s/nr/nr
Prot_DB_reads = %(database_path)s/nr/nr
accession2taxid = %(database_path)s/accession2taxid/accession2taxid
nodes = %(database_path)s/WEVOTE_db/nodes_wevote.dmp
names = %(database_path)s/WEVOTE_db/names_wevote.dmp
Kaiju_db = %(database_path)s/kaiju_db/kaiju_db_nr.fmi
Centrifuge_db = %(database_path)s/centrifuge_db/nt
SWISS_PROT = %(database_path)s/swiss_prot_db/swiss_prot_db
SWISS_PROT_map = %(database_path)s/swiss_prot_db/SwissProt_EC_Mapping.tsv
PriamDB = %(database_path)s/PRIAM_db/
DetectDB = %(database_path)s/DETECTv2
WEVOTEDB = %(database_path)s/WEVOTE_db/
EC_pathway = %(database_path)s/EC_pathway/EC_pathway.txt
path_to_superpath = %(database_path)s/path_to_superpath/pathway_to_superpathway.csv
MetaGeneMark_model = /pipeline_tools/mgm/MetaGeneMark_v1.mod
taxid_tree = %(database_path)s/taxid_trees/family_tree.tsv
kraken2_db = %(database_path)s/kraken2
```

tmattes8 commented 6 months ago

The pipeline is now stuck at the DIAMOND step. DIAMOND jobs get submitted, but the pipeline keeps getting killed before they finish. I suspect it's a memory issue, or that my HPC job scheduler is getting overwhelmed with 40 DIAMOND jobs, but I don't know yet. Thought I would throw it out there in case you had any ideas.

billytaj commented 6 months ago

DIAMOND is notoriously slow. MetaPro does its best to push through as much as a cluster node will allow, but it's all at the mercy of your compute environment's specs and your data.

tmattes8 commented 5 months ago


Yes, I throttled the number of DIAMOND jobs submitted back to 5 at a time and the pipeline continued without issues. There's probably room to push it a little higher. Thanks,

billytaj commented 5 months ago


There's supposed to be a memory analyzer, but it's not perfect (it measures memory usage in discrete timeslices to make sure your system doesn't OOM). Will revisit when there's time.
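For the curious, the idea behind it is roughly this kind of polling loop (a simplified sketch, not MetaPro's actual code): sample available memory at fixed intervals and hold off on launching the next job while free memory sits below a threshold.

```bash
# Simplified illustration of timesliced memory checking -- not MetaPro's actual code.
MIN_FREE_GB=10   # threshold below which no new jobs are launched
INTERVAL=5       # seconds between samples

wait_for_memory() {
    while true; do
        # MemAvailable is reported in kB; convert to whole GB
        free_gb=$(awk '/MemAvailable/ {printf "%d", $2/1024/1024}' /proc/meminfo)
        if [ "$free_gb" -ge "$MIN_FREE_GB" ]; then
            break
        fi
        sleep "$INTERVAL"
    done
}

# e.g. call wait_for_memory before submitting each DIAMOND job
```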