ParkinsonLab / MetaPro

GNU General Public License v3.0

GA_pre_scan results folder empty #24

Closed tkcaccia closed 3 months ago

tkcaccia commented 6 months ago

I am having trouble completing the pipeline. The GA_pre_scan step does not produce any files in the final_results folder, and the pipeline then stops at the GA_split step. Can you please help me identify the error?

The output is below.

2024-04-18 09:01:21.514892 continuing from: assemble_contigs
2024-04-18 09:01:21.518869 running: GA_pre_scan
2024-04-18 09:01:21.548006 mp_ta_kraken2_singletons job submitted. mem: 375.48778515625 GB
2024-04-18 09:01:21.560298 mp_ta_kraken2_paired job submitted. mem: 375.4842890625 GB
Kraken2 on singletons
Kraken2 on paired
2024-04-18 09:01:21.573631 mp_ta_kraken2_contigs job submitted. mem: 375.4856484375 GB
GA_pre_scan/data/jobs/mp_ta_centrifuge_reads
Kraken2 on contigs
2024-04-18 09:01:21.600288 mp_ta_centrifuge_reads job submitted. mem: 375.482765625 GB
GA_pre_scan/data/jobs/mp_ta_centrifuge_contigs
centrifuge on reads
Loading database information...
Loading database information...
Loading database information...
centrifuge on contigs
done.
done.
done.
15475 sequences (12.36 Mbp) processed in 0.600s (1547.2 Kseq/m, 1235.55 Mbp/m).
15377 sequences classified (99.37%)
98 sequences unclassified (0.63%)
41252 sequences (12.02 Mbp) processed in 0.774s (3195.9 Kseq/m, 931.33 Mbp/m).
40639 sequences classified (98.51%)
613 sequences unclassified (1.49%)
677476 sequences (78.17 Mbp) processed in 0.870s (46697.5 Kseq/m, 5388.09 Mbp/m).
628361 sequences classified (92.75%)
49115 sequences unclassified (7.25%)
report file /scratch/t0065634/Microbiome/output_batch2/LPC0010_S8/GA_pre_scan/data/2_centrifuge/raw_contigs.txt
Number of iterations in EM algorithm: 4
Probability diff. (P - P_prev) in the last iteration: 3.70532e-11
Calculating abundance: 00:00:00
report file /scratch/t0065634/Microbiome/output_batch2/LPC0010_S8/GA_pre_scan/data/2_centrifuge/reads.txt
Number of iterations in EM algorithm: 13
Probability diff. (P - P_prev) in the last iteration: 8.45475e-11
Calculating abundance: 00:00:00
2024-04-18 09:01:21.618062 mp_ta_centrifuge_contigs job submitted. mem: 375.47983984375 GB
2024-04-18 09:01:21.619364 closing down processes: 5
2024-04-18 09:01:21.619401 closed down: 0/5
2024-04-18 09:03:09.809845 closed down: 1/5
2024-04-18 09:03:09.809963 closed down: 2/5
2024-04-18 09:03:09.810030 closed down: 3/5
2024-04-18 09:13:37.616210 closed down: 4/5
merging kraken2 reports
2024-04-18 09:13:37.622425 TA_kraken2_pp job submitted. mem: 375.4827734375 GB
2024-04-18 09:13:37.623675 closing down processes: 1
2024-04-18 09:13:37.623712 closed down: 0/1
combining all centrifuge results
2024-04-18 09:13:37.938608 TA_centrifuge_pp job submitted. mem: 375.48255078125 GB
2024-04-18 09:13:37.940008 closing down processes: 1
2024-04-18 09:13:37.940046 closed down: 0/1
combining classification outputs for wevote
Running WEVOTE
gathering WEVOTE results
2024-04-18 09:13:38.094341 TA_wevote_combine job submitted. mem: 375.48346484375 GB
2024-04-18 09:13:38.095641 running: TA_wevote_combine
2024-04-18 09:13:38.095690 closing down processes: 1
2024-04-18 09:13:38.095718 closed down: 0/1
GA pre-scan get libs
2024-04-18 09:15:58.784956 ga_collect_db job submitted. mem: 375.4834921875 GB
2024-04-18 09:15:58.786435 running: ga_collect_db
2024-04-18 09:15:58.786477 closing down processes: 1
2024-04-18 09:15:58.786506 closed down: 0/1
GA assemble libs
2024-04-18 09:16:08.826043 ga_assemble_db job submitted. mem: 375.48344140625 GB
2024-04-18 09:16:08.827014 running: ga_assemble_db
2024-04-18 09:16:08.827046 closing down processes: 1
2024-04-18 09:16:08.827063 closed down: 0/1
2024-04-18 09:16:08.934087 continuing from: GA_pre_scan
2024-04-18 09:16:08.938664 running: GA_split
2024-04-18 09:16:08.938700 splitting contigs
splitting fasta for contigs
splitting fastq for singletons GA
splitting fastq for pair_1 GA
splitting fastq for pair_2 GA
2024-04-18 09:16:09.008651 closing down processes: 4
2024-04-18 09:16:09.008748 closed down: 0/4
2024-04-18 09:16:11.656524 closed down: 1/4
2024-04-18 09:16:11.656631 closed down: 2/4
2024-04-18 09:16:11.656673 closed down: 3/4
2024-04-18 09:16:13.681369 continuing from: GA_split
2024-04-18 09:16:13.681450 Running GA lib check
2024-04-18 09:16:13.681531 BWA DB check: /scratch/t0065634/Microbiome/output_batch2//LPC0010_S8/GA_pre_scan/final_results
2024-04-18 09:16:13.686604 Error: no fasta files found. BWA only accepts .fasta extensions
empty BWA database
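
The final error says the BWA database check found no .fasta files in GA_pre_scan/final_results. A minimal Python sketch for confirming that by hand before rerunning is shown here. It is not MetaPro's own check: the function name `list_fasta_files` is hypothetical, and the only thing taken from the log is that files with a `.fasta` extension are expected in that folder.

```python
from pathlib import Path


def list_fasta_files(db_dir):
    """Return the .fasta files directly inside db_dir, sorted by name.

    Hypothetical helper, not part of MetaPro. A missing folder is treated
    the same as an empty one and yields an empty list.
    """
    folder = Path(db_dir)
    if not folder.is_dir():
        return []
    # Only the exact ".fasta" suffix counts, mirroring the error message
    # "BWA only accepts .fasta extensions".
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix == ".fasta")
```

Calling `list_fasta_files(".../GA_pre_scan/final_results")` on the sample's output folder and getting an empty list would match the "empty BWA database" message above.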

billytaj commented 6 months ago

What does your config look like? Are all databases downloaded?

tkcaccia commented 6 months ago

I realized the script lib_downloader.py did not download all libraries, so I downloaded the missing ones again. The output showed that all libraries were found:

UniVec_Core found! using: /scratch/alphafold/MetaPro/univec_core/UniVec_Core.fasta
Adapter found! using: /scratch/alphafold/MetaPro/trimmomatic_adapters/TruSeq3-PE-2.fa
Host found! using: /scratch/alphafold/MetaPro/human_genome/human_genome.fasta
Rfam found! using: /scratch/alphafold/MetaPro/Rfam/Rfam.cm
DNA_DB found! using: /scratch/alphafold/MetaPro/family_group
source_taxa_db no inner section found. using default /project/j/jparkin/Lab_Databases/family_llbs
Prot_DB found! using: /scratch/alphafold/MetaPro/nr/nr
Prot_DB_reads found! using: /scratch/alphafold/MetaPro/nr/nr
accession2taxid found! using: /scratch/alphafold/MetaPro/accession2taxid/accession2taxid
nodes found! using: /scratch/alphafold/MetaPro/WEVOTE_db/nodes_wevote.dmp
names found! using: /scratch/alphafold/MetaPro/WEVOTE_db/names_wevote.dmp
Kaiju_db found! using: /scratch/alphafold/MetaPro/kaiju_db/kaiju_db_nr.fmi
Centrifuge_db found! using: /scratch/alphafold/MetaPro/centrifuge_db/nt
SWISS_PROT found! using: /scratch/alphafold/MetaPro/swiss_prot_db/swiss_prot_db
SWISS_PROT_map found! using: /scratch/alphafold/MetaPro/swiss_prot_db/SwissProt_EC_Mapping.tsv
PriamDB found! using: /scratch/alphafold/MetaPro/PRIAM_db/
DetectDB found! using: /scratch/alphafold/MetaPro/DETECTv2
WEVOTEDB found! using: /scratch/alphafold/MetaPro/WEVOTE_db/
EC_pathway found! using: /scratch/alphafold/MetaPro/EC_pathway/EC_pathway.txt
path_to_superpath found! using: /scratch/alphafold/MetaPro/path_to_superpath/pathway_to_superpathway.csv
MetaGeneMark_model found! using: /pipeline_tools/mgm/MetaGeneMark_v1.mod
enzyme_db no inner section found. using default /pipeline/custom_databases/FREQ_EC_pairs_3_mai_2020.txt
taxid_tree found! using: /scratch/alphafold/MetaPro/taxid_trees/class_tree.tsv
kraken2_db found! using: /scratch/alphafold/MetaPro/kraken2_db

The pipeline stopped at GA_split, but I noticed that the results folder in GA_pre_scan was empty, so I manually removed these folders and removed GA_split and GA_pre_scan from bypass_log.txt.

How can I identify where the problem is?

billytaj commented 6 months ago

If you need to dive into the code, all steps create a shell script for their specific section. You could run the shell script for that step manually to see where the system is stalling.
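
A rough Python sketch of that manual approach follows. It assumes the job scripts for a step live under `<output>/<step>/data/jobs`, which is inferred only from the path `GA_pre_scan/data/jobs/mp_ta_centrifuge_reads` in the log above; the function names are hypothetical, and running the scripts with `sh` is an assumption about how they are meant to be invoked.

```python
import subprocess
from pathlib import Path


def find_job_scripts(output_dir, step):
    """List the files under <output_dir>/<step>/data/jobs, sorted by name.

    The directory layout is an assumption based on the paths printed in
    the log; returns an empty list if the folder does not exist.
    """
    jobs_dir = Path(output_dir) / step / "data" / "jobs"
    if not jobs_dir.is_dir():
        return []
    return sorted(p for p in jobs_dir.iterdir() if p.is_file())


def run_job_script(script):
    """Run one job script with sh and return (exit_code, stdout, stderr).

    Capturing both streams makes it easier to see which command inside
    the script fails or produces no output.
    """
    result = subprocess.run(["sh", str(script)], capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

For example, looping over `find_job_scripts(output_dir, "GA_pre_scan")` and printing the result of `run_job_script` for each entry would show the exit code and stderr of every job in that step.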

tkcaccia commented 6 months ago

The script does not stall. No FASTA files are produced in GA_pre_scan

billytaj commented 6 months ago

So, the config says it can't find your source taxa db. GA_pre_scan relies on these taxid trees we made: https://compsysbio.org/metapro_libs/taxid_trees/ These trees link every taxon found in chocophlan to its higher-order rollups.

Your run is missing these tables.
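
Entries like this can be easy to miss in a long config report. The sketch below scans the report text for the "no inner section found. using default" message that appears in the output posted earlier in the thread (there it is printed for source_taxa_db and enzyme_db). The function name and regular expression are illustrative assumptions, not MetaPro code, and the message wording is taken only from that output.

```python
import re

# Matches "<key> no inner section found. using default <path>" anywhere in
# the text, so it works whether the report is one long line or many lines.
FALLBACK_PATTERN = re.compile(r"(\S+) no inner section found\. using default (\S+)")


def find_fallback_entries(report_text):
    """Return {key: default_path} for every entry that fell back to a default.

    Hypothetical helper for reading the config report; keys that were
    reported as "found!" are ignored.
    """
    return {m.group(1): m.group(2) for m in FALLBACK_PATTERN.finditer(report_text)}
```

Any key returned here was not read from the user's config, so its default path should be checked to see whether it exists on the machine running the pipeline.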

Gabe-BioUSD commented 4 months ago

Hi billytaj, I am having the same issue. At first I had only the class_tsv file, but following your reply above I got the other taxid tree files. However, the pipeline still ended with this error: ~/Outs/GA_pre_scan/final_results 2024-06-18 04:50:47.953054 Error: no fasta files found. BWA only accepts .fasta extensions empty BWA database. tkcaccia, did you resolve the problem? Thanks.

billytaj commented 4 months ago

This error is a warning that the pre-scan didn't function properly.
It's supposed to taxa-scan your cleaned reads and populate a customized subset of the chocophlan database. There are ways to bypass it if you want.

Gabe-BioUSD commented 4 months ago

Could you point us to how we can bypass that? Thanks.


billytaj commented 3 months ago

In your config, under the Databases heading, add DNA_DB_override = True
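
For anyone who prefers to script that change, here is a minimal sketch using Python's standard `configparser`. It assumes the MetaPro config is an INI-style file with a `[Databases]` section, which the thread suggests but does not confirm; the function name is hypothetical. Note that `configparser` drops comments when it rewrites a file, so editing the file by hand is safer if the config is annotated.

```python
import configparser


def enable_dna_db_override(config_path):
    """Set DNA_DB_override = True under [Databases] in an INI-style config.

    Assumes the config uses a [Databases] section (an assumption based on
    the comment above). Rewrites the file in place.
    """
    # interpolation=None keeps any literal '%' in paths from being interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve key case such as DNA_DB
    parser.read(config_path)
    if not parser.has_section("Databases"):
        parser.add_section("Databases")
    parser.set("Databases", "DNA_DB_override", "True")
    with open(config_path, "w") as handle:
        parser.write(handle)
```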