biologger / speciesprimer

The SpeciesPrimer pipeline is intended to help researchers find specific primer pairs for the detection and quantification of bacterial species in complex ecosystems.
GNU General Public License v3.0

Single results file not created #24

Open JensPee opened 1 year ago

JensPee commented 1 year ago

Hi,

When I run or rerun speciesprimer (in a Docker container with 15.51 GB RAM allocated to it), a single results file is not successfully created. I cannot determine from the logs what the problem is. Any help would be appreciated. (I noticed that the BLAST DB is 250 GB, not 60 GB, and I don't know why. Is this maybe part of the problem?)

Settings are as follows:

{'blastseqs': 500, 'skip_tree': False, 'minsize': 75, 'path': '/primerdesign', 'mfethreshold': 90, 'nolist': False, 'ignore_qc': False, 'maxsize': 150, 'probe': False, 'offline': False, 'nontargetlist': [...], 'assemblylevel': ['complete'], 'skip_download': False, 'target': 'Azotobacter_chroococcum', 'intermediate': False, 'qc_gene': ['rRNA'], 'exception': [], 'mpprimer': -3.5, 'blastdbv5': False, 'customdb': None, 'mfold': -3.0}

The following problem shows up in the logs (speciesprimer_2023_06_25.log):

Run: run_blast - Start BLAST
27 Jun 2023 05:18:42: Run blastn -task blastn-short -num_threads 4 -query primer.part-0 -evalue 500 -out primer_0_results.xml -outfmt 5 -db nt
27 Jun 2023 14:41:00: Run blastn -task blastn-short -num_threads 4 -query primer.part-1 -evalue 500 -out primer_1_results.xml -outfmt 5 -db nt
27 Jun 2023 23:47:50: Run blastn -task blastn-short -num_threads 4 -query primer.part-2 -evalue 500 -out primer_2_results.xml -outfmt 5 -db nt
28 Jun 2023 09:20:30: Run blastn -task blastn-short -num_threads 4 -query primer.part-3 -evalue 500 -out primer_3_results.xml -outfmt 5 -db nt
28 Jun 2023 18:47:13: Run blastn -task blastn-short -num_threads 4 -query primer.part-4 -evalue 500 -out primer_4_results.xml -outfmt 5 -db nt
29 Jun 2023 03:47:32: Run blastn -task blastn-short -num_threads 4 -query primer.part-5 -evalue 500 -out primer_5_results.xml -outfmt 5 -db nt
29 Jun 2023 13:16:20: Run blastn -task blastn-short -num_threads 4 -query primer.part-6 -evalue 500 -out primer_6_results.xml -outfmt 5 -db nt
29 Jun 2023 22:27:04: Run blastn -task blastn-short -num_threads 4 -query primer.part-7 -evalue 500 -out primer_7_results.xml -outfmt 5 -db nt
30 Jun 2023 07:32:29: > Blast duration: 3 days, 2:13:47
30 Jun 2023 07:32:29: Run: run_blastparser(Azotobacter_chroococcum), primer
30 Jun 2023 07:32:29: Run: blast_parser
30 Jun 2023 07:32:29: Run: blastresults_files(Azotobacter_chroococcum)
30 Jun 2023 07:32:46: > A problem with the BLAST results file /primerdesign/Azotobacter_chroococcum/Pangenome/results/primer/primerblast/primer_4_results.xml was detected.
Please check if the file was removed and start the run again
30 Jun 2023 07:32:46: ['fatal error while working on', 'Azotobacter_chroococcum', 'check logfile', '/primerdesign/speciesprimer_2023_06_25.log']
fatal error while working on Azotobacter_chroococcum
Traceback (most recent call last):
  File "/pipeline/speciesprimer.py", line 4168, in main
    run_pipeline_for_target(target, config)
  File "/pipeline/speciesprimer.py", line 4082, in run_pipeline_for_target
    config, primer_dict).run_primer_qc()
  File "/pipeline/speciesprimer.py", line 3537, in run_primer_qc
    self.call_blastparser.run_blastparser("primer")
  File "/pipeline/speciesprimer.py", line 2588, in run_blastparser
    align_dict = self.blast_parser(self.primerblast_dir)
  File "/pipeline/speciesprimer.py", line 2518, in blast_parser
    align_dict = self.bp_parse_xml_files(blast_dir)
  File "/pipeline/speciesprimer.py", line 2485, in bp_parse_xml_files
    blastrecords = self.parse_BLASTfile(filename)
  File "/pipeline/speciesprimer.py", line 2155, in parse_BLASTfile
    record_list = list(blast_records)
  File "/usr/local/lib/python3.5/dist-packages/Bio/Blast/NCBIXML.py", line 824, in parse
    expat_parser.Parse(NULL, True)  # End of XML record
xml.parsers.expat.ExpatError: no element found: line 3874641, column 0
30 Jun 2023 07:32:46: > Error report:
30 Jun 2023 07:32:46: > for target Azotobacter_chroococcum
30 Jun 2023 07:32:46: > Error 1:
30 Jun 2023 07:32:46: > A problem with the BLAST results file /primerdesign/Azotobacter_chroococcum/Pangenome/results/primer/primerblast/primer_4_results.xml was detected. Please check if the file was removed and start the run again
30 Jun 2023 07:32:46: > for target Azotobacter_chroococcum
30 Jun 2023 07:32:46: > Error 2:
30 Jun 2023 07:32:46: > fatal error while working on Azotobacter_chroococcum check logfile /primerdesign/speciesprimer_2023_06_25.log

I attached the broken file 4 and a working file 3 for comparison, renamed to .txt so that GitHub will let me upload them: primer.part-4.txt, primer.part-3.txt
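
An expat "no element found" error at the very end of a file usually means the XML stopped before its closing tags, which is consistent with a result file that was cut off while being written. The following is a minimal sketch for checking which result files in the primerblast directory are incomplete before re-running. It assumes the files are named primer_*_results.xml as in the log above; is_complete_xml and find_truncated_blast_xml are hypothetical helper names and are not part of SpeciesPrimer.

```python
import glob
import os
import xml.parsers.expat


def is_complete_xml(path):
    """Return True if the file is well-formed XML from start to end.

    The file is streamed through expat, so no document tree is built
    and memory use stays small even for very large result files.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as handle:
            parser.ParseFile(handle)
    except xml.parsers.expat.ExpatError:
        # Raised for truncated files ("no element found") and any
        # other well-formedness problem.
        return False
    return True


def find_truncated_blast_xml(blast_dir, pattern="primer_*_results.xml"):
    """List result files in blast_dir that fail the well-formedness check.

    The file name pattern is an assumption taken from the log output.
    """
    paths = sorted(glob.glob(os.path.join(blast_dir, pattern)))
    return [path for path in paths if not is_complete_xml(path)]
```

Calling find_truncated_blast_xml on the primerblast directory named in the error message would be expected to list primer_4_results.xml if that file is the only incomplete one.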

biologger commented 1 year ago

Hi,

From the log it looks like the BLAST output file is incomplete. This may be due to a large number of results and not enough RAM, even though 15 GB should be enough. There is a chance that it would work if you reduce blastseqs to a value below 500. You could remove the primerblast directory, set blastseqs in the configuration to a value below 500 and re-run the pipeline.

Another option may be to use the ref_prok_rep_genomes database, as it has far less sequence redundancy.

As for the size of your current nt database: it has grown a lot in recent years, so the actual size seems legitimate.

Please tell me whether it works or not. I may need to change the output of the BLAST results from .xml to .csv/.txt, because there I can select the actual data (columns) that are written to the output file, and this may reduce the required RAM.

Cheers
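
To illustrate the idea of selecting columns, here is a rough sketch of what a tabular BLAST run and a streaming reader could look like. It reuses the blastn flags from the log and only swaps -outfmt 5 for -outfmt 6 with an explicit column list. The column selection, blastn_tabular_args and iter_hits are assumptions made for illustration; this is not how the pipeline currently works, and its parser would have to be adapted to use such output.

```python
import csv

# Hypothetical column selection. Each name is a standard blastn
# format specifier for tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "evalue"]


def blastn_tabular_args(query, out, db="nt", threads=4):
    """Build a blastn argument list like the one in the log, but with
    tab-separated output (-outfmt 6) limited to COLUMNS."""
    return [
        "blastn", "-task", "blastn-short",
        "-num_threads", str(threads),
        "-query", query,
        "-evalue", "500",
        "-out", out,
        "-outfmt", "6 " + " ".join(COLUMNS),
        "-db", db,
    ]


def iter_hits(handle):
    """Yield one dict per hit line, reading the file line by line so
    that the whole result set never has to be held in memory."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or cut-off lines instead of failing
        yield dict(zip(COLUMNS, row))
```

The argument list could be passed to subprocess.run, and the output file opened in text mode and handed to iter_hits. A cut-off last line is then skipped instead of making the whole file unreadable, although an incomplete file would still mean missing hits.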