donia-lab / MetaBGC

A metagenomic strategy for harnessing the chemical repertoire of the human microbiome
GNU General Public License v3.0

StatisticsError after cd-hit step #9

Closed tamburinif closed 4 years ago

tamburinif commented 4 years ago

Hi there,

I am able to run MetaBGC successfully to completion on the test data, but when I try with my data in FASTA format, I encounter the following error:

Building a new DB, current time: 03/16/2020 13:56:57
Building a new DB, current time: 03/16/2020 13:56:57
New DB name:   /gstore/scratch/u/tamburif/test/metabgc/test/quantify_blastn_result/S18081400408.fasta/S18081400408
New DB name:   /gstore/scratch/u/tamburif/test/metabgc/test/quantify_blastn_result/S18072000209.fasta/S18072000209
New DB title:  S18081400408
New DB title:  S18072000209
Sequence type: Nucleotide
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 12858476 sequences in 287.295 seconds.
Done building BLAST Build on: input2/S18072000209.fasta
blastn -num_threads 1 -query test/CombinedIDFASTASeqs_Drep.fasta -db test/quantify_blastn_result/S18072000209.fasta/S18072000209 -dust no -max_target_seqs 1000000 -perc_identity 95.0 -qcov_hsp_perc 50 -window_size 11 -outfmt "6 sseqid slen sstart send qseqid qlen qstart qend pident evalue" -out test/quantify_blastn_result/S18072000209.txt
Done running BLAST Build on: test/CombinedIDFASTASeqs_Drep.fasta
Adding sequences from FASTA; added 43285294 sequences in 959.386 seconds.
Done building BLAST Build on: input2/S18081400408.fasta
blastn -num_threads 1 -query test/CombinedIDFASTASeqs_Drep.fasta -db test/quantify_blastn_result/S18081400408.fasta/S18081400408 -dust no -max_target_seqs 1000000 -perc_identity 95.0 -qcov_hsp_perc 50 -window_size 11 -outfmt "6 sseqid slen sstart send qseqid qlen qstart qend pident evalue" -out test/quantify_blastn_result/S18081400408.txt
Done running BLAST Build on: test/CombinedIDFASTASeqs_Drep.fasta
Traceback (most recent call last):
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/bin/metabgc", line 8, in <module>
    sys.exit(main())
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 224, in main
    cli()
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 219, in search
    cluster_file = mbgccluster(abund_file,abund_wide_table, ident_reads_file, max_dist, min_samples,min_reads_bin, min_abund_bin, cpu)
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/site-packages/metabgc/src/metabgccluster.py", line 97, in mbgccluster
    outF.write("Average Bin Abundance of Bins with >= {0} Reads: {1}\n".format(readThresh,round(mean(df_bins['BinAbundance'].tolist()),2)))
  File "/gstore/home/tamburif/miniconda3/envs/metabgc/lib/python3.8/statistics.py", line 315, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

Do you have any idea what the problem is here? Is this perhaps a situation in which no BGC genes are found and the program does not exit gracefully? Any advice is greatly appreciated!
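For context, the final frame of the traceback calls Python's statistics.mean() on the list of bin abundances, and the standard library raises exactly this error when that list is empty. A minimal reproduction using only the standard library (not MetaBGC code):

from statistics import mean, StatisticsError

# What df_bins['BinAbundance'].tolist() evaluates to when no bins survive the filters
bin_abundances = []

try:
    mean(bin_abundances)
except StatisticsError as err:
    print(err)   # "mean requires at least one data point"

So whether any bins survive the filtering step is likely the right place to look.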

abiswas-odu commented 4 years ago

What does your abundance table look like? It seems to me that no clusters are being created. Are all the abundances zero?

tamburinif commented 4 years ago

Hi there, the abundance table unique-biosynthetic-reads-abundance-table.txt was not empty, and the abundances were not all zero. I can see that there have been some updates to the repo since I first posted this question, so I will try again with the latest version of the code.

tamburinif commented 4 years ago

I'm now getting a new error, UnboundLocalError: local variable 'dbOut' referenced before assignment, after the CD-HIT step:

Total CPU time 0.44
Done Running Cd-Hit with: test_041520/identify_result/identified-biosynthetic-reads.fasta
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/src/utils.py", line 124, in MakeSearchBLASTN
    if not os.path.isfile(dbOut):
UnboundLocalError: local variable 'dbOut' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/bin/metabgc", line 8, in <module>
    sys.exit(main())
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 227, in main
    cli()
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 218, in search
    abund_file, abund_wide_table = mbgcquantify(ident_reads_file, prot_family_name, cohort_name, nucl_seq_directory,
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/src/metabgcquantify.py", line 63, in mbgcquantify
    RunBLASTNDirectoryPar(nucl_seq_directory, "",cdHitFile, "-dust no -max_target_seqs 1000000 -perc_identity 95.0 -qcov_hsp_perc 50 -window_size 11",blastn_search_directory,CPU_THREADS)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/src/utils.py", line 170, in RunBLASTNDirectoryPar
    MakeSearchBLASTNParallel(dbFileList,existingDbDir,ouputDir, searchFileList , blastParamStr, outFileList)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/src/utils.py", line 142, in MakeSearchBLASTNParallel
    pool.starmap(MakeSearchBLASTN, zip(dbFileList, repeat(existingDbDir),repeat(dbOpPath), searchFileList, repeat(blastParamStr), outFileList))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/multiprocessing/pool.py", line 768, in get
    raise self._value
UnboundLocalError: local variable 'dbOut' referenced before assignment
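This kind of UnboundLocalError typically means the variable is assigned only inside a conditional branch that was never taken for a given input, so a later reference sees it unbound. A hedged sketch of that failure pattern, with hypothetical names and a hypothetical condition (not MetaBGC's actual MakeSearchBLASTN logic):

import os

def make_search_blastn(db_file):
    # Hypothetical: db_out is assigned only when the input matches the expected
    # extension, so any other input reaches the isfile() check with db_out unbound.
    if db_file.endswith(".fasta"):
        db_out = db_file + ".db"
    if not os.path.isfile(db_out):   # UnboundLocalError when the branch above is skipped
        print("Building BLAST DB for", db_file)

A guarded version would initialize the variable before the conditional, or fail early with a clear message naming the unexpected input file.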

abiswas-odu commented 4 years ago

Thanks for finding this bug. However, the development branch is not tested: I am making improvements to metabgc build, and some of the common functions used by search still have to be tested.

So, please use the pip release.

Now, given that unique-biosynthetic-reads-abundance-table.txt is not all zeros, can you please run metabgc cluster again and reproduce the original error? Looking at the traceback, the crash seems to happen after the summary file is printed, so please also share the BinSummary.txt output.

tamburinif commented 4 years ago

Thanks, I wasn't clear on which version would be best to use. Per your recommendation, I reran with the pip release of metabgc, which resulted in the same StatisticsError as in my original post:

Building a new DB, current time: 04/16/2020 12:26:00
New DB name:   /gstore/scratch/u/tamburif/test/metabgc/test_041620/quantify_blastn_result/S18072000209.fasta/S18072000209
New DB title:  S18072000209
Sequence type: Nucleotide
Building a new DB, current time: 04/16/2020 12:26:00
New DB name:   /gstore/scratch/u/tamburif/test/metabgc/test_041620/quantify_blastn_result/S18081400408.fasta/S18081400408
New DB title:  S18081400408
Sequence type: Nucleotide
Keep MBits: T
Keep MBits: T
Maximum file size: 1000000000B
Maximum file size: 1000000000B
Adding sequences from FASTA; added 12858476 sequences in 270.668 seconds.
Done building BLAST Build on: input2/S18072000209.fasta
blastn -num_threads 1 -query test_041620/CombinedIDFASTASeqs_Drep.fasta -db test_041620/quantify_blastn_result/S18072000209.fasta/S18072000209 -dust no -max_target_seqs 1000000 -perc_identity 95.0 -qcov_hsp_perc 50 -window_size 11 -outfmt "6 sseqid slen sstart send qseqid qlen qstart qend pident evalue" -out test_041620/quantify_blastn_result/S18072000209.txt
Done running BLAST Build on: test_041620/CombinedIDFASTASeqs_Drep.fasta
Adding sequences from FASTA; added 43285294 sequences in 926.035 seconds.
Done building BLAST Build on: input2/S18081400408.fasta
blastn -num_threads 1 -query test_041620/CombinedIDFASTASeqs_Drep.fasta -db test_041620/quantify_blastn_result/S18081400408.fasta/S18081400408 -dust no -max_target_seqs 1000000 -perc_identity 95.0 -qcov_hsp_perc 50 -window_size 11 -outfmt "6 sseqid slen sstart send qseqid qlen qstart qend pident evalue" -out test_041620/quantify_blastn_result/S18081400408.txt
Done running BLAST Build on: test_041620/CombinedIDFASTASeqs_Drep.fasta
Traceback (most recent call last):
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/bin/metabgc", line 8, in <module>
    sys.exit(main())
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 224, in main
    cli()
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/metabgc_cmds.py", line 219, in search
    cluster_file = mbgccluster(abund_file,abund_wide_table, ident_reads_file, max_dist, min_samples,min_reads_bin, min_abund_bin, cpu)
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/site-packages/metabgc/src/metabgccluster.py", line 97, in mbgccluster
    outF.write("Average Bin Abundance of Bins with >= {0} Reads: {1}\n".format(readThresh,round(mean(df_bins['BinAbundance'].tolist()),2)))
  File "/gstore/scratch/u/tamburif/miniconda3_envs/metabgc/lib/python3.8/statistics.py", line 315, in mean
    raise StatisticsError('mean requires at least one data point')
statistics.StatisticsError: mean requires at least one data point

Here is the BinSummary.txt output:

Number of Reads Quantified: 5
Number of Samples Quantified: 2
Number of Bins: 2
Number of Bins with >= 10.0 Reads: 0
Number of Bins with >= 5 Reads: 0
Number of Bins with 2-4 Reads: 2
Number of Singleton Read Bins: 0

It looks like only a very small number of sequence reads align to BGCs. Is the problem here that a threshold number of reads or bins is not being met?

Thanks for your help!

abiswas-odu commented 4 years ago

Yes. None of the clusters have >= 10 reads, and by default we only keep clusters with >= 10 reads. This can be controlled using the --min_reads_bin parameter; please also see the --min_abund_bin parameter.
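A hedged illustration of why this particular run ends with an empty list, using the BinSummary.txt numbers above (the per-bin read counts are assumed; the summary only says both bins fall in the 2-4 read range):

# Not MetaBGC code: both reported bins have 2-4 reads, so the default
# >= 10 read filter keeps nothing, and statistics.mean() later receives an empty list.
bin_read_counts = [2, 3]      # assumed values consistent with "Number of Bins with 2-4 Reads: 2"
min_reads_bin = 10            # default threshold, as reported in BinSummary.txt
kept = [n for n in bin_read_counts if n >= min_reads_bin]
print(len(kept))              # 0

Lowering --min_reads_bin below the reported bin sizes (and, if needed, relaxing --min_abund_bin) should let these bins through, at the cost of keeping lower-confidence bins.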

However, it is crashing without a reasonable error message. I will add code to perform a controlled exit with a warning. Thanks.

tamburinif commented 4 years ago

Wonderful, thanks for that addition!