donovan-h-parks / RefineM

A toolbox for improving metagenome-assembled genomes.
GNU General Public License v3.0

Zero coverage for all samples #17

cedwardson4 opened this issue 6 years ago

cedwardson4 commented 6 years ago

I am having an issue with "refinem outliers" where I'm getting an error: "invalid value encountered in double_scalars." This originally happened in v0.0.22 but after looking at the github issues pages, I updated to v0.0.23 and still got this issue.

I examined my scaffold_stats output, and found that the coverage values in coverage.tsv were all zero (in both v0.0.22 and v0.0.23). I looked at some output on a different dataset run on v0.0.21 and did not see this.

It is possible this is an issue with how I am calling "refinem scaffold_stats", but I'm not getting a warning that it isn't reading my files correctly (refinem scaffold_stats appears to run without error).

donovan-h-parks commented 6 years ago

Hello. The "scaffold_stats" command has not changed between v0.0.21 and v0.0.23. Can you run the different dataset you have through v0.0.23 to confirm that it isn't a RefineM issue? Assuming this is fine it would suggest something unique about your current dataset which would help narrow down the issue.

cedwardson4 commented 6 years ago

OK. I think this has something to do with my dataset.

The headers of the BAM files and the contigs are "SPAdes"-style but "fixed" with sed and samtools to be compatible with Anvio:

```shell
for BAM in *.bam
do
    samtools view -H $BAM | sed "s/./_/" | samtools reheader - $BAM > $BAM.ok.bam
done
```

```shell
sed 's/./_/g' all_spades_contigs.fasta > all_spades_contigs_fixed.fasta
```

I also confirmed that the bins have matching headers.

cedwardson4 commented 6 years ago

I believe the problem is that my reads are single reads, not paired end. Example output:

```
[2018-02-06 23:08:18] INFO: Calculating coverage profile for 31b_SL115863_0_trimmed.sorted.bam (10 of 10):
    Finished processing 355358 of 355358 (100.00%) reference sequences.

  total reads: 25707472
  # properly mapped reads: 0 (0.0%)
  # duplicate reads: 0 (0.0%)
  # secondary reads: 914257 (3.6%)
  # reads failing QC: 0 (0.0%)
  # reads failing alignment length: 4241798 (16.5%)
  # reads failing edit distance: 3269395 (12.7%)
  # reads not properly paired: 17282022 (67.2%)
```
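Those numbers explain the zero coverage: with a single-end library no read can ever be "properly paired", so the bulk of the mapped reads are discarded from the coverage estimate. A quick back-of-the-envelope check in Python, using the counts from the log above:

```python
# Sanity check on the log counts above: with single-end reads, no read is ever
# "properly paired", so about two thirds of the reads are excluded from the
# coverage estimate unless singletons are counted.
total_reads = 25707472
not_properly_paired = 17282022

pct_excluded = 100 * not_properly_paired / total_reads
print(round(pct_excluded, 1))  # ~67.2, matching the "(67.2%)" in the log
```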
donovan-h-parks commented 6 years ago

Hello. That would do it. If you pass the "--all_reads" flag, RefineM will use singleton reads in the coverage estimate.

cedwardson4 commented 6 years ago

OK that worked, but the error in refinem outliers did not disappear.

```
[2018-02-07 23:33:01] INFO: RefineM v0.0.23
[2018-02-07 23:33:01] INFO: refinem outliers coverage_stats_new_bins/scaffold_stats.tsv bin_outliers_new_bins
[2018-02-07 23:33:01] INFO: Reading scaffold statistics.
[2018-02-07 23:33:22] INFO: Calculating statistics for 336 genomes over 355358 scaffolds.
[2018-02-07 23:33:42] INFO: Reading reference distributions.
Finding outliers in 197 of 336 (58.6%) genomes.

Unexpected error: <type 'exceptions.FloatingPointError'>
Traceback (most recent call last):
  File "/usr/local/bin/refinem", line 396, in <module>
    parser.parse_options(args)
  File "/usr/local/lib/python2.7/dist-packages/refinem/main.py", line 687, in parse_options
    self.outliers(options)
  File "/usr/local/lib/python2.7/dist-packages/refinem/main.py", line 280, in outliers
    options.report_type, outlier_file)
  File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 484, in identify
    cov_perc)
  File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 390, in outlier_info
    corr_r, _corr_p = pearsonr(gs.median_coverage, stats.coverage)
  File "/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py", line 3029, in pearsonr
    r = r_num / r_den
FloatingPointError: invalid value encountered in double_scalars
```
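The last two frames of that traceback show the mechanism: `pearsonr` divides by a denominator built from the spread of each input vector, and a contig whose coverage is zero in every sample is a constant vector with zero spread. A minimal sketch with made-up coverage numbers (not from this dataset):

```python
# Sketch of why Pearson correlation breaks on a contig with zero coverage in
# every sample: the denominator of r is sqrt(ss_x * ss_y), and a constant
# vector (e.g. all-zero coverage) has sum of squared deviations ss = 0,
# so r = r_num / r_den divides by zero.
import math

def pearson_denominator(x, y):
    """Denominator of the Pearson r formula for two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    return math.sqrt(ssx * ssy)

median_cov = [12.0, 8.5, 10.1]   # hypothetical per-sample median genome coverage
contig_cov = [0.0, 0.0, 0.0]     # zero-coverage contig: a constant vector

print(pearson_denominator(median_cov, contig_cov))  # 0.0 -- division by this fails
```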

donovan-h-parks commented 6 years ago

Hello. If you can send me your scaffold_stats.tsv file I can look into what is causing the issue. donovan[dot]parks[at]gmail.com

cedwardson4 commented 6 years ago

I've sent you a Google Drive link since it's 91 MB compressed.

donovan-h-parks commented 6 years ago

Hello. The issue is that one of your contigs has zero coverage across all your samples (namely, c_000000311451). This probably points to an underlying issue with your assembly and/or read mapping files, as many of the contigs have extremely low coverage. I will add a warning in the next release of RefineM so the program continues to run and reports such problematic contigs, but it is probably best to try to determine why the coverage is so low for some contigs.

If you wish to move forward with the current data, you need to remove c_000000311451 from the scaffold_stats.tsv file.
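That removal step can be sketched in a few lines of Python. This is a hypothetical helper, not part of RefineM; it assumes the scaffold id is the first tab-separated column of scaffold_stats.tsv, which is worth double-checking against your own file:

```python
# Hypothetical helper: drop named zero-coverage scaffolds from a
# scaffold_stats.tsv before rerunning `refinem outliers`. Assumes the
# scaffold id is the first tab-separated column.
def drop_scaffolds(lines, bad_ids):
    """Yield TSV lines whose first column is not in bad_ids."""
    for line in lines:
        if line.split("\t", 1)[0] not in bad_ids:
            yield line

rows = [
    "c_000000311451\tbin.1\t0.0\n",   # zero-coverage contig to remove
    "c_000000000007\tbin.1\t23.4\n",  # made-up row that should survive
]
kept = list(drop_scaffolds(rows, {"c_000000311451"}))
print(kept)  # only the second row remains
```

To use it on a real file, stream the input file through `drop_scaffolds` and write the surviving lines to a new TSV.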

cedwardson4 commented 6 years ago

Hi, I'm just getting back to this problem. I'm still getting the error after exploring some things as described below.

First, these bins were from 10 samples (5 depths in a lake, total-sample coassembly).

1. I realized that SPAdes never actually finished correctly, and that might have been part of the issue.
2. Next, I used a per-depth assembly to bin with MetaBAT2. I am still getting zero coverage on a large number of contigs. I realized I assembled with a fastq file that has a different file name than the individual read fastq files (probably because I cat-ted them, but I can't find the original file to verify).
3. bwa mem defaults to an edit distance of 0.04, whereas RefineM uses 0.02. I tried 0.04 and still got zero-coverage contigs.
4. All of the contigs that have zero coverage are smaller than 2500 bp, which was the minimum contig size for my binning with MetaBAT2.

Points 1-3 may not be relevant, but 4 is: the bottom line is that for calculating "bin outliers", which is what I want to do, I don't need coverage for these contigs because they aren't in any bins.

donovan-h-parks commented 6 years ago

Hi. Can I consider this issue closed? It appears the zero coverage contigs are real and not an issue of RefineM. If you wish to use RefineM, you will need to remove all zero coverage contigs from any files provided to the program.

cedwardson4 commented 6 years ago

OK. Thanks!

cedwardson4 commented 6 years ago

Reopening this issue, but also this is sort of a follow up to #11. After removing the zero-coverage contigs, remapping, and recalculating scaffold stats, as well as removing single contig bins (as suggested in #11), I was getting a new stats.py error:

```
prob = _betai(0.5*df, 0.5, df/(df+t_squared))
FloatingPointError: invalid value encountered in double_scalars
```

After looking around in the code and trying to figure things out, I came across this scipy issue, which seemed relevant (I'm using two BAM files), so I manually changed the stats.py code to the fix mentioned in the issue. Now I get an error:

```
  File "/usr/local/lib/python2.7/dist-packages/refinem/outliers.py", line 390, in outlier_info
    corr_r, _corr_p = pearsonr(gs.median_coverage, stats.coverage)
  File "/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py", line 3010, in pearsonr
    r = r_num / r_den
FloatingPointError: invalid value encountered in double_scalars
```

I'm struggling to figure this one out, so I'm not sure what is going on here, but I'm also wondering whether this Pearson calculation is necessary for my data, since the front page states that only the "mean absolute error" criterion is used?

donovan-h-parks commented 6 years ago

Can you confirm you are using the latest version of RefineM? If you can send me a simple example that produces this problem I can take a look. Ideally, a single genome and the exact RefineM commands you ran that result in the issue.

SagesWang commented 6 years ago

Hi, I have encountered the same issue, but after I filtered out the contigs whose Genome Id is "unbinned" or "bin.unbinned", it went away. The command I used to do the filtering is as follows:

```shell
awk -F"\t" '{if($2!="unbinned" && $2!="bin.unbinned") print $0}' scaffold_stats.tsv > scaffold_stats_filter.tsv
```

Hope this information helps!