bxlab / metaWRAP

MetaWRAP - a flexible pipeline for genome-resolved metagenomic data analysis
MIT License

Cannot find Binning_refiner.stats as mentioned in Usage_tutorial.md #24

Open Thexiyang opened 6 years ago

Thexiyang commented 6 years ago

Hi, I am checking the output of the Bin_refinement module, but I cannot find Binning_refiner.stats or Binning_refiner as mentioned in Usage_tutorial.md. Everything else is there. Also, there are two empty bins in metaWRAP_bins, which I think should be removed. Let me know if this could be an issue.

Thanks,

ursky commented 6 years ago

I actually took the Binning_refiner bins out of the final plot because I thought it was confusing, since the module is also called Bin_refinement. I should probably remove that from the tutorial... If you are curious about what that would look like, the other figure has binsABC, which is actually the same as Binning_refiner. It is the result of running Binning_refiner on all three inputs.

And by the two empty bins, you mean they are .fa files with a size of 0 bytes? Is there anything in them?

Thexiyang commented 6 years ago

Thanks for the explanation. Now I get it. I suggest removing it from the tutorial, as it might confuse beginners like me.

And yes, they are 0 bytes in size, but the others are fine. So in total I have 264 good bins plus 2 bins of size 0. I just did not understand why they are there. I should mention that metaWRAP significantly improved the bin quality.

Just another question. The two bins with the highest abundance according to the Quant_bins module also have the lowest completeness (only 50%). I define good bins as -c 50 -x 10. What could be the reason for this? I would expect them to have good completeness given their high abundance.

ursky commented 6 years ago

Can you check if those two bins are in the metaWRAP.stats file?

Thexiyang commented 6 years ago

They are not there. CheckM just ignored them.
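For anyone who wants to run the same check without eyeballing the files, here is a minimal sketch. It assumes the stats file is tab-separated with one header line and the bin name (without the `.fa` extension) in the first column; that layout is an assumption, so check your own file first. The function name `bins_missing_from_stats` is made up for this example and is not part of metaWRAP.

```python
import csv
from pathlib import Path


def bins_missing_from_stats(bin_dir, stats_path, suffix=".fa"):
    """Return names of bin files in bin_dir that have no row in the stats file.

    Assumption: stats_path is tab-separated, has one header line, and its
    first column holds the bin name without the file extension.
    """
    # Collect bin names from the file names, dropping the extension.
    bin_names = {p.name[: -len(suffix)] for p in Path(bin_dir).glob(f"*{suffix}")}

    # Collect bin names from the first column of the stats table.
    with open(stats_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the header line
        listed = {row[0] for row in reader if row}

    return sorted(bin_names - listed)
```

Calling it with the bin folder and the matching stats file returns a sorted list of the bins that have no stats row, which in this thread would be the two empty ones.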

ursky commented 6 years ago

One more thing, are they in the binsO folder in the work directory?

Thexiyang commented 6 years ago

They are in binsO.

ursky commented 6 years ago

But they are empty there too, right?

Thexiyang commented 6 years ago

Yes, the same. All 0 size.

ursky commented 6 years ago

And I'm guessing they are also in binsM, but are not empty?

Thexiyang commented 6 years ago

Sorry, I misunderstood your question. Yes, you are right!

ursky commented 6 years ago

I found the issue. It looks like the de-replication stage of the bin consolidation resulted in two bins that have no contigs at all. This is an artifact resulting from your low min completion parameter. Basically, ignore them! Everything is good.
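If you would rather move the empty files out of the way than ignore them, a small sketch like this works. It only looks at file size, assumes the bins end in `.fa`, and the function names and the destination folder are made up for illustration; none of this is part of metaWRAP.

```python
from pathlib import Path


def find_empty_bins(bin_dir, suffix=".fa"):
    """Return sorted paths of bin files in bin_dir that are exactly 0 bytes."""
    return sorted(
        p
        for p in Path(bin_dir).glob(f"*{suffix}")
        if p.is_file() and p.stat().st_size == 0
    )


def set_aside(paths, dest_dir):
    """Move the given files into dest_dir (created if needed); return new paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for p in paths:
        target = dest / p.name
        p.rename(target)  # assumes dest_dir is on the same filesystem
        moved.append(target)
    return moved
```

For example, `set_aside(find_empty_bins("metaWRAP_bins"), "empty_bins")` would move any 0-byte bins into a hypothetical `empty_bins` folder and leave the good bins untouched.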

For future users, I put a patch into metaWRAP v=0.8.4 that fixes this. It will come out in the next couple of weeks.

Thanks for your feedback!

As for your other question about high-abundance bins with poor completion metrics, this is unfortunately very common. I see it in my data all the time. The reason is that these high-abundance species often also have high strain heterogeneity. This confuses both the assembler and the function that estimates contig coverage, resulting in poor bins. If you really care about those organisms, you can try to assemble and bin single samples individually (or in small groups) in the hope that this reduces the coverage and heterogeneity to the point where you can assemble and bin them better.

Thexiyang commented 6 years ago

Thanks!

What about reassembly? My last attempt at the reassembly module did not work out, as it got stuck on one bin for almost 12 hours. Would it be possible to improve the completeness of these target bins by reassembling them? Should I give it another try?

ursky commented 6 years ago

Bin reassembly will most likely moderately increase the bin completion and significantly reduce bin contamination. It won't increase the completion that much. Have a look at the reassembly benchmarks in the publication.

And yeah, the reassembly can be very slow for bins that have a very high number of reads mapping to them. The module runs on all the bins in parallel (limited by your thread count of course), but with 1 thread per bin, which is why it's so slow for those very high abundance ones. It speeds things up for most users, but not all...

I actually just released metaWRAP v=0.8.4, which has a new parallelization option. Now you can choose to run without the parallelization feature, which means the bins will be reassembled one by one, but using all the threads available. This will help you overcome your issue with that one bin!
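To see why one very large bin dominates the runtime in the parallel mode, here is a toy scheduling model. It is not metaWRAP code: the function names and the work units are made up, and the one-by-one mode assumes ideal linear speedup across threads, which real assemblers do not achieve.

```python
import heapq


def makespan_parallel_bins(work, threads):
    """Total time when each bin uses 1 thread and up to `threads` bins run at once.

    Bins are handed, in the given order, to whichever worker frees up first.
    """
    workers = [0.0] * threads  # finish time of each worker
    heapq.heapify(workers)
    for w in work:
        start = heapq.heappop(workers)  # earliest free worker
        heapq.heappush(workers, start + w)
    return max(workers)


def makespan_sequential_bins(work, threads):
    """Total time when bins run one by one, each using all threads.

    Assumption: perfect linear speedup, so a bin costing w takes w / threads.
    """
    return sum(w / threads for w in work)


# One huge bin plus seven small ones, on 8 threads (arbitrary work units).
work = [120, 1, 1, 1, 1, 1, 1, 1]
parallel_time = makespan_parallel_bins(work, threads=8)
sequential_time = makespan_sequential_bins(work, threads=8)
```

In this model the parallel mode can never finish faster than the largest single bin, while the one-by-one mode spreads that bin over every thread. Because real speedup is less than linear, the parallel mode is still the better choice when the bins are of similar size.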

Thexiyang commented 6 years ago

Thanks. I will update to the new version and rerun the reassembly.