BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

CONCOCT v1.0 performance dip in metaWRAP environment #247

Closed: ursky closed this issue 4 years ago

ursky commented 5 years ago

Hey there,

As promised, I re-ran the metawrap benchmarks with the new version of metawrap (v1.1.4, which includes concoct v1.0). However, the results were a bit disheartening: concoct actually performs worse than before, even on the gut dataset it used to do well on. I re-ran the older metawrap (v1.0.2, which came with concoct v0.4) on the same assembly, and concoct did well again. Two things changed in the new version: metawrap now cuts up the contigs for concoct, and the version of concoct is newer. Perhaps it's my implementation in metawrap? I am not sure what is going on. I attached the binning results from the gut benchmark and can provide any intermediate files you need to troubleshoot this.

Attachment: Archive.zip (binning results)

alneberg commented 5 years ago

Hi @ursky,

Sorry for the late reply, I have been off work for the last three weeks. This does look strange, especially the completion ranking. Do you merge the cut-up contigs again before running the evaluation? Otherwise, that would explain the total drop in completion, I guess.

However, I can't detect any problem with the code used in the metawrap implementation...

Do you have the dataset available for me to test it as well?

ursky commented 5 years ago

Yeah, it's strange... The contigs are merged after the binning, but for some reason most of the assembly remains unbinned by concoct. The gut dataset is the same one used in the metaWRAP paper: SRA accessions SRR2053273–SRR2053308 (you can use fastq-dump --split-files to get them). There are many files, but the reads are quite short and co-assemble quickly with megahit, if you want to test the binning. Hope you can figure this one out!
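
For anyone reproducing this, a minimal Python sketch of fetching those runs (it only assumes the sra-tools fastq-dump binary is on PATH; the loop itself is a convenience and not part of metawrap or concoct):

    import subprocess

    # Download the paired-end runs SRR2053273-SRR2053308 used for the
    # gut benchmark; --split-files writes forward/reverse reads to
    # separate FASTQ files.
    for run in range(2053273, 2053309):
        subprocess.run(["fastq-dump", "--split-files", "SRR%d" % run],
                       check=True)  # stop early if a download fails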

alneberg commented 5 years ago

Hmm, that's weird. Concoct is supposed to place all contigs in bins, so nothing should end up unbinned the way you describe. But maybe that wasn't what you meant.

Is it not possible for you to share the contigs and the coverage table? That would make things easier and, most importantly, make it certain that I'm actually using the same input as you are.

ursky commented 5 years ago

That made me look closer at why I was getting "unbinned" contigs in metawrap. Found it! Looks like when a contig name ends with 0's, e.g. k93_563049_length_4438_cov_12.0000, concoct decides to rename it to k93_563049_length_4438_cov_12 for some strange reason... This in turn makes metawrap think that contig was unbinned. I don't know if that is intended behavior, but I recommend it be fixed. In the meantime, I will force metawrap to only consider the contig name before the ".", which should fix the issue.

alneberg commented 5 years ago

Ah, nice, then I have something to track down! This is due to the naming scheme used when splitting contigs: fragment names are suffixed with '.0', '.1', '.2' and so on, which maybe should be changed to something less generic. The renaming is done by the merging script, which believes this contig was split into several parts and that this is the first one. Does this problem explain the entire drop in performance?
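
To make the ambiguity concrete, here is a minimal Python sketch of the kind of suffix-stripping a merge step would do (merged_name and the contig names are illustrative, not the actual merging script):

    def merged_name(fragment_name):
        # Cut-up fragments are named "<contig>.0", "<contig>.1", ...,
        # while contigs that were never split keep their original names.
        # Stripping a trailing ".<digits>" cannot tell the cases apart.
        base, _, suffix = fragment_name.rpartition(".")
        return base if suffix.isdigit() else fragment_name

    print(merged_name("k93_1_length_900_cov_3.5.0"))
    # -> "k93_1_length_900_cov_3.5" (a genuine fragment, correctly merged)
    print(merged_name("k93_563049_length_4438_cov_12.0000"))
    # -> "k93_563049_length_4438_cov_12" (never split, yet renamed)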

ursky commented 5 years ago

Most likely - 3/4 of my contigs were not binned because of this. I'll re-run the benchmarks.

ursky commented 5 years ago

I think you don't need to change much in your code - just make sure you are only taking the value after the last ".". In Python it would be something like contig = ".".join(fragment_name.split(".")[:-1]) for extracting the contig name, and fragment_id = fragment_name.split(".")[-1] to get the identifier. Not sure about C++.
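
Spelled out as a runnable sketch (split_fragment_name is a hypothetical wrapper around the two one-liners above):

    def split_fragment_name(fragment_name):
        # Split on the last "." only: everything before it is the contig
        # name, everything after it is the fragment identifier.
        contig = ".".join(fragment_name.split(".")[:-1])
        fragment_id = fragment_name.split(".")[-1]
        return contig, fragment_id

    print(split_fragment_name("k93_1_length_900_cov_3.5.2"))
    # -> ("k93_1_length_900_cov_3.5", "2")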

alneberg commented 5 years ago

The problem is that the splitting script doesn't add anything to contigs which are not split, so it's a question of distinguishing "12.000" from "12.000.0" in the case you showed. I'm thinking of appending ".concoct_part_0" instead and then using a regular expression to fetch it. That should be more fail-safe.
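
A sketch of that approach (the regex and the original_name helper are illustrative, not the code that eventually landed in concoct):

    import re

    # Fragments carry an explicit marker, "<contig>.concoct_part_<n>";
    # contigs that were never split carry no marker at all.
    PART_RE = re.compile(r"^(.+)\.concoct_part_(\d+)$")

    def original_name(fragment_name):
        match = PART_RE.match(fragment_name)
        return match.group(1) if match else fragment_name

    print(original_name("k93_563049_length_4438_cov_12.0000"))
    # -> unchanged: no marker, so the contig keeps its full name
    print(original_name("k93_1_length_900_cov_3.5.concoct_part_2"))
    # -> "k93_1_length_900_cov_3.5"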

alneberg commented 4 years ago

This should be solved as of #236.