ursky closed this issue 4 years ago
Hi @ursky,
Sorry for the late reply; I have been off work for the last three weeks. This does look strange, especially the completion ranking. Do you merge the cut-up contigs again before running the evaluation? Otherwise, that would explain the total drop in completion, I guess.
However, I can't detect any problem with the code used in the metawrap implementation...
Do you have the dataset available for me to test it as well?
Yeah, it's strange... The contigs are merged after the binning, but for some reason most of the assembly remains unbinned by concoct. The gut dataset is the same one that was used in the metaWRAP paper: SRA numbers SRR2053273–SRR2053308 (you can use `fastq-dump --split-files` to get them). There are many files, but the reads are quite short and co-assemble quickly with megahit, if you want to test the binning. Hope you can figure this one out!
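For reference, fetching that accession range can be sketched as below. The loop only builds the accession list; the commented line shows how it would be fed to `fastq-dump` (assumes the SRA Toolkit is installed and on `PATH`).

```shell
# Build the list of gut-dataset accessions SRR2053273..SRR2053308.
for i in $(seq 2053273 2053308); do
  echo "SRR${i}"
done > accessions.txt

# In practice, pipe the list to fastq-dump to download the paired reads:
# xargs -n1 fastq-dump --split-files < accessions.txt
```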
Hmm, that's weird. Concoct is supposed to place all contigs in bins. So it's not unbinned as you describe it. But maybe that wasn't what you meant.
Is it not possible for you to share the contigs and the coverage table? That would both make it easier but most importantly more certain I'm actually using the same input as you are.
That made me look closer at why I was getting "unbinned" contigs in metawrap. Found it! It looks like when a contig name ends in zeros, e.g. `k93_563049_length_4438_cov_12.0000`, concoct decides to rename it to `k93_563049_length_4438_cov_12` for some strange reason... This in turn makes metawrap think that contig was unbinned. I don't know if that is intended behavior, but I recommend it be fixed. In the meantime, I will force metawrap to only consider the contig name before the `.`, which should fix the issue.
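A minimal sketch of the behaviour described above (`merged_name` is a hypothetical helper, not the actual merge script): if merging assumes every fragment is named `<contig>.<part_id>` and strips everything after the last dot, then a contig whose own name already ends in `.0000` gets mangled.

```python
def merged_name(fragment_name: str) -> str:
    # Hypothetical helper mirroring the described merge behaviour:
    # drop everything after the last "." to recover the contig name.
    return fragment_name.rsplit(".", 1)[0]

# A genuinely split fragment is handled correctly:
print(merged_name("k93_563049_length_4438_cov_12.0000.0"))
# -> k93_563049_length_4438_cov_12.0000

# But an unsplit contig ending in ".0000" loses its suffix:
print(merged_name("k93_563049_length_4438_cov_12.0000"))
# -> k93_563049_length_4438_cov_12
```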
Ah, nice, then I have something to track down! This is due to the naming scheme used by the splitting of contigs, contig names are suffixed by '.0', '.1', '.2' and so on, which maybe should be changed to something less generic. So the renaming is done by the merging script which believes this contig is split into several parts where this is the first one. Does this problem explain the entire drop in performance?
Most likely - 3/4 of my contigs were not binned because of this. I'll re-run the benchmarks.
I think you don't need to change much in your code - just split on the last `.`: everything before it is the contig name, and the value after it is the fragment identifier. In python it would be something like `contig = ".".join(fragment_name.split(".")[:-1])` to extract the contig name, and `fragment_id = fragment_name.split(".")[-1]` to get the identifier. Not sure how to do it in C++.
The problem is that nothing is appended to contigs which are not split, so it's a question of distinguishing "12.000" from "12.000.0" in the case you showed. I'm thinking of appending ".concoct_part_0" instead and then using a regular expression to fetch it. That should be more fail-safe.
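The proposed `.concoct_part_<n>` scheme could be sketched like this (the suffix name and the `original_contig` helper are illustrative, not the shipped implementation): a regex only strips the suffix when it is actually present, so unsplit contigs pass through untouched.

```python
import re

# Match names of the form "<contig>.concoct_part_<n>" (assumed suffix scheme).
PART_RE = re.compile(r"^(?P<contig>.+)\.concoct_part_(?P<part>\d+)$")

def original_contig(name: str) -> str:
    # Strip the part suffix if present; unsplit contigs are returned unchanged.
    m = PART_RE.match(name)
    return m.group("contig") if m else name

print(original_contig("k93_563049_length_4438_cov_12.0000.concoct_part_0"))
# -> k93_563049_length_4438_cov_12.0000

print(original_contig("k93_563049_length_4438_cov_12.0000"))
# -> k93_563049_length_4438_cov_12.0000  (no suffix, so nothing is stripped)
```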
This should be solved as of #236
Hey there,
As promised, I re-ran the metawrap benchmarks with the new version of metawrap (v1.1.4, which includes concoct v1.0). However, the results were a bit disheartening - concoct actually performs worse than before, even in the gut dataset that it used to do well in. I re-ran the older metawrap (v1.0.2, which came with concoct v0.4) on the same assembly, and concoct did well again. Two things changed in the new version - metawrap now cuts up the contigs for concoct, and the version of concoct is newer. Perhaps it's my implementation in metawrap? I am not sure what is going on. I attached the binning results from the gut benchmark, and can provide any intermediate files you need to troubleshoot this.
Archive.zip
![binning_results](https://user-images.githubusercontent.com/13810103/54452902-8e1e6300-472c-11e9-9ffd-776c89d63a65.png)