chrisquince / STRONG

Strain Resolution ON Graphs
MIT License

Apparently chimeric bins produced by merging process #23

Open chrisquince opened 5 years ago

chrisquince commented 5 years ago

One of the problematic data sets is analysed here:

/mnt/gpfs/Hackathon/StrainMetaSim/CoAssembly55

The binning is very poor prior to merging. Just 9 good bins at 75% completeness. See file:

clustering_gt1000_SCG_table_R.csv

After merging we have 58.
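For anyone wanting to reproduce the "good bins" count, here is a minimal sketch. It assumes the SCG table has one row per bin and one column per single-copy core gene holding a copy count, and that "75% completeness" means at least 75% of the SCGs are present in exactly one copy. The column names, the toy data and the function name `count_good_bins` are hypothetical and not taken from the actual file.

```python
import csv
import io


def count_good_bins(rows, scg_columns, min_fraction=0.75):
    """Count bins where at least min_fraction of the SCGs occur exactly once."""
    good = 0
    for row in rows:
        single_copy = sum(1 for col in scg_columns if int(row[col]) == 1)
        if single_copy >= min_fraction * len(scg_columns):
            good += 1
    return good


# Toy table (hypothetical layout): one row per bin, one column per SCG.
toy_csv = """bin,COG0016,COG0060,COG0080,COG0090
Bin_1,1,1,1,1
Bin_2,1,1,1,0
Bin_3,2,1,0,0
Bin_4,0,0,0,0
"""

reader = csv.DictReader(io.StringIO(toy_csv))
scg_columns = [name for name in reader.fieldnames if name != "bin"]
rows = list(reader)

print(count_good_bins(rows, scg_columns))
```

Counting only single-copy hits (rather than "at least one copy") means a bin with many duplicated SCGs is not counted as good, which is the behaviour you would want when looking for chimeric bins.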

An example of a bin that appears to have COGs from multiple species is Bin_9:

/mnt/gpfs/Hackathon/StrainMetaSim/CoAssembly55/binning/group1/Bin_ini/Bin_9/StrainAnalysis

snurk commented 5 years ago

Ok, I think I'm ready to report the results of the reference-based analysis of Bin_9. Just in case, the file with the contig names is here. As I said earlier, the contigs do not seem to have any serious misassemblies: every contig has a full-length alignment to one of the reference genomes with 99.5%+ identity. Mismatches and indels are entirely possible, though (these are consensus sequences, after all). There are also two "local" misassemblies, in contigs NODE_6994_length_9726_cov_297.455279 and NODE_12677_length_5153_cov_192.402707.

Moreover, all the alignments hit different strains of the same species, Bartonella bacilliformis, of which we have 4 strains, each represented by 3-4 scaffolds.

The QUAST assignment of contigs to references is here, but note that most of the contigs align almost equally well to all references, so the QUAST attribution is somewhat random.
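One way to make "almost equally well" concrete is to flag contigs whose best and second-best reference identities are within a small margin. This is only a sketch: the contig and reference names, the identity values and the 0.1 percentage-point tolerance are all made up for illustration, and this is not something QUAST does itself.

```python
def ambiguous_contigs(identity, tolerance=0.1):
    """Return contigs whose top two reference identities differ by <= tolerance.

    identity maps contig name -> {reference name: percent identity}.
    """
    flagged = []
    for contig, per_ref in identity.items():
        ranked = sorted(per_ref.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] <= tolerance:
            flagged.append(contig)
    return flagged


# Hypothetical identities (percent) of two contigs against strain references.
identity = {
    "contig_a": {"ref1": 99.9, "ref2": 99.85, "ref3": 99.8},
    "contig_b": {"ref1": 99.9, "ref2": 98.0},
}

print(ambiguous_contigs(identity))
```

For contigs flagged this way, the best-hit reference should not be read as a strain assignment.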

chrisquince commented 5 years ago

Yes, it appears that this bin is not chimeric at all. In general, after merging the bin quality seems OK:

$CONCOCT/scripts/Validate.pl --cfile=../binning/group1/Bin_ini/clustering_gt1000_merged.csv --sfile=Contig_Species.csv --ffile=../assembly/spades/group1.fasta

N      M      TL          S    K    Rec.      Prec.     NMI       Rand      AdjRand
41414  41413  3.8616e+08  100  114  0.878245  0.906169  0.943188  0.994375  0.795103

Compared with the contig species assignments, the overall precision is about 90% (0.906), with a recall of about 88% (0.878).
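To make the precision and recall figures easier to interpret, here is a small sketch of how such numbers can be computed from a contig-to-cluster and a contig-to-species assignment. It assumes the common clustering definitions (precision: for each cluster take its dominant species, sum, divide by the total; recall: for each species take its dominant cluster, sum, divide by the total). I have not checked that Validate.pl uses exactly these definitions or weights, so treat the function `precision_recall` and the toy data as illustrative only.

```python
from collections import Counter, defaultdict


def precision_recall(cluster_of, species_of, weight=None):
    """Clustering precision and recall, optionally weighted (e.g. by contig length)."""
    by_cluster = defaultdict(Counter)  # cluster -> species -> weight
    by_species = defaultdict(Counter)  # species -> cluster -> weight
    total = 0.0
    for contig, cluster in cluster_of.items():
        if contig not in species_of:
            continue  # skip contigs without a species label
        w = 1.0 if weight is None else weight[contig]
        species = species_of[contig]
        by_cluster[cluster][species] += w
        by_species[species][cluster] += w
        total += w
    precision = sum(max(c.values()) for c in by_cluster.values()) / total
    recall = sum(max(c.values()) for c in by_species.values()) / total
    return precision, recall


# Toy example: cluster A mixes two species, species sp2 is split over A and B.
cluster_of = {"c1": "A", "c2": "A", "c3": "A", "c4": "B", "c5": "B"}
species_of = {"c1": "sp1", "c2": "sp1", "c3": "sp2", "c4": "sp2", "c5": "sp2"}

precision, recall = precision_recall(cluster_of, species_of)
print(precision, recall)
```

Note that with species-level labels, a bin containing several strains of the same species still counts as pure under these definitions, so a high precision says nothing about whether strains have been separated.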

The question, then, is why strain resolution does not work for this sample. Also, why does merging improve the binning so much?