Open chrisquince opened 5 years ago
OK, I think I'm ready to report results of the reference-based analysis of Bin9. Just in case, the file with contig names is here. As I said earlier, the contigs do not seem to have any serious misassemblies: every contig has a full-length alignment to one of the reference genomes with 99.5%+ identity. Mismatches and indels are entirely possible, though (these are consensus sequences, after all). There are also two "local" misassemblies, in contigs NODE_6994_length_9726_cov_297.455279 and NODE_12677_length_5153_cov_192.402707.
Moreover all the alignments hit different strains of the same species, Bartonella bacilliformis, of which we have 4 strains each represented by 3-4 scaffolds:
The QUAST assignment of contigs to references is here, but note that most of the contigs align almost equally well to all references, so the QUAST attribution is somewhat arbitrary here.
Yes, it appears that this bin is not chimeric at all. In general, after merging, the bin quality seems OK:
$CONCOCT/scripts/Validate.pl --cfile=../binning/group1/Bin_ini/clustering_gt1000_merged.csv --sfile=Contig_Species.csv --ffile=../assembly/spades/group1.fasta

N      M      TL          S    K    Rec.      Prec.     NMI       Rand      AdjRand
41414  41413  3.8616e+08  100  114  0.878245  0.906169  0.943188  0.994375  0.795103
Compared with the contig species assignments, the overall precision is about 91% (0.906), with a recall of about 88% (0.878).
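For anyone wanting to sanity-check these numbers outside the Perl script, here is a minimal Python sketch of how precision and recall can be computed from a cluster-versus-species contingency table. It assumes the common definition for clusterings (precision sums the largest species count in each bin, recall sums the largest bin count for each species, both divided by the number of contigs). It counts contigs rather than bases, and whether Validate.pl weights by contig length is not stated here, so treat this as an illustration, not a reimplementation. The function name and the toy data are made up.

```python
from collections import Counter, defaultdict


def precision_recall(assignments):
    """Return (precision, recall) for (cluster, species) pairs, one per contig.

    Assumed definitions (not taken from Validate.pl itself):
      precision = sum over clusters of the dominant species count / N
      recall    = sum over species of the dominant cluster count / N
    """
    by_cluster = defaultdict(Counter)  # cluster -> species counts
    by_species = defaultdict(Counter)  # species -> cluster counts
    n = 0
    for cluster, species in assignments:
        by_cluster[cluster][species] += 1
        by_species[species][cluster] += 1
        n += 1
    if n == 0:
        raise ValueError("no assignments given")
    precision = sum(max(c.values()) for c in by_cluster.values()) / n
    recall = sum(max(c.values()) for c in by_species.values()) / n
    return precision, recall


# Hypothetical toy data: six contigs, three bins, two species.
toy = [
    ("Bin_1", "A"), ("Bin_1", "A"), ("Bin_1", "B"),
    ("Bin_2", "B"), ("Bin_2", "B"),
    ("Bin_3", "A"),
]
precision, recall = precision_recall(toy)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In the toy data, Bin_1 is contaminated by one contig of species B (which lowers precision), and species A is split across Bin_1 and Bin_3 (which lowers recall).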
The question, then, is why strain resolution does not work for this sample. Also, why does merging improve the binning so much?
One of the problematic data sets is analysed here:
/mnt/gpfs/Hackathon/StrainMetaSim/CoAssembly55
The binning is very poor prior to merging. Just 9 good bins at 75% completeness. See file:
clustering_gt1000_SCG_table_R.csv
Post merging we have 58 good bins.
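To make the "good bins at 75% completeness" count reproducible, here is a rough Python sketch. It assumes a bin counts as good when at least 75% of the single-copy core genes (SCGs) occur exactly once in it. That criterion, the function name, and the toy table are assumptions for illustration; the real SCG table would have to be parsed into this shape first.

```python
def count_good_bins(scg_counts, min_fraction=0.75):
    """Return names of bins where >= min_fraction of SCGs occur exactly once.

    scg_counts maps a bin name to a list of per-SCG copy counts.
    The single-copy criterion is an assumed definition of a "good" bin.
    """
    good = []
    for bin_name, counts in scg_counts.items():
        if not counts:
            continue  # skip bins with no SCG columns at all
        single_copy = sum(1 for c in counts if c == 1)
        if single_copy / len(counts) >= min_fraction:
            good.append(bin_name)
    return good


# Hypothetical toy table with four SCGs per bin (not real data).
toy_table = {
    "Bin_1": [1, 1, 1, 1],  # all single copy
    "Bin_2": [1, 1, 1, 2],  # 75% single copy, one duplicated
    "Bin_3": [1, 0, 2, 1],  # 50% single copy
    "Bin_4": [2, 2, 1, 3],  # mostly multi-copy, looks like a mixed bin
}
good_bins = count_good_bins(toy_table)
print(len(good_bins), good_bins)
```

A bin like the toy Bin_4, where most SCGs appear several times, is the pattern you would expect when several closely related genomes end up in one cluster.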
An example of a bin that appears to have COGs from multiple species is Bin_9:
/mnt/gpfs/Hackathon/StrainMetaSim/CoAssembly55/binning/group1/Bin_ini/Bin_9/StrainAnalysis