Open HenrikSpiegel opened 2 years ago
We are interested in checking whether kmers with RE close to 0 show any trends compared to those with RE far from 0.
We therefore wish to create a visualization of the coverage of a given BGC (extracting the true coverage from the camisim.sam files) onto which we map the kmers according to their RE.
Tasks:
The hope is to identify trends we can use to select good kmers for quantification going forward.
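The coverage extraction described above could be sketched roughly as follows. This is a minimal sketch, not the actual pipeline code. It assumes that the camisim.sam files are plain-text SAM, that the BGC itself is the reference sequence named in the RNAME column (if the BGC sits inside a larger genome the positions would need an offset), and that a kmer's coverage is summarised as the mean depth over its window. The names `coverage_from_sam` and `kmer_coverage` are hypothetical.

```python
import re

# CIGAR operations: which ones advance along the reference, and which ones
# actually place read bases on the reference (and therefore add coverage).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = {"M", "D", "N", "=", "X"}
COVERING = {"M", "=", "X"}


def coverage_from_sam(lines, ref_name, ref_len):
    """Per-base depth on `ref_name` from an iterable of SAM text lines."""
    cov = [0] * ref_len
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 4 or rname != ref_name or cigar == "*":  # unmapped or other reference
            continue
        ref_pos = pos - 1  # SAM positions are 1-based
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in COVERING:
                for i in range(ref_pos, min(ref_pos + length, ref_len)):
                    cov[i] += 1
            if op in REF_CONSUMING:
                ref_pos += length
    return cov


def kmer_coverage(cov, start, k):
    """Mean depth over the kmer starting at 0-based position `start`."""
    window = cov[start:start + k]
    if not window:
        raise ValueError("kmer window lies outside the reference")
    return sum(window) / len(window)
```

With a per-kmer coverage value in hand, each kmer could then be placed on the coverage plot and coloured by its RE.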
Overall? yes and no
In the case on the left, the initial seed is quite poor, with a negative pseudo-R2 and a very broad distribution. Likewise, the all_5000 set has a very broad distribution but otherwise appears to be nicely fitted by the NB distribution.
In contrast, in the plot on the right we observe that the fit for MAG-init and MAG-best is very poor and that MAGinator is not able to recover.
It seems to be a general problem that the refinement procedure, if given a poor start, tends to fail to find an optimized set.
Again this seems to be both yes and no.
In the first two instances we are able to generate error curves which are much denser for MAG_best than for both MAG_init and all_5000. Overall it appears that MAG_best > MAG_init ~ all_5000.
Thus, we find some evidence here that we may actually be on the right track. We are not only able to optimize the seed but actually find a subset of kmers which could outperform the larger set!
However, this is not always the case, as we see below. These are also some of the catalogues where we did not see any improvement between MAG-init and MAG-best.
For the improving catalogue: Here we observe that the kmer-set is packed quite densely in the BGC in the initial set and becomes more spread throughout the BGC. If we instead order by simulated coverage, we observe a fairly similar distribution; however, MAG-best appears more smoothly distributed over the coverage range, whereas the seed is grouped in small "clusters".
For a non-improving catalogue
Here we see no change between init and best, and at the same time nothing of much interest.
For the ranthipeptide group
Here we observe that we manage to pick kmers which are shared, which should be equal to those kmers in the <q1 group. From the right pane we can also see that the coverage is of lesser importance, whereas the location is of more importance.
Initial impressions on MAGinator
Parameters:
Downsample size: 5000
Refined set size: 100
Initial MSE improvement
We can see that for some catalogues we get no improvement while for others we get a rather high drop in MSE. However, the MSE is still quite high.
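For reference, the kind of comparison behind such an MSE drop can be sketched as below. This is only an illustration: it assumes the MSE is taken between per-sample estimates and the simulated truth, which may differ from how MAGinator computes it internally, and the function names `mse` and `relative_mse_drop` are hypothetical.

```python
def mse(estimates, truth):
    """Mean squared error between paired estimates and true values."""
    if len(estimates) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth)


def relative_mse_drop(mse_init, mse_best):
    """Fraction of the initial MSE removed by refinement (0.0 means no improvement)."""
    if mse_init == 0:
        raise ValueError("initial MSE is zero, so a relative drop is undefined")
    return (mse_init - mse_best) / mse_init
```

A catalogue where refinement gets stuck on the seed would give a relative drop of 0.0, while a large relative drop can still leave a high absolute MSE.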
Interestingly we observe one region with very low MSE:
Unfortunately, that coincides with the rRNA-containing BGC:
RE comparisons
Here we generally see little difference between MAG_init (pseudo-random seed) and MAG_best, which is in line with the generally small change in MSE above.
Note that if we look at the median instead of the fitted mu (NegBinom), the picture is a little more stable:
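One plausible reason for the added stability is that the median is less sensitive to a few extreme kmer counts than a mean-like estimate. The toy example below is a sketch under the assumption that the fitted mu behaves like the sample mean (for i.i.d. counts the maximum-likelihood estimate of the negative binomial mean is the sample mean); the counts are made up for illustration and are not taken from the data.

```python
import statistics

# Hypothetical kmer counts for one BGC in one sample; the last kmer is an
# outlier, e.g. a kmer shared with another, highly abundant region.
counts = [10, 11, 9, 10, 12, 200]

mean_estimate = statistics.mean(counts)      # stands in for the fitted NB mu
median_estimate = statistics.median(counts)  # robust alternative

print(mean_estimate, median_estimate)
```

Here a single outlying kmer pulls the mean far above the bulk of the counts, while the median stays close to them.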
Summarised error:
Here we note that MAGinator performs very poorly; this is mainly due to the rRNA BGC.
Without the rRNA bgc:
Without the rRNA bgc and using median:
Here it appears that MAG_best outperforms MAG_init but is still worse than the initial 5000 set.
Looking into the error on a per-kmer basis
We want to investigate whether there is a signal in the distribution of the per-kmer error.
In general we want to see if there is a difference between the distributions for the full set, the seed and the refined gene-sets.
Looking at a single dataset:
Here we generally observe that the refined dataset has a slightly narrower distribution compared to the seed. Note that we do not observe an increased accuracy (smaller RE); however, we should be careful about looking too directly at the RE, as quite a few assumptions go into the value.
Interestingly, for the ranthipeptide group we observe an increased spread for the refined set:
In general we may tentatively argue that the refinement process overall (when it actually changes the set rather than getting stuck on the seed) results in a slightly narrower distribution of RE which is centered close to RE=0.
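The "narrower distribution" comparison can be made quantitative with a simple spread measure per kmer set. The sketch below assumes RE is defined as (observed - expected) / expected, which is not stated explicitly in this thread, and uses the interquartile range as the spread measure. The function names and the example values are hypothetical.

```python
import statistics


def relative_error(observed, expected):
    """Signed relative error; 0 means the kmer count matches the expectation."""
    return (observed - expected) / expected


def re_summary(re_values):
    """Median and interquartile range of a collection of per-kmer RE values."""
    q1, _, q3 = statistics.quantiles(re_values, n=4)
    return {"median": statistics.median(re_values), "iqr": q3 - q1}


# Made-up RE values for a seed set and a refined set, for illustration only.
seed_re = [-0.8, -0.4, -0.1, 0.0, 0.2, 0.5, 0.9]
refined_re = [-0.2, -0.1, 0.0, 0.0, 0.1, 0.1, 0.2]

seed_summary = re_summary(seed_re)
refined_summary = re_summary(refined_re)
print(seed_summary, refined_summary)
```

A refined set with a smaller IQR and a median near 0 would match the tentative conclusion above, whereas the ranthipeptide group would show the opposite pattern in the IQR.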