MAGinator Diagnostics

HenrikSpiegel commented 2 years ago

Initial impressions on MAGinator

Parameters:

Downsample size: 5000
Refined set size: 100

Initial MSE improvement

We can see that for some catalogues we get no improvement, while for others we get a rather large drop in MSE. However, the MSE is still quite high.

Interestingly, we observe one region with very low MSE:

Unfortunately, that region coincides with the rRNA-containing BGC:

RE comparisons

Here we generally see little difference between MAG_init (pseudo-random seed) and MAG_best, which is in line with the generally small change in MSE above.

Note that if we look at the median instead of the fitted mu (NegBinom), the picture is a little more stable:
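A minimal sketch of why the median can look more stable than a fitted mean. It assumes that the fitted NegBinom mu behaves like the sample mean (for a mean-parameterised NB the maximum-likelihood estimate of mu is the sample mean); the counts are hypothetical and only illustrate the effect of a few inflated kmers.

```python
from statistics import mean, median

def summarise_counts(counts):
    """Return (mean, median) of the per-kmer counts for one BGC."""
    return mean(counts), median(counts)

# Hypothetical per-kmer counts: most kmers sit near 20, but two kmers
# are strongly inflated (e.g. because they are shared with other genomes).
counts = [18, 19, 20, 20, 21, 22, 23, 250, 300]

mu_hat, med = summarise_counts(counts)
print(f"mean={mu_hat:.1f} median={med}")
```

The two inflated kmers pull the mean far above the bulk of the counts, while the median stays with the bulk.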

Summarised error:

Here we note that MAGinator performs very poorly, mainly due to the rRNA BGC.

Without the rRNA bgc:

Without the rRNA bgc and using median:

Here it appears that MAG_best outperforms MAG_init but is still worse than the initial 5000 set.

Looking into the error at kmer basis

We want to investigate whether there is a signal in the distribution of the per-kmer error.

In general we want to see if there is a difference between the distributions for the full set, the seed and the refined gene-sets.

Looking at a single dataset:

Here we generally observe that the refined dataset has a slightly narrower distribution than the seed. Note that we do not observe an increased accuracy (smaller RE); however, we should be careful about looking too directly at the RE, as quite a few assumptions go into the value.

Interestingly, for the ranthipeptide group we observe an increased spread in the refined set:

In general we may tentatively argue that the refinement process overall (in the cases where it changes the set and does not get stuck on the seed) results in a slightly narrower distribution of RE which is centered close to RE=0.
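As a rough illustration of the comparison, the sketch below computes a per-kmer relative error and summarises its centre and spread. The definition RE = (estimate - truth) / truth is an assumption (the thread does not spell it out), and the function names and numbers are hypothetical.

```python
from statistics import median, quantiles

def relative_errors(estimates, truth):
    """Per-kmer relative error, assuming RE = (estimate - truth) / truth."""
    if truth <= 0:
        raise ValueError("truth must be positive")
    return [(e - truth) / truth for e in estimates]

def centre_and_iqr(values):
    """Median and interquartile range of a list of RE values."""
    q1, _, q3 = quantiles(values, n=4)
    return median(values), q3 - q1

# Hypothetical per-kmer coverage estimates for one BGC with true coverage 20.
truth = 20
seed_re = relative_errors([10, 14, 18, 22, 28, 34], truth)
refined_re = relative_errors([16, 18, 19, 21, 22, 24], truth)

_, iqr_seed = centre_and_iqr(seed_re)
_, iqr_refined = centre_and_iqr(refined_re)
print(f"IQR seed={iqr_seed:.3f} refined={iqr_refined:.3f}")
```

Comparing the IQR rather than the centre keeps the focus on the width of the distribution, which is less sensitive to the assumptions behind the absolute RE value.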

HenrikSpiegel commented 2 years ago

Can we find any commonalities between kmers within different RE brackets?

We are interested in checking whether we can find trends for the kmers with RE close to 0 compared to those far from 0.

We therefore wish to create a visualization of the coverage of a given BGC (extracting the true coverage from the camisim.sam files) onto which we map the kmers according to their RE.
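One way to get per-base coverage from SAM records, sketched with the standard library only. It assumes plain-text SAM lines and known reference lengths; the function name and the toy records are hypothetical, and in practice a tool such as `samtools depth` would likely be used instead. Only M, = and X operations are counted as coverage here, which is a choice of this sketch.

```python
import re
from itertools import accumulate

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def sam_coverage(sam_lines, ref_lengths):
    """Per-base coverage per reference, computed from SAM text lines."""
    # Difference arrays: +1 where an aligned block starts, -1 where it ends.
    diff = {name: [0] * (length + 1) for name, length in ref_lengths.items()}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        flag, rname, cigar = int(fields[1]), fields[2], fields[5]
        if flag & 4 or cigar == "*" or rname not in diff:
            continue  # unmapped read or unknown reference
        d = diff[rname]
        ref_pos = int(fields[3]) - 1  # SAM POS is 1-based
        for length, op in CIGAR_RE.findall(cigar):
            length = int(length)
            if op in "M=X":  # aligned bases: count as coverage
                end = min(ref_pos + length, len(d) - 1)
                if ref_pos < end:
                    d[ref_pos] += 1
                    d[end] -= 1
                ref_pos += length
            elif op in "DN":  # consumes reference without covering it
                ref_pos += length
            # I, S, H and P do not consume reference positions
    return {name: list(accumulate(d[:-1])) for name, d in diff.items()}

# Toy records for a hypothetical 10 bp reference called "bgc1".
records = [
    "r1\t0\tbgc1\t1\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tbgc1\t3\t60\t2M1D3M\t*\t0\t0\tGTACG\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
coverage = sam_coverage(records, {"bgc1": 10})
print(coverage["bgc1"])
```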

Tasks:

[x] Extract the .sam coverage values for each bgc.
[x] Create an index mapping the canonical-kmers back onto the bgc.
[x] Create a combined visualization with the BGC (coverage + gene annotation) and, in addition, rugs for the kmers.

The hope is to identify some trends we can use to select good kmers for quantification going forward.
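The index from the second task could look roughly like the sketch below. It assumes the common convention that the canonical form of a kmer is the lexicographically smaller of the kmer and its reverse complement; the function names, sequences and k are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def build_kmer_index(bgc_seqs, k):
    """Map each canonical kmer to the (bgc_name, 0-based offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in bgc_seqs.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
                index[canonical(kmer)].append((name, i))
    return index

# Toy example with two hypothetical BGC sequences and k=3.
index = build_kmer_index({"bgcA": "ACGTTG", "bgcB": "CAACGG"}, k=3)
print(index["ACG"])
```

Because both strands collapse to one key, a single canonical kmer can map to several positions, within one BGC or across BGCs, which is what the rug plot needs.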
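Grouping kmers into RE brackets could be sketched as follows. The bracket edges (0.1 and 0.5 on |RE|) and the kmer values are hypothetical placeholders, not values taken from the analysis.

```python
def bracket_kmers(kmer_re, edges=(0.1, 0.5)):
    """Group kmers by |RE| into brackets delimited by ascending `edges`."""
    labels = [f"|RE|<{e}" for e in edges] + [f"|RE|>={edges[-1]}"]
    groups = {label: [] for label in labels}
    for kmer, re_value in kmer_re.items():
        for edge, label in zip(edges, labels):
            if abs(re_value) < edge:
                groups[label].append(kmer)
                break
        else:
            # No edge matched: the kmer falls in the outermost bracket.
            groups[labels[-1]].append(kmer)
    return groups

# Hypothetical per-kmer RE values.
groups = bracket_kmers({"ACG": 0.02, "CAA": -0.3, "AAC": 1.4, "CCG": -0.08})
for label, kmers in groups.items():
    print(label, kmers)
```

Each bracket can then be drawn as its own rug on the coverage plot.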

HenrikSpiegel commented 2 years ago

More fine grained investigations

Question 1: Can we improve the NB fit of the reduced set?

Overall? Yes and no.

In the case on the left the initial seed is quite poor, with a negative pseudo-R2 and a very broad distribution. Likewise, the all_5000 set has a very broad distribution, but otherwise appears to be nicely fitted by the NB distribution.

In contrast, in the plot on the right we observe that the fit for MAG-init and MAG-best is very poor and that MAGinator is not able to recover.

It seems to be a general problem that the refinement procedure, if given a poor start, tends to fail to find an optimized set.
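For comparing how well different kmer sets are described by an NB, a minimal sketch is given below. It uses a method-of-moments fit and the mean log-likelihood per kmer as the score, which are assumptions of this example and not necessarily the fit or the pseudo-R2 used for the plots; the counts are hypothetical.

```python
import math
from statistics import mean, variance

def fit_nb_moments(counts):
    """Method-of-moments NB fit; returns (mu, size) with var = mu + mu**2 / size."""
    mu = mean(counts)
    var = variance(counts)  # sample variance
    if var <= mu:
        raise ValueError("no overdispersion: an NB moment fit is not defined")
    return mu, mu * mu / (var - mu)

def nb_logpmf(x, mu, size):
    """Log-probability of count x under an NB with mean mu and dispersion size."""
    p = size / (size + mu)
    return (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
            + size * math.log(p) + x * math.log1p(-p))

def mean_loglik(counts):
    """Average NB log-likelihood per kmer, comparable across set sizes."""
    mu, size = fit_nb_moments(counts)
    return sum(nb_logpmf(x, mu, size) for x in counts) / len(counts)

# Hypothetical per-kmer counts for one kmer set.
counts = [10, 12, 15, 20, 22, 25, 30, 40]
mu, size = fit_nb_moments(counts)
print(f"mu={mu:.2f} size={size:.2f} mean loglik={mean_loglik(counts):.3f}")
```

Averaging the log-likelihood per kmer allows a 100-kmer refined set to be compared with the 5000-kmer set on the same scale.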

Question 2: Can we improve the error distribution?

Again, the answer seems to be both yes and no.

In the first two instances we are able to generate error curves which are much denser for MAG_best than for both MAG_init and all_5000. Overall it appears that MAG_best > MAG_init ~ all_5000.

Thus, we here find some evidence that we may actually be on the right track. We are not only able to optimize the seed but actually find a subset of kmers which could outperform the larger set!

This is not always the case, however, as we see below. These are also some of the catalogues where we did not see any improvement between MAG-init and MAG-best.

Question 3: Do we observe any connection to the actual BGC structure?

For the improving catalogue: Here we observe that the kmer-set is packed quite densely within the BGC in the initial set and becomes more spread throughout the BGC. However, if we instead order the kmers by simulated coverage, the two distributions look fairly similar, although MAG-best appears more smoothly distributed over the coverage range whereas the seed is grouped in small "clusters".

For a non-improving catalogue

Here we see no change between init and best, and at the same time nothing of great interest.

For the ranthipeptide group

Here we observe that we manage to pick kmers which are shared, which should correspond to the kmers in the <q1 group. From the right pane we can also see that coverage is of lesser importance here, whereas location matters more.

HenrikSpiegel / Screener

MAGinator Diagnostics #31
