Shenhav-and-Korem-labs / SCRuB


Total ASV counts slightly higher for some samples after running SCRuB #10

Closed krcurtis closed 1 year ago

krcurtis commented 1 year ago

When we ran SCRuB on a dataset, we noticed that some samples have slightly more total counts in the output ASV table than were in the input ASV table. Why would that happen?

What is SCRuB returning in terms of the variables in the paper? Is it $\Gamma_i$ (where $\Gamma_i$ is the multinomial distribution of the non-contaminated component of sample $x_i$) adjusted for the original counts, and maybe some rounding?

Thanks! PS: We're finding SCRuB very useful; many thanks!

gaustin15 commented 1 year ago

Thank you, super glad to hear you’re finding it useful!

Your questions both come down to a design choice we made for this package: it returns the samples in the same form in which they're input, i.e., read counts.

So what the function returns is, as you described, the fitted $\Gamma$ parameters multiplied by the $N$s to get back into read-count space. Doing this exposes us to small rounding-related variations in the read counts of SCRuB's output, since we can't always map the exact learned relative abundances onto the exact read-count totals, which is why you'll sometimes see slightly higher ASV counts in SCRuB's output.
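A minimal numeric sketch of this rounding effect (Python, with made-up numbers; not SCRuB's actual code): rounding each entry of $\Gamma \cdot N$ independently can push the total slightly above the original read count $N$.

```python
import numpy as np

# Hypothetical example: fitted relative abundances (gamma) sum to 1,
# but rounding each scaled entry separately can inflate the total.
N = 1000                                    # original read-count total
gamma = np.array([0.3335, 0.3335, 0.333])   # fitted relative abundances
counts = np.round(gamma * N).astype(int)    # map back to read-count space

print(counts)        # [334 334 333]
print(counts.sum())  # 1001, i.e. one read more than N
```

Each 333.5 rounds up to 334, so the output total (1001) exceeds the input total (1000) by one read.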

As an alternative, if you think it would be helpful, we could add an option to return the decontaminated samples directly in relative abundance space and sidestep this rounding detail altogether (which would correspond to returning the fitted $\Gamma$'s). This would make a call like:

scr_out <- SCRuB(data,
                 metadata,
                 control_order = control_order,
                 return_relabunds = TRUE)

return samples in relative abundances rather than counts. Please let us know if you think this would be a helpful addition to the package; if so, we'll add it in.

krcurtis commented 1 year ago

I'm currently running QIIME2 after running SCRuB, so returning read counts is useful for me.

It sounds like SCRuB is not multiplying by $p_i N_i$, just $N_i$? Is that reasonable if the fitted contamination happened to be an extreme value, like 95%, for a low-biomass sample?

gaustin15 commented 1 year ago

Yeah, it is currently $N$; the original thinking there was along the lines of "if there were no contamination, this is what you would have observed with your read count total".

We were just discussing this topic earlier today and ultimately agreed that it makes more sense to return the $pN$ scaling as you described: we should have more confidence in the composition of low-contamination samples when running downstream analyses that aren't entirely in relative abundance space (particularly, as you mention, in cases of extremely high contamination and high variance of contamination levels between samples).
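To make the difference concrete, here's a small sketch (Python, hypothetical numbers, not SCRuB code) contrasting the two scalings for the extreme case above: a low-biomass sample where the fitted non-contaminant proportion is $p = 0.05$ (i.e., 95% contamination).

```python
import numpy as np

# Hypothetical example: gamma is the fitted non-contaminant composition,
# N the sample's original read depth, p the fitted non-contaminant fraction.
N = 10_000
p = 0.05
gamma = np.array([0.6, 0.3, 0.1])

out_N = np.round(gamma * N)        # N scaling: all 10,000 reads survive
out_pN = np.round(gamma * p * N)   # pN scaling: only the ~500 reads
                                   # attributed to true signal survive

print(out_N.sum())   # 10000.0
print(out_pN.sum())  # 500.0
```

Under the $pN$ scaling, a heavily contaminated sample contributes far fewer reads downstream, reflecting the lower confidence in its decontaminated composition.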

We’ll implement the $pN$ change as you described (I expect to have it added into the main branch in a few hours); I’ll link to this issue when merging it in.

gaustin15 commented 1 year ago

Just added the $pN$ scaling into main. I'm closing this issue for now, but please do reopen if there's anything else to discuss on this topic.