Data normalization before feeding into BooteJTK

okeydokey-gif commented 3 years ago

Hi,

Great work. I'm wondering if the input expression data should be normalized beyond RPKM values before feeding into BooteJTK. For example, if the RPKM expression values at ZT 0 total to 500,000, and the RPKMs at ZT 12 total to 1,000,000, should some further TSS-normalization be done so that the expression values at each ZT all total the same amount? For example, converting the RPKMs at each ZT to a relative abundance out of 100 before running BooteJTK?
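
For concreteness, the rescaling asked about above (often called total-sum scaling) could be sketched as follows. This is only an illustration of the question, not a recommended BooteJTK preprocessing step; the function name `total_sum_scale`, the dictionary layout, and the toy values are all assumptions made for this example.

```python
def total_sum_scale(values_by_zt, target=100.0):
    """Rescale each time point's values so that they sum to `target`."""
    scaled = {}
    for zt, values in values_by_zt.items():
        total = sum(values)
        if total == 0:
            raise ValueError(f"all values are zero at {zt}")
        scaled[zt] = [v * target / total for v in values]
    return scaled


# Toy RPKM values for three features at two time points (hypothetical data).
rpkm = {
    "ZT0": [100.0, 300.0, 100.0],   # totals 500
    "ZT12": [400.0, 400.0, 200.0],  # totals 1000
}
relative = total_sum_scale(rpkm)
for zt, values in relative.items():
    print(zt, values, sum(values))
```

After scaling, every time point sums to the same total, so differences between time points reflect changes in relative abundance only.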

Thank you for any guidance on this!

alanlhutchison commented 3 years ago

Hi,

Thanks for reaching out.

You should likely be using transcripts per million (TPM; see this link for a discussion of RPKM, FPKM, and TPM: https://rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/) and then log-transforming the values (something like log2(x + 1)) so that they are approximately normally distributed (instead of log-normally distributed, as expression values usually are).
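
A minimal sketch of that transformation, assuming the starting point is a list of RPKM values for one sample. The conversion uses the standard relation TPM = RPKM / sum(RPKM) * 1e6; the function names `rpkm_to_tpm` and `log2_plus_one` and the toy values are hypothetical and are not part of BooteJTK.

```python
import math


def rpkm_to_tpm(rpkm_values):
    """Convert one sample's RPKM values to TPM (values then sum to 1e6)."""
    total = sum(rpkm_values)
    if total == 0:
        raise ValueError("sample has no expression")
    return [v / total * 1e6 for v in rpkm_values]


def log2_plus_one(values):
    """Apply log2(x + 1); the +1 pseudocount keeps zeros defined."""
    return [math.log2(v + 1.0) for v in values]


# Hypothetical RPKM values for four features in one sample.
sample_rpkm = [10.0, 0.0, 250.0, 40.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
sample_log = log2_plus_one(sample_tpm)
print(sample_tpm)
print(sample_log)
```

The conversion is applied per sample (per time point and replicate), so every sample ends up with the same TPM total before the log transform.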

Let me know if this makes sense.

Alan

Alan L. Hutchison, MD, PhD; PGY-2, Internal Medicine; University of Chicago Medicine; he/him/his

okeydokey-gif commented 3 years ago

Hi Alan,

Thank you for your helpful and quick response! I have now normalized my data to TPM, and I will also log-transform the TPM data as you suggested. My issue is that I am working with a meta'omics data set (just piling on the compositional data, I know) where not every feature is represented at every time point. I figured the easiest biological interpretation of the BooteJTK output would come from using only features that were observed at every ZT as input. However, this reduction is what causes the total TPM per ZT to start varying. If I only consider features that were observed at every ZT, the total TPM per ZT ranges from ~500,000 to ~800,000.

Given this, do you think it's better to just feed all the features to BooteJTK, even if many features have zero values at some/most ZTs, because that would maintain the total TPM per ZT? Perhaps there's a better way to adjust for this? Maybe including a single line item "other" feature that represents all of the TPM counts removed after reducing the feature set, so that only features hit at every ZT remain, while also maintaining the total TPM per ZT?
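
To make the "other" idea concrete, here is a rough sketch of that filtering step. It only illustrates the bookkeeping and does not answer whether such a row is appropriate input for BooteJTK. The function name `keep_complete_features`, the row name "other", and the toy TPM values are assumptions made for this example.

```python
def keep_complete_features(tpm_by_feature, add_other=True):
    """Keep features with nonzero TPM at every time point.

    If `add_other` is True, the TPM of all removed features is summed
    into one extra row named "other", so per-time-point totals are kept.
    Assumes no real feature is already named "other".
    """
    if not tpm_by_feature:
        return {}
    n_timepoints = len(next(iter(tpm_by_feature.values())))
    kept = {
        name: list(values)
        for name, values in tpm_by_feature.items()
        if all(v > 0 for v in values)
    }
    if add_other:
        other = [0.0] * n_timepoints
        for name, values in tpm_by_feature.items():
            if name not in kept:
                for i, v in enumerate(values):
                    other[i] += v
        kept["other"] = other
    return kept


# Hypothetical TPM values at two time points; geneB is missing at the second.
tpm = {
    "geneA": [500000.0, 400000.0],
    "geneB": [300000.0, 0.0],
    "geneC": [200000.0, 600000.0],
}
filtered = keep_complete_features(tpm)
totals = [sum(col) for col in zip(*filtered.values())]
print(filtered)
print(totals)
```

With `add_other=False` the same function returns only the fully observed features, which reproduces the situation described above where totals differ between time points.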

Thanks again for any guidance you have!
