Research paper - Githubissues

WisamSaleem commented 4 years ago

Hi. Could you help me with some references from books or research papers that glmGamPoi is built upon? Thanks

const-ae commented 4 years ago

Hi Wisam,

sure :) glmGamPoi is built on the experience of the last ten years with tools mainly designed for bulk RNA-seq, namely edgeR and DESeq/DESeq2. The most relevant papers are:

Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8
Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106. https://doi.org/10.1186/gb-2010-11-10-r106
Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. https://doi.org/10.1093/nar/gks042

In addition, the implementation leverages on-disk data with the DelayedArray and the beachmat packages:

Lun, A. T. L., Pagès, H., & Smith, M. L. (2018). beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types. PLoS Comput. Biol., 14(5), e1006135. https://doi.org/10.1371/journal.pcbi.1006135

Furthermore, I recently started to work on differential testing using the quasi-likelihood framework. The most important paper here is:

Lund, S. P., Nettleton, D., McCarthy, D. J., & Smyth, G. K. (2012). Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology, 11(5). https://doi.org/10.1515/1544-6115.1826

In the beginning, I also used

Bandara, U., Gill, R., & Mitra, R. (2019). On computing maximum likelihood estimates for the negative binomial distribution. Statistics & Probability Letters, 148, 54–58. https://doi.org/10.1016/j.spl.2019.01.009

to speed up the inference of the overdispersion parameters. However, I recently refactored the code to make it easier to maintain. The new version uses something similar to a run-length encoding for the counts, which brings the same performance boost as Bandara et al.'s formulation.
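To illustrate the general idea (this is a minimal sketch, not the actual glmGamPoi code): when many counts repeat, as with the zeros in sparse data, the negative binomial log-likelihood can be summed over the table of distinct counts instead of over every observation. The function names below are hypothetical, and the sketch assumes a single shared mean `mu` and the parameterization Var(y) = mu + theta * mu^2.

```python
from collections import Counter
from math import lgamma, log


def nb_logpmf(y, mu, theta):
    """Log-probability of count y under NB with mean mu and overdispersion theta.

    Assumed parameterization: Var(y) = mu + theta * mu**2, size r = 1 / theta.
    """
    r = 1.0 / theta
    return (
        lgamma(y + r) - lgamma(r) - lgamma(y + 1)
        + r * log(r / (r + mu))
        + y * log(mu / (r + mu))
    )


def nb_loglik_naive(counts, mu, theta):
    # One log-pmf evaluation per observation.
    return sum(nb_logpmf(y, mu, theta) for y in counts)


def nb_loglik_tabulated(counts, mu, theta):
    # Tabulate the distinct counts once, then evaluate the log-pmf
    # once per distinct value and weight it by its frequency.
    table = Counter(counts)
    return sum(n * nb_logpmf(y, mu, theta) for y, n in table.items())


# Hypothetical sparse counts: 15 observations but only 4 distinct values.
counts = [0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 3, 0, 1]
mu = sum(counts) / len(counts)

naive = nb_loglik_naive(counts, mu, theta=0.5)
tabulated = nb_loglik_tabulated(counts, mu, theta=0.5)
```

Both functions return the same value up to floating-point rounding. The tabulated version pays off inside an optimizer over `theta`, because the table is built once and each iteration then needs only as many `lgamma` calls as there are distinct counts.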

If you are also interested in the statistical background, I learned a lot about generalized linear models from

Dunn, P. K., & Smyth, G. K. (2018). Generalized Linear Models With Examples in R. https://doi.org/10.1007/978-1-4419-0118-7

For a more high-level overview and introduction to the topic, I would recommend taking a look at Modern Statistics for Modern Biology (https://www.huber.embl.de/msmb/) by Wolfgang Huber (my boss :D) and Susan Holmes. Chapter 2 specifically talks about handling high-throughput count data.

I hope the above list is helpful. If you have any more questions or are curious about a specific topic, just let me know :)

WisamSaleem commented 4 years ago

Great!

Thanks a lot, Constantin.

I am working on a comparison of count data models, i.e. Poisson, quasi and NB with its variants such as ZINB. I am trying to investigate how well these models fit microbial data. I am familiar with DESeq2 and use it quite often.

Wish you a happy weekend

Wisam

const-ae commented 4 years ago

Great, yes, I think good comparisons are of high interest. I am not an expert on microbial count data, but from what I have heard, it has some of the same challenges as single-cell data, so I would be curious to hear how glmGamPoi is doing.

I will close this issue for now, but feel free to reopen if anything else comes up.

Best, Constantin

const-ae / glmGamPoi

Research paper #2