broadinstitute / ssGSEA2.0

Single sample Gene Set Enrichment analysis (ssGSEA) and PTM Enrichment Analysis (PTM-SEA)

Optimal Gene Filtering for TPM Expression Data in RNA-seq Analysis: Impact of Non-Protein Coding Biotypes on Hallmark Enrichment Analysis #28

Open snijesh opened 8 months ago

snijesh commented 8 months ago

Hello members,

I am currently working with TPM expression data obtained from RNA-seq analysis, and my dataset includes a diverse range of biotypes such as miRNA, lncRNA, pseudogenes, etc., resulting in a total of around 60,000 genes. As I intend to perform enrichment analysis (ssGSEA) using the hallmark gene list from MSigDB, I am faced with a crucial decision regarding whether to filter the data based on biotype='protein coding'.

Given the diverse nature of the genes in my dataset, I am uncertain about the potential impact of including non-protein coding biotypes on the enrichment analysis. Filtering by biotype='protein coding' seems like a logical step to focus on protein-coding genes relevant to the hallmark pathways, but I would like to seek the community's advice and experiences on this matter.

Here are some specific questions to guide the discussion:

  1. In the context of hallmark pathway enrichment analysis, what are the potential advantages and disadvantages of including non-protein coding genes in the dataset?

  2. Has anyone encountered similar scenarios with a diverse set of biotypes in RNA-seq data, and if so, what criteria did you use for gene filtering, especially concerning biotypes?

  3. Are there specific biotypes, such as miRNA, lncRNA, or pseudogenes, that are known to significantly impact or contribute to hallmark pathway enrichment analysis?

  4. How does the choice of gene filtering criteria, specifically regarding biotype, affect the biological interpretation of enrichment analysis results using hallmark gene sets?

I appreciate any insights, experiences, or recommendations the community can provide to help me make an informed decision on whether to filter my RNA-seq data by biotype='protein coding' for hallmark pathway enrichment analysis.

Thank you in advance for your assistance!

drmani commented 8 months ago

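As far as I know, MSigDB gene sets contain only protein-coding genes. ssGSEA scores each gene set against the ranked list of all genes in a sample, so genes that belong to no gene set still take up positions in that ranking. Including non-protein-coding biotypes will therefore affect your enrichment scores and can dilute any enrichment signal that is present. So, the best approach is to filter out the non-protein-coding genes before running the analysis.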
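The filtering step suggested above could look roughly like the following. This is a minimal sketch, not code from ssGSEA2.0: the function name `filter_protein_coding`, the in-memory table layout, and the TPM values are made up for illustration, and the `protein_coding` label assumes an Ensembl/GENCODE-style annotation.

```python
# Toy TPM table: gene symbol -> TPM values for two samples (made-up numbers).
tpm = {
    "TP53": [35.2, 41.0],
    "MYC": [12.7, 88.3],
    "MALAT1": [950.1, 700.4],
    "MIR21": [3.3, 9.8],
    "PTENP1": [1.1, 0.4],
}

# Toy annotation: gene symbol -> biotype. The labels follow the
# Ensembl/GENCODE naming style; adjust them to match your own annotation.
biotypes = {
    "TP53": "protein_coding",
    "MYC": "protein_coding",
    "MALAT1": "lncRNA",
    "MIR21": "miRNA",
    "PTENP1": "processed_pseudogene",
}


def filter_protein_coding(tpm_rows, biotype_by_gene, keep="protein_coding"):
    """Keep only the rows whose gene is annotated with the `keep` biotype.

    Genes missing from the annotation are dropped as well, because
    `dict.get` returns None for them and None never equals `keep`.
    """
    return {
        gene: values
        for gene, values in tpm_rows.items()
        if biotype_by_gene.get(gene) == keep
    }


filtered = filter_protein_coding(tpm, biotypes)
print(sorted(filtered))
```

With a real dataset the annotation would typically come from the GTF file used for quantification, and the filtered table would then be exported in whatever input format your ssGSEA run expects.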