Missing genes in summarizedexperiment object

abuturabnaqvi commented 3 years ago

Hello team TCGAbiolinks!

I am having an issue with the GDCprepare function. I am retrieving legacy gene expression quantification data for various cancers. In this experiment I need to select a set of genes and compare their expression between tumor and normal samples. When I set the summarizedExperiment argument to TRUE in GDCprepare, it runs fine, but some genes are dropped (e.g., the raw files contain ~20,500 genes, but the summarized object contains only ~19,470). The genes I need are among those missing from the dataset. Kindly help me with this issue.

tiagochst commented 3 years ago

@abuturabnaqvi You will need to set summarizedExperiment = FALSE in GDCprepare if you want all genes. We decided to use updated gene information from biomaRt instead of importing the annotation version TCGA used. If you want to use the TCGA annotation to keep all genes, take a look at https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files

A large share of the missing genes have been retired, for example: http://useast.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000122718

Which genes specifically are missing from your dataset? Are you comparing gene names or Ensembl (ENSG) IDs? Gene names change over time, so you need to check alias names for the missing genes.
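As a rough illustration of the alias check suggested above, here is a minimal base R sketch. All gene symbols and the `aliases` lookup are invented for the example; in practice the alias table would come from an external source such as HGNC or biomaRt, and `present` would be the gene names in the prepared object.

```r
# Genes of interest and the symbols found in the prepared object (toy values).
wanted  <- c("GENE_A", "GENE_B_OLD", "GENE_C")
present <- c("GENE_A", "GENE_B_NEW", "GENE_D")

# Genes of interest that are not found under the name we asked for.
not_found <- setdiff(wanted, present)

# Hypothetical alias table: names are retired symbols, values are current ones.
aliases <- c(GENE_B_OLD = "GENE_B_NEW")

# Look up the current symbol for each gene that was not found (NA if no alias).
current <- unname(aliases[not_found])

# Genes that are present after all, just under their current symbol.
renamed <- not_found[!is.na(current) & current %in% present]

# Genes that remain missing even after the alias lookup.
still_missing <- setdiff(not_found, renamed)

print(renamed)
print(still_missing)
```

Genes that end up in `still_missing` are the ones worth reporting, since a rename alone does not explain their absence.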

abuturabnaqvi commented 3 years ago

I did set summarizedExperiment = FALSE, but then I only get the un-annotated table, with barcodes as columns and expression values. Is there any way I could annotate the object manually? I have also checked for changes in gene names and tried the aliases, but the genes are still missing.
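One possible way to annotate such a table by hand is a left join against an annotation table, so that genes without a match are kept rather than dropped. The sketch below uses toy data only. It assumes the gene identifiers look like `SYMBOL|ENTREZ`, as in the legacy RNASeqV2 files; the column names `gene_id`, `entrezgene_id`, and `chromosome_name`, and all values, are assumptions for illustration, not the actual GDCprepare output.

```r
# Toy expression table: one row per gene, one column per sample (invented values).
expr <- data.frame(
  gene_id       = c("GENE_A|1", "GENE_B|2", "GENE_C|3"),
  SAMPLE_TUMOR  = c(10.5, 0.0, 250.1),
  SAMPLE_NORMAL = c(12.0, 1.3, 198.7),
  stringsAsFactors = FALSE
)

# Split "SYMBOL|ENTREZ" into two columns (assumed identifier format).
parts <- strsplit(expr$gene_id, "|", fixed = TRUE)
expr$symbol        <- vapply(parts, `[`, character(1), 1)
expr$entrezgene_id <- as.integer(vapply(parts, `[`, character(1), 2))

# Toy annotation table that deliberately lacks Entrez ID 2.
annotation <- data.frame(
  entrezgene_id   = c(1L, 3L),
  chromosome_name = c("19", "12"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every expression row; unmatched genes get NA annotation.
annotated <- merge(expr, annotation, by = "entrezgene_id", all.x = TRUE)

print(annotated)
```

With `all.x = TRUE`, the unannotated gene stays in the table with `NA` in the annotation columns, so the gene count matches the raw files.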

abuturabnaqvi commented 3 years ago

Hello @tiagochst. While looking for a solution to my problem, I came across some odd behavior in the GDCprepare code. I only checked the relevant functions, readGeneExpressionQuantification and makeSEfromGeneExpressionQuantification. The line gene.location <- gene.location[!duplicated(gene.location$entrezgene_id),] seems to be the problem. I took the gene HIST2H3A and looked for it in the output of gene.location <- get.GRCh.bioMart(genome). With both hg19 and hg38, the gene is present in the data set. However, after running gene.location <- gene.location[!duplicated(gene.location$entrezgene_id),] the gene is removed from the data set, even though it has a different Entrez ID. When I tried unique() instead of !duplicated(), the gene was, surprisingly, not removed. I would love to understand why this happens, since the filter works on the Entrez ID column, which has different values even for similar genes.

Regards

tiagochst commented 3 years ago

The issue is shown below. HIST2H3A is mapped to three different Entrez IDs, which are also mapped to two other gene names.

Unfortunately, the hg19 data in TCGA was mapped using entrezgene_id. The hg38 data in GDC is mapped to ENSG IDs, so that gene should still exist there.

I am not sure if you want to use hg19 from the legacy archive, but using the harmonized data is recommended.

If I used unique() instead, that would break the code logic, since I would have more annotation rows than gene expression rows.

[image: Screen Shot 2021-10-18 at 10.00.40 AM.png]
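The difference between the two calls can be reproduced on a toy table. This is only a sketch: the Entrez IDs and the `OTHER_*` names are invented, the `external_gene_name` column is an assumption about the annotation table, and it assumes `unique()` was applied to the whole data frame.

```r
# Toy annotation table: each Entrez ID maps to HIST2H3A and to another name.
# IDs and OTHER_* names are made up for illustration.
gene.location <- data.frame(
  entrezgene_id      = c(1001, 1001, 1002, 1002, 1003, 1003),
  external_gene_name = c("OTHER_1", "HIST2H3A", "OTHER_2", "HIST2H3A",
                         "OTHER_1", "HIST2H3A"),
  stringsAsFactors = FALSE
)

# Keeps only the first row seen for each Entrez ID.
dedup <- gene.location[!duplicated(gene.location$entrezgene_id), ]

# Drops only rows that are identical in every column.
uniq <- unique(gene.location)

print("HIST2H3A" %in% dedup$external_gene_name)  # FALSE: never the first row
print("HIST2H3A" %in% uniq$external_gene_name)   # TRUE: no row is a full duplicate
print(nrow(dedup))                               # 3, one row per Entrez ID
print(nrow(uniq))                                # 6, more rows than Entrez IDs
```

`!duplicated()` on the ID column guarantees one annotation row per Entrez ID, at the cost of losing any name that never appears first. `unique()` keeps the name but leaves several rows per ID, which no longer lines up one-to-one with the expression rows.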

abuturabnaqvi commented 3 years ago

Okay! I get it. Thanks for helping me out. Keep up the great work!

BioinformaticsFMRP / TCGAbiolinks

Missing genes in summarizedexperiment object #476