shaistamadad commented 2 years ago

Hi, What is the best resource to get canonical genes TSS information for human genome? I downloaded the TSS for canonical genes from the NCBI browser:

https://genome.ucsc.edu/cgi-bin/hgTables using the following filters: Assembly= (Dec 2013 GRCh38 Group= gene and predictions Track= GENCODE V39 table= known gene

However, there are multiple entries per gene symbol in the output and I am not sure how to filter out the unnecessary information.

Before calculating the distances from each region in accessibility datal, I only retained genes in the TSS matrix which are present in both the atac and the gene expression data:

mira.tl.get_distance_to_TSS(atac_main,
                            tss_data=TSS_Data,
                            peak_chrom='chr',
                            peak_start='start',
                            peak_end='end',
                            gene_id='geneSymbol',
                            gene_chrom='chrom',
                            gene_strand='strand',
                            gene_start='txStart',
                            gene_end='txEnd',
                            genome_file='MIRA/hg38.chrom.sizes')

Then I trained the LITE model using your tutorial:

common_genes= intersection(list(atac_data.var['gene_name']),list(rna_data.var.index))
genes_to_model = set()
# add highly variable genes
genes_to_model.update(list(rna_data.var["gene_ids-0"][rna_data.var.highly_variable.values].index))
# add top 200 from each topic
for i in range(20):
    genes_to_model.update(rna_model.get_top_genes(i, top_n=200))
genes_to_model = genes_to_model - set(['TAFA1','PAKAP-1']) # removed unannotated genes
# (genes_to_model)

genes_to_model_final= intersection(common_genes,list(genes_to_model)) # filter out genes not present in the atac data 

lite_model = mira.rp.LITE_Model(expr_model=rna_model, 
                                accessibility_model=atac_model, 
                                genes=genes_to_model_final)
lite_model.fit(expr_adata=rna_data, atac_adata=atac_main)
lite_model.predict(expr_adata=rna_data, atac_adata=atac_main) # predicts expression levels given the accessibility state

But I get the following error:

ValueError: The following genes for RP modeling were not found in the TSS annotation: AARS, AC008522.1, AC009950.1, AC027682.1, AC087762.1, AC138028.1, AC245100.1, ARMC4, BHMG1, CCDC114, CCDC58, CNTD2, DEC1, FAM122A, FAM86C1, FAM92B, GVQW2, H1FNT, HIST1H1A, HIST1H1C, HIST1H1D, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AH, HIST1H2BD, HIST1H2BG, HIST1H2BJ, HIST1H2BK, HIST1H3B, HIST1H3D, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4H, HIST1H4J, HIST2H2BF, HIST3H3, KARS, KIAA0556, KIAA1211, KIAA1257, LRMP, MARCH1, MARCH11, MARCH2, MARCH4, MARCH8, MARCH9, PIH1D3, RARS, SPERT, TMEM189, TMEM8A, TTC25, VARS, WARS, WDR34, WDYHV1

Note: These genes are not present in the TSS matrix, so I manually filtered them out of the atac and RNA anndata objects, but the error doesn't resolve.

Thanks so much for your help!

AllenWLynch commented 2 years ago

Hi,

For your table, you should select knownCanonical, which is a non-duplicated list of the "canonical" splice variant and TSS chosen to represent each gene. You may then have to join kgXref table to add symbol fields and other helpful metadata.

AL

From: shaistamadad @.> Sent: Tuesday, February 15, 2022 6:19 AM To: cistrome/MIRA @.> Cc: Subscribed @.***> Subject: [cistrome/MIRA] canonical TSS site data for training the LITE Model (Issue #7)

Hi, What is the best resource to get canonical genes TSS information for human genome? I downloaded the TSS for canonical genes from the NCBI browser:

https://genome.ucsc.edu/cgi-bin/hgTables https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgenome.ucsc.edu%2Fcgi-bin%2FhgTables&data=04%7C01%7C%7C39eaf9f0164a4c51a6bd08d9f07d6145%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637805243559674329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=CbCvHnoe3vnUS6l0346%2F3%2FQRelxchztbHcqfImyPZ3k%3D&reserved=0 using the following filters: Assembly= (Dec 2013 GRCh38 Group= gene and predictions Track= GENCODE V39 table= known gene

However, there are multiple entries per gene symbol in the output and I am not sure how to filter out the unnecessary information.

Before calculating the distances from each region in accessibility datal, I only retained genes in the TSS matrix which are present in both the atac and the gene expression data:

mira.tl.get_distance_to_TSS(atac_main, tss_data=TSS_Data, peak_chrom='chr', peak_start='start', peak_end='end', gene_id='geneSymbol', gene_chrom='chrom', gene_strand='strand', gene_start='txStart', gene_end='txEnd', genome_file='MIRA/hg38.chrom.sizes')

Then I trained the LITE model using your tutorial:

common_genes= intersection(list(atac_data.var['gene_name']),list(rna_data.var.index)) genes_to_model = set()

add highly variable genes

genes_to_model.update(list(rna_data.var["gene_ids-0"][rna_data.var.highly_variable.values].index))

add top 200 from each topic

for i in range(20): genes_to_model.update(rna_model.get_top_genes(i, top_n=200)) genes_to_model = genes_to_model - set(['TAFA1','PAKAP-1']) # removed unannotated genes

(genes_to_model)

genes_to_model_final= intersection(common_genes,list(genes_to_model)) # filter out genes not present in the atac data

lite_model = mira.rp.LITE_Model(expr_model=rna_model, accessibility_model=atac_model, genes=genes_to_model_final) lite_model.fit(expr_adata=rna_data, atac_adata=atac_main) lite_model.predict(expr_adata=rna_data, atac_adata=atac_main) # predicts expression levels given the accessibility state

But I get the following error:

ValueError: The following genes for RP modeling were not found in the TSS annotation: AARS, AC008522.1, AC009950.1, AC027682.1, AC087762.1, AC138028.1, AC245100.1, ARMC4, BHMG1, CCDC114, CCDC58, CNTD2, DEC1, FAM122A, FAM86C1, FAM92B, GVQW2, H1FNT, HIST1H1A, HIST1H1C, HIST1H1D, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AH, HIST1H2BD, HIST1H2BG, HIST1H2BJ, HIST1H2BK, HIST1H3B, HIST1H3D, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4H, HIST1H4J, HIST2H2BF, HIST3H3, KARS, KIAA0556, KIAA1211, KIAA1257, LRMP, MARCH1, MARCH11, MARCH2, MARCH4, MARCH8, MARCH9, PIH1D3, RARS, SPERT, TMEM189, TMEM8A, TTC25, VARS, WARS, WDR34, WDYHV1

Note: These genes are not present in the TSS matrix, so I manually filtered them out of the atac and RNA anndata objects, but the error doesn't resolve.

Thanks so much for your help!

— Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fcistrome%2FMIRA%2Fissues%2F7&data=04%7C01%7C%7C39eaf9f0164a4c51a6bd08d9f07d6145%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637805243559674329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=chiop7Kykaav%2BpN7nYxuvqJrCOLf1PiE9GKm9%2Fekyk8%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPFK7PY5AHVP2B7T3TLU3JAEFANCNFSM5OOLGWBQ&data=04%7C01%7C%7C39eaf9f0164a4c51a6bd08d9f07d6145%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637805243559674329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=0Hjf8C8070X%2Ffbie4VLFIbg%2ByNZtiCkBIEE7WrFThxw%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7C%7C39eaf9f0164a4c51a6bd08d9f07d6145%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637805243559674329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=0XTJABl63kNuRBOb9gWGiPt7JfT09kJHdxvFA8rItlE%3D&reserved=0 or Androidhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7C%7C39eaf9f0164a4c51a6bd08d9f07d6145%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637805243559674329%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vo7kc1nr8Pc%2FDoH3ccLUgYcxuizjyboWdqwNXR4BWPo%3D&reserved=0. You are receiving this because you are subscribed to this thread.Message ID: @.***>

shaistamadad commented 2 years ago

thank you!

cistrome / MIRA

canonical TSS site data for training the LITE Model #7

add highly variable genes

add top 200 from each topic

(genes_to_model)