gao-lab / GLUE

Graph-linked unified embedding for single-cell multi-omics data integration
MIT License
365 stars 55 forks source link

scglue.data.get_gene_annotation gives many NaN values. #122

Open oldvalley49 opened 1 month ago

oldvalley49 commented 1 month ago

Hello!

Thank you for developing this tool. I wanted to reach out to see if something was going wrong with my code. I'm using dataset preprocessed with Seurat for GLUE and following the tutorial for scRNA and scATAC integration. When I run this code segment:

scglue.data.get_gene_annotation( pbmc_rna, gtf="gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz", gtf_by="gene_name" )

many of the genes end up getting NaN in chrom, chromStart, and chromEnd. The genes that fail to assign the ranges seem to be those that start with AL such as AL627309.1, AL590822.1. How can I fix this issue? Thank you in advance.

TuDou-PK commented 1 month ago

same problem for me, do you solve it?

zlqq1001 commented 1 month ago

The same problem and it came always in my 3 datasets. First I tried to remove these NA gene's rows of rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]], and everything went well in the step1 of pre-treating data with 0 bugs. But it came a new problem in the beginning of step2, the model training, that I can't load RNA-pp.h5ad files came from step1, just because of the before deletion of these NA rows: rna = ad.read_h5ad("rna_pp.h5ad") ValueError: Variables annot. var must have number of columns of X (11787), but has 10215 rows.

Then I tried to fill these "chrom" NA by "chrn" and "chromStart&end" NA by some random number, but it still couldn't work when I ran: guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac) and the error was: ValueError: Not all features are strand specific!

So how should this problem be handled? Around 10% genes fail to assign the ranges and how to do with these genes NA rows? Thanks for advance.

Jeff1995 commented 1 month ago

Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:

  1. Replace gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;
  2. Remove those genes from adata.var along with the corresponding columns in adata.X, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].

Let me know if these solutions work.

zlqq1001 commented 1 month ago

Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:

  1. Replace gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;
  2. Remove those genes from adata.var along with the corresponding columns in adata.X, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index].

Let me know if these solutions work.

Thanks for reply. I tried 2 and it did work well! Here's my codes: I used 'gencode.v46.chr_patch_hapl_scaff.annotation.gtf.gz' and had 1572 genes which couldn't be matched.

dims

rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 11787 rows

find NA

na_rows = rna.var.loc[:, ["chromStart", "chromEnd"]].isna().any(axis=1) rna.var[na_rows].loc[:, ["chrom", "chromStart", "chromEnd"]] #1572 NA rows

delete NA rows

rna.var.dropna(subset=["chromStart", "chromEnd"], inplace=True) rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 10215 rows

delete NA rows in anondata

rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index]

change the type to int

rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].dtypes rna.var["chromStart"] = rna.var["chromStart"].astype(int) rna.var["chromEnd"] = rna.var["chromEnd"].astype(int)

oldvalley49 commented 1 month ago

thanks for getting back; 1 works well for me. for anyone encountering similar issues in the future: if you are working with data from 10x genomics you can download their reference file from their website.