Open oldvalley49 opened 1 month ago
same problem for me, do you solve it?
The same problem and it came always in my 3 datasets. First I tried to remove these NA gene's rows of rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]], and everything went well in the step1 of pre-treating data with 0 bugs. But it came a new problem in the beginning of step2, the model training, that I can't load RNA-pp.h5ad files came from step1, just because of the before deletion of these NA rows: rna = ad.read_h5ad("rna_pp.h5ad") ValueError: Variables annot. var must have number of columns of X (11787), but has 10215 rows.
Then I tried to fill these "chrom" NA by "chrn" and "chromStart&end" NA by some random number, but it still couldn't work when I ran: guidance = scglue.genomics.rna_anchored_guidance_graph(rna, atac) and the error was: ValueError: Not all features are strand specific!
So how should this problem be handled? Around 10% genes fail to assign the ranges and how to do with these genes NA rows? Thanks for advance.
Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:
gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz
with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;adata.var
along with the corresponding columns in adata.X
, which can be done with rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index]
.Let me know if these solutions work.
Hi all, and thank you for the report! This is likely caused by the fact that the GTF file being used does not contain annotation for these genes. You may deal with this in two ways:
- Replace
gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz
with the same GTF file used in single-cell expression quantification, which should guarantee that the gene sets match precisely;- Remove those genes from
adata.var
along with the corresponding columns inadata.X
, which can be done withrna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index]
.Let me know if these solutions work.
Thanks for reply. I tried 2 and it did work well! Here's my codes: I used 'gencode.v46.chr_patch_hapl_scaff.annotation.gtf.gz' and had 1572 genes which couldn't be matched.
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 11787 rows
na_rows = rna.var.loc[:, ["chromStart", "chromEnd"]].isna().any(axis=1) rna.var[na_rows].loc[:, ["chrom", "chromStart", "chromEnd"]] #1572 NA rows
rna.var.dropna(subset=["chromStart", "chromEnd"], inplace=True) rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].shape[0] #sum 10215 rows
rna = rna[:, rna.var.dropna(subset=["chrom", "chromStart", "chromEnd"]).index]
rna.var.loc[:, ["chrom", "chromStart", "chromEnd"]].dtypes rna.var["chromStart"] = rna.var["chromStart"].astype(int) rna.var["chromEnd"] = rna.var["chromEnd"].astype(int)
thanks for getting back; 1 works well for me. for anyone encountering similar issues in the future: if you are working with data from 10x genomics you can download their reference file from their website.
Hello!
Thank you for developing this tool. I wanted to reach out to see if something was going wrong with my code. I'm using dataset preprocessed with Seurat for GLUE and following the tutorial for scRNA and scATAC integration. When I run this code segment:
scglue.data.get_gene_annotation( pbmc_rna, gtf="gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz", gtf_by="gene_name" )
many of the genes end up getting NaN in chrom, chromStart, and chromEnd. The genes that fail to assign the ranges seem to be those that start with AL such as AL627309.1, AL590822.1. How can I fix this issue? Thank you in advance.