denalitherapeutics / archs4

An R interface to query and extract data from the ARCHS4 data

Some meta/genes and meta/transcript entries cannot be associated to official ensembl annotation #4

Open lianos opened 6 years ago

lianos commented 6 years ago

Although we are using the same version of the ensembl gtf files as are used within the ARCHS4 data processing pipeline, there are some genes and transcripts that are not successfully matched up in the create_augmented_feature_info function.

These were the GTF files used to attempt to match gene symbols and transcript identifiers:

lianos commented 6 years ago

Try this to quickly see which gene-level annotations we can't match up:

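library(archs4)
a4 <- Archs4Repository()
# Load the same series twice: once keyed by gene symbol, once by Ensembl ID
ys <- as.DGEList(a4, "GSE89189", feature_type = "gene", row_id = "symbol")
ye <- as.DGEList(a4, "GSE89189", feature_type = "gene", row_id = "ensembl")
# Features in the symbol-keyed result whose h5idx is missing from the Ensembl-keyed one
b0rkd <- subset(ys$genes, !h5idx %in% ye$genes$h5idx)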

There are 7979 features in b0rkd! That is to say: there are ~8k meta/genes entries ("human gene symbols") that we can't find in the gene_name column of the Homo_sapiens.GRCh38.90.gtf annotations.
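
For anyone who wants to check the symbol lookup outside of the package, here is a minimal base-R sketch of the same idea: pull gene_name out of GTF attribute strings and list the symbols that have no match. The attribute strings and symbols below are made-up toy values, and extract_gtf_attribute is a hypothetical helper written for this example; it is not part of archs4 and is not necessarily how create_augmented_feature_info does the matching.

```r
# Toy stand-ins for the real inputs (made-up values, not real annotations).
gtf_attributes <- c(
  'gene_id "TOY0001"; gene_name "GENEA"; gene_biotype "protein_coding";',
  'gene_id "TOY0002"; gene_name "GENEB"; gene_biotype "lincRNA";',
  'gene_id "TOY0003"; gene_biotype "pseudogene";'  # no gene_name at all
)
archs4_symbols <- c("GENEA", "GENEB", "GENEC")

# Hypothetical helper: return the quoted value stored under `key` in each
# GTF attribute string, or NA when the key is absent.
extract_gtf_attribute <- function(attributes, key) {
  pattern <- sprintf('.*%s "([^"]*)".*', key)
  found <- grepl(pattern, attributes)
  values <- rep(NA_character_, length(attributes))
  values[found] <- sub(pattern, "\\1", attributes[found])
  values
}

gene_names <- extract_gtf_attribute(gtf_attributes, "gene_name")

# Symbols that cannot be found in the gene_name column
unmatched <- setdiff(archs4_symbols, gene_names)
print(unmatched)
```

With the real data, gtf_attributes would be the ninth column of Homo_sapiens.GRCh38.90.gtf and archs4_symbols would be the meta/genes entries, so unmatched should correspond to the symbols behind the b0rkd rows.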