Syksy / curatedPCaData

Bioconductor R-package: Curated Prostate Cancer Data
Creative Commons Attribution 4.0 International

ENSG ids in Friedrich #15

Closed Syksy closed 3 years ago

Syksy commented 3 years ago

Some of the gex rows in Friedrich have Ensembl gene IDs (ENSG#######) instead of gene names:

> head(grep("ENSG00", rownames(mae_friedrich[["gex"]]), value=TRUE))
[1] "ENSG00000083622" "ENSG00000115934" "ENSG00000121388" "ENSG00000124593" "ENSG00000124835" "ENSG00000132832"
> length(grep("ENSG00", rownames(mae_friedrich[["gex"]]), value=TRUE))
[1] 7241

These need to be homogenized so that all rows use HUGO symbols.

Fedster commented 3 years ago

I’ll do that

Cheers

F


Fedster commented 3 years ago

OK, I reran the data thinking I had made a mistake, and I re-pushed the Friedrich gex, but the issue is in curatedPCaData_genes:

> curatedPCaData_genes[which(curatedPCaData_genes[,1] == 'ENSG00000083622'), ]
       ensembl_gene_id ensembl_transcript_id hgnc_symbol refseq_mrna
218329 ENSG00000083622       ENST00000456270
       chromosome_name start_position end_position
218329               7      117604791    117647415
                                description
218329 novel transcript, antisense to CFTR

These 6 Ensembl gene IDs do not have an hgnc_symbol in the table.
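A minimal sketch of how such unmapped IDs could be listed in one go, assuming a lookup table with the same `ensembl_gene_id` and `hgnc_symbol` columns as curatedPCaData_genes. The table `genes` and the vector `ids` below are toy stand-ins, not the package's actual objects; only the first row reflects the output above, and the second row is added purely for illustration.

```r
# Toy stand-in for the annotation table (hypothetical object, not the real one).
# Row 1 mirrors the thread: an Ensembl gene ID with an empty hgnc_symbol.
genes <- data.frame(
  ensembl_gene_id = c("ENSG00000083622", "ENSG00000141510"),
  hgnc_symbol     = c("", "TP53"),
  stringsAsFactors = FALSE
)

# IDs to check, e.g. the ENSG-style rownames of an expression matrix
ids <- c("ENSG00000083622", "ENSG00000141510")

# An ID counts as mapped only if some row gives it a non-empty, non-NA symbol
mapped_ids <- genes$ensembl_gene_id[!is.na(genes$hgnc_symbol) & genes$hgnc_symbol != ""]
unmapped <- ids[!(ids %in% mapped_ids)]
unmapped
```

Checking for both `NA` and the empty string matters here, because the printed row shows a blank rather than `NA` in the hgnc_symbol column.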

Cheers

Federico


Syksy commented 3 years ago

Thanks. This comes down to differences in gene annotations and how they are structured in various databases. ENSG genes and ENST transcripts don't have a 1:1 mapping to HUGO gene symbols, so in cases of ambiguity or missing symbols we'll most likely have to either collapse multiple instances or omit genes without a gene symbol.
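The two options mentioned above (collapsing and omitting) can be sketched in base R roughly as follows. This is an illustration under assumptions, not the code used in the package: the matrix `gex`, the IDs, the symbols, and the values are all made up, and averaging is just one possible collapsing rule, since the thread does not say which one is used.

```r
# Toy expression matrix: rows are gene IDs, columns are samples (made-up values)
gex <- matrix(
  c(1, 2,
    3, 4,
    5, 6,
    7, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"), c("s1", "s2"))
)

# Hypothetical lookup: two IDs share one symbol, one ID has no symbol at all
lookup <- c(ENSG_A = "GENE1", ENSG_B = "GENE1", ENSG_C = "", ENSG_D = "GENE2")

# Translate rownames to symbols and omit rows without a symbol
symbols <- unname(lookup[rownames(gex)])
keep <- !is.na(symbols) & symbols != ""

# Collapse rows sharing a symbol: rowsum() adds them up per group,
# and dividing by the group size turns the sums into means
sums <- rowsum(gex[keep, , drop = FALSE], group = symbols[keep])
counts <- as.vector(table(symbols[keep])[rownames(sums)])
collapsed <- sums / counts
collapsed
```

With real data, a median or a most-variable-probe rule might be preferable to the mean, depending on the platform.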

Syksy commented 3 years ago

Friedrich et al. has now been processed from raw data in v0.6.21 using the limma pipeline for Agilent one-color arrays in generate.R. This includes mapping to HUGO symbols and collapsing probes, while removing rows without HUGO symbols.