Closed (HelenePetersen closed this issue 2 years ago)
Hi Helene,
I applied your suggestion and got an error at this call:
getdata <- getGeneLengthAndGCContent(biomart_getID$ensembl_gene_id, org = "hsa", mode = c("biomart"))
The error message is:
Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 240556795 out of -1 bytes received
Did you also face the same error? If so, how did you deal with it?
Thanks!
We have not experienced this issue. It looks like it is related to the connection to biomart.
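For readers hitting the same timeout, one possible workaround is to query the IDs in smaller batches and retry a batch when the connection fails. The following is only a minimal sketch in base R, not the code used in this thread: `fetch_in_chunks`, `fetch_fun`, `chunk_size`, `max_tries` and `wait_sec` are hypothetical names, and `fake_fetch` is an offline stand-in for the real query so the example runs without network access.

```r
# Hypothetical helper: query IDs in small batches and retry failed batches.
# `fetch_fun` is any function that takes a character vector of IDs and
# returns a matrix with one row per ID.
fetch_in_chunks <- function(ids, fetch_fun, chunk_size = 500,
                            max_tries = 3, wait_sec = 0) {
  # Assign each ID to a batch number: 1, 1, ..., 2, 2, ...
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  results <- lapply(chunks, function(chunk) {
    for (attempt in seq_len(max_tries)) {
      out <- tryCatch(fetch_fun(chunk), error = function(e) e)
      if (!inherits(out, "error")) return(out)
      message("Attempt ", attempt, " failed: ", conditionMessage(out))
      Sys.sleep(wait_sec)
    }
    stop("Batch failed after ", max_tries, " attempts")
  })
  # Stack the per-batch matrices back into one table.
  do.call(rbind, unname(results))
}

# Offline stand-in for the real query (assumption, for illustration only).
# It fails on the first call to exercise the retry path.
calls <- 0
fake_fetch <- function(chunk) {
  calls <<- calls + 1
  if (calls == 1) stop("Timeout was reached")
  matrix(c(nchar(chunk), rep(0.5, length(chunk))), ncol = 2,
         dimnames = list(chunk, c("length", "gc")))
}

ids <- c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419")
res <- fetch_in_chunks(ids, fake_fetch, chunk_size = 2)

# With the real function, the call might look like this (untested here,
# requires network access and the EDASeq package):
# real_fetch <- function(chunk) {
#   EDASeq::getGeneLengthAndGCContent(chunk, org = "hsa", mode = "biomart")
# }
# getdata <- fetch_in_chunks(biomart_getID$ensembl_gene_id, real_fetch)
```

Smaller batches keep each request short, so a single dropped connection costs one batch rather than the whole multi-hour run. Whether this avoids the timeout depends on the state of the Ensembl server at the time.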
@HelenePetersen Thank you so much for this. I am working on updating it.
@HelenePetersen By any chance, could you send the data?
@HelenePetersen The data should be fixed now.
Hi, I have experienced a problem with the function TCGAanalyze_Normalization and the geneInfoHT table. I am trying to download, prepare and normalize BRCA RNASeq data from TCGA, and therefore I'm running the following:
The output after normalization contains far fewer genes than I would expect:
33228 genes are removed out of 56380 genes in the original dataset. A similar issue is also reported in #164.
On investigation, this appears to be caused by missing information about many of the ENSEMBL IDs in the geneInfoHT table, since the table only contains information about 23486 genes:
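As an illustration of this kind of coverage check (not the original output from this issue), the sketch below counts how many row names of a count matrix have no entry in an annotation table. The data are toy stand-ins: `counts`, `gene_info` and the `ENSG_*` IDs are hypothetical, and the `geneLength` / `gcContent` column layout is an assumption about how geneInfoHT is organised.

```r
# Toy count matrix with (made-up) ENSEMBL-style IDs as row names.
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
                                 c("s1", "s2")))

# Toy annotation table in the assumed layout of geneInfoHT.
gene_info <- data.frame(geneLength = c(1000, 2500),
                        gcContent = c(0.41, 0.55),
                        row.names = c("ENSG_A", "ENSG_C"))

# IDs present in the data but absent from the annotation table.
missing_ids <- setdiff(rownames(counts), rownames(gene_info))

# Rows that would survive a filter on annotated genes.
kept <- counts[rownames(counts) %in% rownames(gene_info), , drop = FALSE]

cat(length(missing_ids), "of", nrow(counts), "genes lack annotation\n")
```

Running the same comparison on the real count matrix and the packaged table would show which ENSEMBL IDs are dropped for lack of annotation.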
As a solution I have created a new table by calculating GC content and gene length values for all ENSEMBL IDs in biomart using the getGeneLengthAndGCContent() function from EDASeq. This creates a complete drop-in replacement table for the existing geneInfoHT when using TCGAanalyze_Normalization, so that no genes are removed when performing GC content normalization.
We have used the following code to obtain the complete table (~8 hours of running time):
If this looks fine to you, I can open a PR to replace the current geneInfoHT table with this one. Please let me know, thank you!