BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

RNA seq read length #401

Closed: imkeller closed this issue 3 years ago

imkeller commented 4 years ago

Dear developers,

I would like to access the read length of RNA sequencing data using the TCGAbiolinks package. The information is shown in the "Read Groups" section of the metadata for a BAM file, for example here: https://portal.gdc.cancer.gov/files/9a27ecb6-4d6b-4b5f-ac24-7da3b67b55cd

Is there a way to access this information?

Many thanks,
Katharina

tiagochst commented 4 years ago

Hi,

You cannot do that with TCGAbiolinks, but you can with the GenomicDataCommons package: https://rpubs.com/tiagochst/GDC_read_length

Would that work?
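
For readers who want to see the shape of such a request outside R, here is a rough sketch that only builds a query body for the GDC REST API `files` endpoint and prints it. It is not the code from the link above. The field names (`analysis.metadata.read_groups.read_length`, `cases.samples.submitter_id`) and the filter values are assumptions to check against the GDC field listing, `TCGA-BRCA` is just an example project, and `build_read_length_query` is a hypothetical helper name.

```python
import json

# Public GDC API endpoint for file metadata (the request is built, not sent).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_read_length_query(project_id, size=100):
    """Build a request body asking for read-group read lengths of RNA-Seq BAM files.

    Field names and filter values are assumptions; verify them against the
    GDC documentation before relying on the result.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {"field": "experimental_strategy", "value": ["RNA-Seq"]}},
            {"op": "in", "content": {"field": "data_format", "value": ["BAM"]}},
        ],
    }
    fields = [
        "file_name",
        "cases.samples.submitter_id",
        "analysis.metadata.read_groups.read_length",
    ]
    return {
        "filters": filters,
        "fields": ",".join(fields),  # the API expects a comma-separated string
        "format": "JSON",
        "size": size,
    }


payload = build_read_length_query("TCGA-BRCA")  # example project only
print(json.dumps(payload, indent=2))
```

The printed body could then be POSTed as JSON to `GDC_FILES_ENDPOINT` with any HTTP client; the sketch stops short of that so it runs without network access.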

imkeller commented 4 years ago

Yes, perfect, this works. Thank you!

imkeller commented 4 years ago

I just figured out that there is a problem with this way of accessing read length: it only lets me match the read length at the patient level. However, one patient may have multiple samples with multiple sequencing runs that differ in read length. I could not find any file name or identifier at the sequencing-run level that would let me link the results of GDCquery() to the read length obtained from GenomicDataCommons. Do you have an idea how to solve this?

tiagochst commented 4 years ago

@imkeller I made some changes to retrieve sample-level information instead of patient-level information: https://rpubs.com/tiagochst/Read_length_GDC

Could you give me more details about which files from GDCquery() you want to match?

imkeller commented 4 years ago

OK, I managed to match the file names of the RNA-seq counts by using the 'downstream_analyses.submitter_id' entry. Thanks!
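
In case it helps others, the matching step can be sketched roughly as follows. This is an illustration on made-up records, not the code from the linked notebooks: it assumes each BAM record carries its read groups under `analysis.metadata.read_groups` and a list of `downstream_analyses` with a `submitter_id`. The thread does not spell out how the key is derived from a count file name, so the sketch takes it as a precomputed, hypothetical `match_key` column. All function names, file names and IDs are invented.

```python
from collections import defaultdict


def read_lengths_by_downstream_id(bam_records):
    """Map each downstream-analysis submitter_id to the read lengths of its source BAM."""
    lengths = defaultdict(set)
    for rec in bam_records:
        read_groups = rec.get("analysis", {}).get("metadata", {}).get("read_groups", [])
        rg_lengths = {
            rg["read_length"] for rg in read_groups if rg.get("read_length") is not None
        }
        for analysis in rec.get("downstream_analyses", []):
            submitter_id = analysis.get("submitter_id")
            if submitter_id is not None:
                lengths[submitter_id] |= rg_lengths
    return dict(lengths)


def annotate_counts(count_files, lengths):
    """Attach a sorted list of read lengths to each count-file row (empty if unmatched)."""
    return [
        dict(row, read_lengths=sorted(lengths.get(row["match_key"], ())))
        for row in count_files
    ]


# Made-up BAM metadata records, shaped like nested API results.
bam_records = [
    {
        "file_name": "sample_A.bam",
        "analysis": {"metadata": {"read_groups": [{"read_length": 48}, {"read_length": 48}]}},
        "downstream_analyses": [{"submitter_id": "run_A_count"}],
    },
    {
        "file_name": "sample_B.bam",
        "analysis": {"metadata": {"read_groups": [{"read_length": 76}]}},
        "downstream_analyses": [{"submitter_id": "run_B_count"}],
    },
]

# Made-up count-file rows; match_key is the hypothetical join column.
count_files = [
    {"file_name": "a.counts.gz", "match_key": "run_A_count"},
    {"file_name": "b.counts.gz", "match_key": "run_B_count"},
    {"file_name": "c.counts.gz", "match_key": "run_C_count"},
]

annotated = annotate_counts(count_files, read_lengths_by_downstream_id(bam_records))
for row in annotated:
    print(row["file_name"], row["read_lengths"])
```

Keeping the read lengths as a set per key means a count file whose source BAM mixes read groups of different lengths shows all of them, and a count file with no matching BAM ends up with an empty list rather than being dropped.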