LieberInstitute / recount3

Explore and download data from the recount3 project
http://lieberinstitute.github.io/recount3

[Feature Request] Restore information from Matrix Market files in recount3 #40

Open jiapeiyuan17 opened 9 months ago

jiapeiyuan17 commented 9 months ago

Hi Ben and Kasper,

We are conducting a project that uses data from the GTEx project. We are particularly interested in the resources provided by recount3 and would like clarification on two specific points:

  1. In your methods, you mention that "When STAR performs spliced alignment, it outputs a high-confidence collection of splice-junction calls in a file named (SJ.out.tab)". In recount3, we can get the Matrix Market files. Can you confirm whether these aggregated files contain the information found in the last three columns of the SJ.out.tab file?
  2. If so, is there a way to convert a Matrix Market file back to a BED file with the junction read counts?

Your prompt response to these inquiries would be greatly appreciated. Thank you for your attention to this matter.

Best, Jiapei

ChristopherWilks commented 9 months ago

Hi Jiapei,

Thanks for your interest in recount3!

For 1., the recount3 Matrix Market files are derived from the SJ.out.tab files aggregated across the samples for a particular study (or tissue, in the case of GTEx v8). I'll have to double-check whether we did any additional filtering (it's been a while), but they should contain the vast majority of what was in the SJ.out.tab files.

For 2., given that you want the splice junctions in a BED file of counts, you're probably best off using Snaptron's reformatted version of the GTEx v8 junctions in recount3:

https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz

The header file is: https://snaptron.cs.jhu.edu/data/junctions.header.tsv

You'll also want to (minimally) download the GTEx samples description TSV: https://snaptron.cs.jhu.edu/data/gtexv2/samples.tsv

where the rail_id column (the first column) is the sample ID. These IDs appear in the comma-delimited list in the samples field of the junctions file, which records, for each junction, the GTEx samples it appears in (those with at least one supporting read). That field also contains the junction's spliced read count for each sample, with each entry pairing a sample ID and its count around a colon, e.g. rail_id:count,...
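To illustrate how the samples field and the samples TSV fit together, here is a minimal Python sketch. It assumes each entry in the samples field has the form rail_id:count and that empty pieces (for example, from a leading comma) can be skipped. The function names and the example values are hypothetical, and which column holds the samples field should be taken from the header file rather than from this sketch.

```python
import csv
import io


def parse_samples_field(field):
    """Split a 'rail_id:count,rail_id:count' string into {rail_id: count}.

    Assumption: entries are comma-separated and each one is 'rail_id:count'.
    """
    counts = {}
    for entry in field.split(","):
        if not entry:
            continue  # skip empty pieces, e.g. produced by a leading comma
        rail_id, count = entry.split(":")
        counts[rail_id] = int(count)
    return counts


def load_samples(samples_tsv):
    """Read a samples TSV (open file object) keyed by its first column.

    The thread states that the first column is rail_id; the remaining
    column names come from the file's own header line.
    """
    reader = csv.reader(samples_tsv, delimiter="\t")
    header = next(reader)
    return {row[0]: dict(zip(header, row)) for row in reader if row}


# Hypothetical values, only to show the lookup.
samples = load_samples(io.StringIO("rail_id\texample_column\n101\tA\n205\tB\n"))
for rail_id, count in parse_samples_field(",101:3,205:12").items():
    print(rail_id, count, samples[rail_id]["example_column"])
```

With the per-sample counts in hand, writing one BED line per junction and sample is a matter of combining them with the junction's coordinate columns as named in the header file.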

Chris

ChristopherWilks commented 9 months ago

Also, I should point out that the .bgz file is in a gzip-compatible block-gzip format that can be read by gzip or pigz. There's also a Tabix index file, https://snaptron.cs.jhu.edu/data/gtexv2/junctions.bgz.tbi, which you can use to quickly query the junctions within a genomic coordinate range.
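Because block-gzip is a series of concatenated gzip members, Python's standard gzip module should also be able to stream the file. The sketch below assumes a locally downloaded copy and tab-separated rows; the function name and the path are hypothetical. Range queries through the .tbi index need a Tabix-aware tool (the tabix command line program, or a library such as pysam) and are not shown here.

```python
import gzip


def iter_junction_rows(path):
    """Yield each line of a gzip-compatible file as a list of tab-separated fields."""
    # gzip.open reads concatenated gzip members in sequence, which is how
    # block-gzip files are laid out.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


# Example use with a hypothetical local path:
# for fields in iter_junction_rows("junctions.bgz"):
#     print(fields[:4])
```

Streaming this way keeps memory use low, which matters for a junctions file that covers all GTEx samples.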

lcolladotor commented 9 months ago

Hi,

It sounds like, thanks to Chris, we can close this issue. Is that right, Jiapei?

Best, Leo