AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Add newer and additional RNA-seq supported platforms #3281

Closed jaclyn-taroni closed 1 year ago

jaclyn-taroni commented 1 year ago

Issue Number

Closes #3280

Purpose/Implementation Notes

Here I'm adding newer (and some older) Illumina instrument models to our list of supported RNA-seq platforms.

Methods

I am requesting review from both David and Josh because of the methods described below.

My approach was to snag a TSV of run metadata from the European Nucleotide Archive (ENA), restricted to human or mouse RNA-seq data with raw reads generated on an Illumina platform:

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'result=read_run&query=(tax_eq(9606)%20OR%20tax_eq(10090))%20AND%20instrument_platform%3D%22ILLUMINA%22%20AND%20library_strategy%3D%22RNA-Seq%22&fields=study_accession%2Cinstrument_model%2Ctax_id&format=tsv' "https://www.ebi.ac.uk/ena/portal/api/search"
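The percent-encoded body above is hard to read and easy to mistype. Here is a minimal sketch of an equivalent request that lets curl's `--data-urlencode` option do the encoding. The function name `fetch_ena_runs`, the variable names, and the output file argument are illustrative and not part of this PR. curl encodes spaces and parentheses differently from the hand-written body, so the bytes sent will differ, but the form fields should decode to the same values.

```shell
# Human-readable versions of the query and field list from the request above.
ENA_QUERY='(tax_eq(9606) OR tax_eq(10090)) AND instrument_platform="ILLUMINA" AND library_strategy="RNA-Seq"'
ENA_FIELDS='study_accession,instrument_model,tax_id'

# Hypothetical helper: POST the search to the ENA portal API and save the TSV.
# $1: path of the output file (an assumption; the PR does not say where the
#     original command's output was written)
fetch_ena_runs() {
  curl -sS --fail -X POST "https://www.ebi.ac.uk/ena/portal/api/search" \
    --data-urlencode "result=read_run" \
    --data-urlencode "query=${ENA_QUERY}" \
    --data-urlencode "fields=${ENA_FIELDS}" \
    --data-urlencode "format=tsv" \
    -o "$1"
}
```

Calling `fetch_ena_runs results_read_run.tsv` (a made-up filename) would then write the result to disk instead of printing it to the terminal.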

Then over to R:

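# Load the tidyverse for read_csv(), read_tsv(), %>%, pull(), filter(), and sample_n()
library(tidyverse)

# Currently supported platforms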
refinebio_supported_df <- read_csv("https://raw.githubusercontent.com/AlexsLemonade/refinebio/ac3170a8cfd14a754282334204c048182963dac9/config/supported_rnaseq_platforms.txt", col_names = FALSE)

# RNA-seq mouse or human data generated on Illumina platform
instrument_models_df <- read_tsv("Downloads/results_read_run_tsv.txt")

# Get a list of unique instrument models for RNA-seq of mouse and human
unique_illumina_models <- instrument_models_df %>% pull(instrument_model) %>% unique()

Then we can look at what platforms are in the public data but not yet in our supported list of platforms with:

setdiff(unique_illumina_models, refinebio_supported_df[[1]])
 [1] "Illumina MiSeq"        "HiSeq X Ten"           "Illumina MiniSeq"     
 [4] "Illumina NovaSeq 6000" "NextSeq 1000"          "Illumina NovaSeq X"   
 [7] "Illumina iSeq 100"     "NextSeq 2000"          "unspecified"          
[10] "HiSeq X Five"          "Illumina HiSeq X"  
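For anyone who wants to repeat this comparison without R, the same set difference can be sketched in shell. This is an alternative to the `setdiff()` call, not what was run for this PR. It assumes the ENA TSV has a header row containing an `instrument_model` column and that the supported-platforms file lists one platform name per line, matched exactly. The function name `unsupported_models` is hypothetical.

```shell
# Print instrument models that appear in an ENA TSV but not in the supported list.
# $1: ENA result TSV with a header row that includes "instrument_model"
# $2: supported-platforms file, one platform name per line
unsupported_models() {
  awk -F '\t' '
    # Locate the instrument_model column by name rather than by position.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "instrument_model") col = i; next }
    col     { print $col }
  ' "$1" |
    sort -u |
    # -F: fixed strings, -x: whole-line match, -v: keep lines with no match
    grep -Fxv -f "$2"
}
```

With the files used in this PR, the call would look like `unsupported_models Downloads/results_read_run_tsv.txt config/supported_rnaseq_platforms.txt`. Note that `grep` exits with status 1 when every model is already supported, which is worth handling if this runs under `set -e`.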

I included the platforms that Illumina currently positions for transcriptome sequencing: Illumina NovaSeq 6000, NextSeq 1000, Illumina NovaSeq X, NextSeq 2000.

Somewhat obviously, I will not be including “unspecified.” There are also a few benchtop sequencers that I didn’t include at this point (MiSeq, iSeq 100, MiniSeq).

For the HiSeq X platforms, I spot-checked a few experiments with this methodology (example using HiSeq X Five):

instrument_models_df %>% filter(instrument_model == "HiSeq X Five") %>% sample_n(5)

It seemed reasonable to add them.

Types of changes

What types of changes does your code introduce?

To my knowledge

Functional tests

N/A

Checklist

Put an x in the boxes that apply.

jaclyn-taroni commented 1 year ago

> LGTM. I might include MiSeq, depending on how many samples are out there. I know it was a pretty popular instrument, especially for smaller organisms. So if we expect to add new yeast data, I would keep it in there.

I checked the number of RNA-seq and transcriptomic samples assayed on Illumina MiSeq. There were some yeast, yes, but the majority were human and mouse. I spot-checked some random human and mouse experiments, and it wasn't obvious to me that we shouldn't support them, so I added that platform in https://github.com/AlexsLemonade/refinebio/pull/3281/commits/8bc04bb7193e85b6986186a270b3bf9e2d471d52.
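A per-organism tally like the one described above can be sketched in shell as follows. This is an illustration, not the query that was actually used for the MiSeq check. It assumes a TSV with a header row containing `instrument_model` and the grouping column (`tax_id` by default), and the function name `count_runs_by` is hypothetical. The TSV from the PR description was restricted to human and mouse, so seeing yeast or other organisms would need a broader ENA query.

```shell
# Count runs for one instrument model, grouped by a column (default: tax_id).
# $1: ENA result TSV with a header row
# $2: instrument model to match exactly, e.g. "Illumina MiSeq"
# $3: optional name of the column to group by
count_runs_by() {
  awk -F '\t' -v model="$2" -v group="${3:-tax_id}" '
    NR == 1 {
      # Find both columns by header name.
      for (i = 1; i <= NF; i++) {
        if ($i == "instrument_model") m = i
        if ($i == group) g = i
      }
      next
    }
    m && g && $m == model { n[$g]++ }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" |
    # Largest groups first; ties broken by group value.
    sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Passing `study_accession` as the third argument gives run counts per study instead, which is one way to pick experiments to spot-check.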

jashapiro commented 1 year ago

👍🏼 It wouldn't surprise me if there were a lot of pilot studies in there. Or people tired of waiting for the core.