RNA-seq compendia missing samples for large studies

erflynn commented 4 years ago

Context

I downloaded the RNA-seq compendia for mouse and human (3/10/2020) using the links on the website.

Link for mouse: https://data-refinery-s3-compendia-circleci-prod.s3.amazonaws.com/MUS_MUSCULUS_1_1574233541.zip

Link for human: https://data-refinery-s3-compendia-circleci-prod.s3.amazonaws.com/HOMO_SAPIENS_1_1574170428.zip

Problem

I found that no study directory contains more than 100 files, meaning that no study has more than 100 samples of RNA-seq data. As a result, the compendia contain 122885 out of 229789 samples for human and 128079 out of 359142 samples for mouse. In each pair, the larger number is the sample count from the aggregated metadata file (not the full count in the list of experiments, which is higher), and the smaller number is the file count.

Did I download the right set of files? Is this something intentional to limit overall size? If so, how can I download the full set of files for these studies?

I have attached lists of studies with over 100 samples that have only 100 samples in each directory for mouse and human (the files are .csv but uploaded as .txt so that they work on GitHub). The columns are study_acc: the study accession; num_samples: the number of samples listed in the aggregated metadata file; num_files: the number of quant.sf files present in the directory; and num_missing: num_samples - num_files.

human_over_100_missing.txt mouse_over_100_missing.txt
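For readers who want to rebuild a table with these four columns, here is a minimal sketch. It is not the script used to produce the attached files. It assumes the compendium is unzipped so that each study accession is a directory containing files with "quant.sf" in their names, and that aggregated_metadata.json has an "experiments" mapping whose entries carry a "sample_accession_codes" list (the same layout the validation script later in this thread relies on). The function names and the output path are hypothetical.

```python
import csv
import json
import os

COLUMNS = ["study_acc", "num_samples", "num_files", "num_missing"]


def count_quant_files(study_dir):
    """Count files whose name contains 'quant.sf'; return 0 if the directory is absent."""
    if not os.path.isdir(study_dir):
        return 0
    return sum(1 for name in os.listdir(study_dir) if "quant.sf" in name)


def missing_table(metadata, root="."):
    """Return one row per study that has fewer quant.sf files than listed samples.

    ASSUMPTION: metadata["experiments"][acc]["sample_accession_codes"] exists.
    """
    rows = []
    for study_acc, experiment in metadata["experiments"].items():
        num_samples = len(experiment["sample_accession_codes"])
        num_files = count_quant_files(os.path.join(root, study_acc))
        if num_files < num_samples:
            rows.append(
                {
                    "study_acc": study_acc,
                    "num_samples": num_samples,
                    "num_files": num_files,
                    "num_missing": num_samples - num_files,
                }
            )
    return rows


def write_table(rows, out_path):
    """Write the rows as CSV with the column order used in the attachments."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)


# Example usage, run from the unzipped compendium directory (paths are assumptions):
#   with open("aggregated_metadata.json") as f:
#       rows = missing_table(json.load(f))
#   write_table(rows, "over_100_missing.csv")
```

Treating an absent study directory as zero files, rather than raising an error, keeps the table complete even when an entire study was dropped from the download.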

kurtwheeler commented 3 years ago

After regenerating the human RNA-seq compendia, I ran the following script to validate it:

import json
import os

with open("aggregated_metadata.json") as f:
    metadata = json.load(f)

total_missing = 0
num_missing_experiments = 0
for experiment_accession_code, experiment in metadata["experiments"].items():
    num_expected = len(experiment["sample_accession_codes"])
    num_found = len(
        [filename for filename in os.listdir(experiment_accession_code) if "quant.sf" in filename]
    )

    if num_expected != num_found:
        num_missing_experiments += 1
        total_missing += num_expected - num_found
        print(f"{experiment_accession_code}: {num_found}/{num_expected}")

print(
    f"A total of {total_missing} quant.sf files were missing from {num_missing_experiments} experiments."
)

Its final output was: A total of 75891 quant.sf files were missing from 3023 experiments.

We still seem to be missing ~75k samples from this quantpendia... Why?

kurtwheeler commented 3 years ago

Oh I forgot to mention that we are expecting some samples to be missing for various reasons, but way fewer than we actually see missing:

$ cat filtered_samples_metadata.tsv | wc -l
1826

I wonder if we're just failing to add all the samples we filter to that dict as we go...
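One way to test that hunch is to split the missing samples into those that filtered_samples_metadata.tsv accounts for and those it does not. The sketch below is only an illustration under stated assumptions: the column holding the sample accession code and the presence of a header row are guesses about the TSV layout, and both function names are hypothetical. Note also that `wc -l` counts a header line if the file has one, so 1826 lines may correspond to 1825 samples.

```python
import csv


def read_filtered_codes(tsv_path, column=0, has_header=True):
    """Read sample accession codes from one column of a TSV file.

    ASSUMPTION: the column index and the header row are guesses about the
    layout of filtered_samples_metadata.tsv; adjust them to the real file.
    """
    with open(tsv_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header line
        return {row[column] for row in reader if len(row) > column}


def split_missing(expected, found, filtered):
    """Split missing samples into (explained, unexplained) sets.

    expected: sample codes listed in the aggregated metadata
    found:    sample codes that have a quant.sf file on disk
    filtered: sample codes recorded as deliberately filtered out
    """
    missing = set(expected) - set(found)
    explained = missing & set(filtered)
    unexplained = missing - explained
    return explained, unexplained
```

If the unexplained set stays close to the full missing count, the filter bookkeeping alone cannot explain the gap, which is what the numbers above already suggest: at most 1826 filtered entries against 75891 missing files.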

erflynn commented 3 years ago

Thanks for looking into this! ~76k is still quite a bit less than what I had missing (how many do you have total?). Is it possible this relates to truncated or empty files? I definitely see a chunk of them.
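A quick way to flag empty or truncated files is sketched below. It is a heuristic, not a check used by refine.bio: it assumes each file is a Salmon quant.sf with five tab-separated columns (Name, Length, EffectiveLength, TPM, NumReads) and that every line, including the last, ends with a newline. A file cut off exactly at a line boundary would pass, so an "ok" result does not prove completeness. The function name is hypothetical.

```python
import os

# Salmon's quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
EXPECTED_COLUMNS = 5


def classify_quant_file(path):
    """Return 'empty', 'truncated', or 'ok' for a single quant.sf file.

    ASSUMPTION: a complete file ends with a newline and its last line has
    exactly EXPECTED_COLUMNS tab-separated fields.
    """
    if os.path.getsize(path) == 0:
        return "empty"

    # Walk the file to find its final line without loading it all in memory.
    last_line = b""
    with open(path, "rb") as f:
        for line in f:
            last_line = line

    if not last_line.endswith(b"\n"):
        return "truncated"  # writing stopped in the middle of a line
    fields = last_line.rstrip(b"\r\n").split(b"\t")
    if len(fields) != EXPECTED_COLUMNS:
        return "truncated"  # the final row is missing columns
    return "ok"
```

Running this over every file with "quant.sf" in its name and tallying the three labels would show whether empty and truncated files are a separate problem from files that are absent altogether.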

Q - is this updated / can you let me know when it is? I am wrapping up the paper that uses these data for inferring sex labels and I would love to have the expanded dataset.

kurtwheeler commented 3 years ago

It is updated! I was testing on the most recent one, which is what you'll get if you go to https://www.refine.bio/compendia?c=rna-seq-sample

I don't actually have that number off the top of my head (already freed the space back up), but it sounds like you plan on redownloading it anyway. This seems like it's entirely separate from the truncated/empty files.

We're wrapping up another project we had going on for a while, so I've shifted most of my focus back to refinebio. I'll be digging into this and the truncated files issue more and once I sort them out I'll regenerate the RNA-Seq compendium again.

erflynn commented 3 years ago

great! thank you - if you don't mind tagging me for future updates of this issue, I would love to make sure I stay up-to-date.

AlexsLemonade / refinebio

RNA-seq compendia missing samples for large studies #2211
