AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Mus Musculus metadata, mismatch with column names #2198

Open rando2 opened 4 years ago

rando2 commented 4 years ago

Context

I'm analyzing the Mus Musculus normalized expression compendium (downloaded on March 10, 2020; re-downloaded and confirmed the md5sums today, 3/19/2020). Because some characters in the TSV column names (e.g., spaces) cause problems in my downstream processing, I wanted to link each column name to an entry in the metadata file so that I could assign each sample a unique alphanumeric identifier.
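
For reference, the linking/renaming step I'm attempting looks roughly like the sketch below (pandas; the first column of the expression TSV is assumed to hold gene identifiers, and a plain character substitution like this would still need a uniqueness check afterwards):

```python
import re
import pandas as pd

# Header-only read to get the sample column names; the first column is assumed
# to hold gene/feature identifiers, so it is skipped here.
sample_cols = pd.read_csv("MUS_MUSCULUS.tsv", sep="\t", nrows=0).columns[1:]

# Map each original column name to an alphanumeric-plus-underscore version
# (names that differ only in punctuation could still collide).
safe_ids = {col: re.sub(r"[^0-9A-Za-z]", "_", col) for col in sample_cols}
```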

Problem or idea

The MUS_MUSCULUS.tsv dataset itself contains 228,708 sample columns, but both the metadata tsv and json files indicate 279,781 samples. When I tried to match the column names to the values in the "refinebio_accession_code" field of the metadata tsv, I could not identify metadata records for 168 of the column names. Conversely, 51,241 refinebio_accession_code values in the metadata tsv could not be matched to any column name.
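
The comparison I ran was essentially the following sketch (pandas; the metadata file name here is a placeholder for whichever TSV ships with the compendium):

```python
import pandas as pd

# Sample columns from the expression matrix (first column assumed to be gene IDs).
sample_cols = set(pd.read_csv("MUS_MUSCULUS.tsv", sep="\t", nrows=0).columns[1:])

# Accession codes from the metadata TSV (placeholder file name).
metadata = pd.read_csv("metadata_MUS_MUSCULUS.tsv", sep="\t")
accessions = set(metadata["refinebio_accession_code"])

cols_without_metadata = sample_cols - accessions   # 168 column names in my case
metadata_without_cols = accessions - sample_cols   # 51,241 accession codes

# The two differences reconcile the totals:
# 228,708 - 168 + 51,241 == 279,781
```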

Solution or next step

Given that 228,708 (# columns in the normalized expression data) + 51,241 (# metadata entries unmatched in the data) - 168 (# column names unmatched in the metadata) = 279,781 (# entries in the metadata), I wanted to ask whether 168 samples were dropped from the metadata files and 51,241 samples were dropped from the dataset (either intentionally or inadvertently). If you could advise which set of samples you'd recommend using in analyses, I'd be very grateful. Thank you!

kurtwheeler commented 4 years ago

Hi @rando2!! Thanks for the well-written issue; this is interesting!

It sounds like we've got some investigation to do. It doesn't surprise me that ~51k samples were dropped from the dataset; we have some filters in the process that drop samples for a few reasons. However, the 168 samples missing from the metadata are surprising. I don't know why that would happen. We'll look into it.

In the meantime, perhaps you could just drop those 168 samples from the dataset and use what remains?
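
Something along these lines should do it, roughly (pandas; the metadata file name is a placeholder for the TSV included in your download, and loading only the matching columns keeps memory manageable):

```python
import pandas as pd

# Placeholder metadata file name; refinebio_accession_code is the field from this issue.
metadata = pd.read_csv("metadata_MUS_MUSCULUS.tsv", sep="\t")
accessions = set(metadata["refinebio_accession_code"])

# Read only the header, then reload just the columns that have a metadata record
# (plus the first column, assumed to hold the gene identifiers).
header = list(pd.read_csv("MUS_MUSCULUS.tsv", sep="\t", nrows=0).columns)
keep = [header[0]] + [c for c in header[1:] if c in accessions]
expr = pd.read_csv("MUS_MUSCULUS.tsv", sep="\t", usecols=keep, index_col=0)
```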

In the time since we generated those compendia, we've added logic to generate a metadata file listing the samples we drop from the dataset and why. We've also discovered that a lot of the data processed by the researchers who conducted the original studies is pretty noisy, so we were already planning to regenerate these compendia soon anyway. We'll get our ducks in a row to rerun them soon, and perhaps the new sample-dropping-tracking logic will cover this issue. If not, we'll dig further.

rando2 commented 4 years ago

@kurtwheeler This is super helpful information, thank you so much! I'll just drop those 168 samples for now (and watch out for bead chips!) and keep an eye out for new versions of the compendium. This is my first time working on a compendium and I'm truly amazed at how you've been able to organize so much data so elegantly.