AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Some samples with non-normalized IDs have snuck through #1855

Open cgreene opened 5 years ago

cgreene commented 5 years ago

Context

We have been building a human compendium, and it has required more RAM than makes sense. @kurtwheeler found that we were trying to build an initial matrix for filtering with more than 500k identifiers. We should have ~25-50k identifiers.

Problem or idea

Many of the identifiers are either unconverted gene identifiers or IDs that we don't support in our human compendia (e.g. mmu-miR-665-star_st, hp_hsa-mir-30d_x_st), along with quite a few SNP and other identifiers. See the triage sketch below.
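
A quick way to triage the offending rows is to flag anything that doesn't match the identifier type the human compendia are expected to contain. This is a minimal sketch assuming human Ensembl gene IDs (ENSG + 11 digits) are the supported format; the regex, function name, and example IDs are illustrative, not refine.bio's actual validation code.

```python
import re

# Assumed supported format: human Ensembl gene IDs, e.g. ENSG00000139618.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")

def flag_unsupported_ids(identifiers):
    """Return identifiers that don't look like human Ensembl gene IDs."""
    return [i for i in identifiers if not ENSEMBL_HUMAN_GENE.match(i)]

print(flag_unsupported_ids([
    "ENSG00000139618",        # supported gene ID
    "mmu-miR-665-star_st",    # mouse miRNA probe set ID
    "hp_hsa-mir-30d_x_st",    # unconverted miRNA probe set ID
    "rs4680",                 # SNP ID
]))
# -> ['mmu-miR-665-star_st', 'hp_hsa-mir-30d_x_st', 'rs4680']
```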

Solution or next step

We should figure out where these are coming from because they increase our RAM requirements roughly 10-fold. The mouse ones in particular are confusing. These identifiers do get pruned before the compendia are created, at the stage that drops genes appearing in very few samples, but it's massively inefficient to carry them that far (a stopgap filter is sketched below). We'll also want to reprocess the affected datasets with repaired code, because it appears that in some cases our identifier mapping code is not succeeding.
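
Until the upstream mapping bug is fixed, one stopgap could be to drop unrecognized rows at the same point where rarely-observed genes are pruned. This is a minimal pandas sketch under assumed conventions (genes-by-samples matrix, Ensembl gene IDs, an illustrative 1% presence threshold), not refine.bio's actual compendia-building code.

```python
import pandas as pd

def prune_matrix(expression_df, min_sample_frac=0.01):
    """Drop unsupported identifiers, then genes present in too few samples.

    expression_df: genes x samples DataFrame with NaN where a gene is absent.
    The identifier regex and the presence threshold are illustrative values.
    """
    supported = expression_df.index.str.match(r"^ENSG\d{11}$")
    pruned = expression_df[supported]

    min_samples = int(min_sample_frac * pruned.shape[1])
    present_in = pruned.notna().sum(axis=1)
    return pruned[present_in >= min_samples]
```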

cgreene commented 4 years ago

We should still figure out which samples these are, because they are likely to break any individual attempt to combine datasets with the smasher (an inner join will return nothing if they are combined with any other dataset).
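
A tiny illustration of that failure mode, assuming the combination step amounts to an inner join on the gene index (the sample names and values here are made up):

```python
import pandas as pd

# Sample whose identifiers were mapped to Ensembl gene IDs.
good = pd.DataFrame({"GSM_good": [5.1, 2.3]},
                    index=["ENSG00000139618", "ENSG00000141510"])

# Sample whose identifiers slipped through unconverted.
bad = pd.DataFrame({"GSM_bad": [1.7, 0.4]},
                   index=["mmu-miR-665-star_st", "hp_hsa-mir-30d_x_st"])

# An inner join keeps only shared row identifiers, so the result is empty.
combined = good.join(bad, how="inner")
print(combined.shape)  # (0, 2)
```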