cgreene opened 5 years ago
We should still figure out which samples these are, because they are likely to break any attempt to combine datasets with the smasher (an inner join will return no rows if they are combined with any other dataset).
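For concreteness, here is a minimal sketch (plain pandas, not the smasher's actual implementation) of why the inner join comes back empty: the row indices of the two frames are disjoint, so no identifier survives the join. The sample names and values are invented for illustration:

```python
import pandas as pd

# Dataset A: rows are Ensembl gene IDs, as expected in a human compendium.
a = pd.DataFrame(
    {"sample_1": [5.2, 3.1], "sample_2": [4.8, 2.9]},
    index=["ENSG00000139618", "ENSG00000141510"],
)

# Dataset B: rows are unconverted probe IDs like the ones reported below.
b = pd.DataFrame(
    {"sample_3": [1.4, 7.7]},
    index=["mmu-miR-665-star_st", "hp_hsa-mir-30d_x_st"],
)

# An inner join keeps only row identifiers present in both frames,
# so the result is empty when the index sets are disjoint.
combined = a.join(b, how="inner")
print(combined.shape)  # (0, 3): zero genes survive, all columns kept
```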
Context
We have been building a compendium for human, and it has required more RAM than makes sense. @kurtwheeler found that we were building an initial matrix for filtering with more than 500k identifiers, when we should have roughly 25-50k.
Problem or idea
Many of the identifiers are either unconverted gene identifiers or IDs that we otherwise don't support in our human compendia, for example:
mmu-miR-665-star_st
hp_hsa-mir-30d_x_st
as well as quite a few SNP and other identifiers.
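As a rough first pass at telling supported identifiers apart from strays, a pattern check like the following could work. This is a sketch under the assumption that human compendium rows should be Ensembl gene IDs (ENSG followed by eleven digits), which may not cover every identifier type the pipeline legitimately emits:

```python
import re

# Assumption: valid rows in the human compendium are Ensembl gene IDs.
ENSEMBL_GENE = re.compile(r"^ENSG\d{11}$")

def is_supported(identifier: str) -> bool:
    """Return True for identifiers we expect in a human compendium."""
    return bool(ENSEMBL_GENE.match(identifier))

examples = [
    "ENSG00000139618",      # fine
    "mmu-miR-665-star_st",  # mouse miRNA probe, should not be here
    "hp_hsa-mir-30d_x_st",  # unconverted probe set ID
]
for ident in examples:
    print(ident, is_supported(ident))
```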
Solution or next step

We should figure out where these are coming from, because they increase our RAM requirements roughly tenfold. The mouse identifiers in particular are confusing in a human compendium. These identifiers do get pruned before the compendia are created, at the stage that drops genes appearing in very few samples, but it is massively inefficient to carry them that far through the pipeline. We'll also want to reprocess the affected datasets with repaired code, because it appears that in certain cases our identifier mapping code is not successful.
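To attribute the strays to specific samples, one could scan each processed sample's identifier column and count rows that fail the pattern check above. This is a minimal sketch assuming a hypothetical layout of one TSV per sample with identifiers in the first column; the directory name and file format are illustrative, not the pipeline's actual layout:

```python
from collections import Counter
from pathlib import Path

import pandas as pd

# Hypothetical layout: one TSV per processed sample, identifiers in column 0.
SAMPLE_DIR = Path("processed_samples")

bad_counts = Counter()
for tsv in SAMPLE_DIR.glob("*.tsv"):
    ids = pd.read_csv(tsv, sep="\t", usecols=[0]).iloc[:, 0]
    n_bad = int((~ids.astype(str).str.match(r"^ENSG\d{11}$")).sum())
    if n_bad:
        bad_counts[tsv.name] = n_bad

# Samples with the most unsupported identifiers are the likely culprits.
for name, count in bad_counts.most_common(20):
    print(f"{name}\t{count} unsupported identifiers")
```

Samples that surface here would be the first candidates for reprocessing once the identifier mapping code is repaired.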