AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Some samples with non-normalized IDs have snuck through #1855

Open cgreene opened 5 years ago

cgreene commented 5 years ago

Context

We have been building a human compendium, and it has required more RAM than makes sense. @kurtwheeler found that we were trying to build an initial matrix for filtering with more than 500k identifiers. We should have ~25-50k identifiers.

Problem or idea

Many of the identifiers are either unconverted gene identifiers or IDs that we don't support in our human compendia (e.g. mmu-miR-665-star_st, hp_hsa-mir-30d_x_st), along with quite a few SNP and other identifiers. See the triage sketch below.
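
A quick way to triage the offending rows is to flag anything that doesn't match the identifier type the human compendia are expected to contain. This is a minimal sketch assuming human Ensembl gene IDs (ENSG + 11 digits) are the supported format; the regex, function name, and example IDs are illustrative, not refine.bio's actual validation code.

```python
import re

# Assumed supported format: human Ensembl gene IDs, e.g. ENSG00000139618.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}$")

def flag_unsupported_ids(identifiers):
    """Return identifiers that don't look like human Ensembl gene IDs."""
    return [i for i in identifiers if not ENSEMBL_HUMAN_GENE.match(i)]

print(flag_unsupported_ids([
    "ENSG00000139618",        # supported gene ID
    "mmu-miR-665-star_st",    # mouse miRNA probe set ID
    "hp_hsa-mir-30d_x_st",    # unconverted miRNA probe set ID
    "rs4680",                 # SNP ID
]))
# -> ['mmu-miR-665-star_st', 'hp_hsa-mir-30d_x_st', 'rs4680']
```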

Solution or next step

We should figure out where these are coming from because they increase our RAM requirements roughly 10-fold. The mouse ones in particular are confusing. These identifiers do get pruned before the compendia are created, at the stage that drops genes appearing in very few samples, but it's massively inefficient to carry them that far (a stopgap filter is sketched below). We'll also want to reprocess the affected datasets with repaired code, because it appears that in some cases our identifier mapping code is not succeeding.
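
Until the upstream mapping bug is fixed, one stopgap could be to drop unrecognized rows at the same point where rarely-observed genes are pruned. This is a minimal pandas sketch under assumed conventions (genes-by-samples matrix, Ensembl gene IDs, an illustrative 1% presence threshold), not refine.bio's actual compendia-building code.

```python
import pandas as pd

def prune_matrix(expression_df, min_sample_frac=0.01):
    """Drop unsupported identifiers, then genes present in too few samples.

    expression_df: genes x samples DataFrame with NaN where a gene is absent.
    The identifier regex and the presence threshold are illustrative values.
    """
    supported = expression_df.index.str.match(r"^ENSG\d{11}$")
    pruned = expression_df[supported]

    min_samples = int(min_sample_frac * pruned.shape[1])
    present_in = pruned.notna().sum(axis=1)
    return pruned[present_in >= min_samples]
```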

cgreene commented 4 years ago

We should still figure out which samples these are, because they are likely to break any individual attempt to combine datasets with the smasher (an inner join will return nothing if they are combined with any other dataset).
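
A tiny illustration of that failure mode, assuming the combination step amounts to an inner join on the gene index (the sample names and values here are made up):

```python
import pandas as pd

# Sample whose identifiers were mapped to Ensembl gene IDs.
good = pd.DataFrame({"GSM_good": [5.1, 2.3]},
                    index=["ENSG00000139618", "ENSG00000141510"])

# Sample whose identifiers slipped through unconverted.
bad = pd.DataFrame({"GSM_bad": [1.7, 0.4]},
                   index=["mmu-miR-665-star_st", "hp_hsa-mir-30d_x_st"])

# An inner join keeps only shared row identifiers, so the result is empty.
combined = good.join(bad, how="inner")
print(combined.shape)  # (0, 2)
```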