Closed by kurtwheeler 4 years ago
Does this mean that one compendium can be associated with more than one organism? Right now they are `ComputedFiles`, which can only have a single organism.
Talked with @kurtwheeler, and we came up with a plan to solve this (the answer to the previous question is yes).
If it does then we should scream about it so we know these mixed cases exist.
@kurtwheeler do we need to scream (log an error)? I checked on prod and this seems to happen for a lot of organisms. For example, `GORILLA`'s biggest platform is `IlluminaHiSeq2000`, which is also used by `HOMO_SAPIENS`.
These are a few other organisms whose biggest platform is also `IlluminaHiSeq2000`:
- `PETROMYZON_MARINUS`
- `ANAS_PLATYRHYNCHOS`
- `CIONA_SAVIGNYI`
- `ASTYANAX_MEXICANUS`
- `GADUS_MORHUA`
The platform/machine for RNA-seq doesn't tell us anything about the organism. `IlluminaHiSeq2000` is a sequencing machine and is therefore general. I think @kurtwheeler is using "biggest platform" in the same sense that we use it when generating QN targets: the Affymetrix microarray platform.
@jaclyn-taroni thanks! I updated the condition; now it's getting the biggest platform for samples where `has_raw=True`, `technology="MICROARRAY"`, `is_processed=True`. Does this list make more sense?
Organism | Biggest platform | Other Organism also using this platform |
---|---|---|
ANAS_PLATYRHYNCHOS | chicken | GALLUS_GALLUS |
CHLOROCEBUS_SABAEUS | rhesus | MACACA_MULATTA |
CAPRA_HIRCUS | hgu133a | HOMO_SAPIENS |
MUS | mogene20st | MUS_MUSCULUS |
MUS_MUSCULUS | mouse4302 | HOMO_SAPIENS |
It does make a bit more sense to me -- mallard being measured on chicken, green monkey being measured on rhesus. But I'm not sure we want to give users domestic goat measured on human. If these are all the cases, this is a small enough number that we can make decisions "manually" in conjunction with how many samples meet these conditions (e.g., how many human samples are measured on `mouse4302`), provided we can easily get that information...
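To make the two lookups in this exchange concrete, here is a minimal sketch in plain Python of the "biggest platform" query and the per-platform sample counts. The dict-based sample records, the helper names (`make`, `eligible`, `platform_counts`, `biggest_platforms`), the `"RNA-SEQ"` value, and all the counts are assumptions for illustration; only the three filter conditions come from the comment above, and this is not the actual refinebio query.

```python
from collections import Counter, defaultdict


def make(organism, platform, technology="MICROARRAY", n=1):
    """Build n hypothetical sample records (made-up data, not from prod)."""
    return [
        {
            "organism": organism,
            "platform": platform,
            "technology": technology,
            "has_raw": True,
            "is_processed": True,
        }
        for _ in range(n)
    ]


def eligible(sample):
    """Mirror the filter quoted in the thread: raw, processed, microarray."""
    return (
        sample["has_raw"]
        and sample["is_processed"]
        and sample["technology"] == "MICROARRAY"
    )


def platform_counts(samples):
    """Count eligible samples per (organism, platform)."""
    counts = defaultdict(Counter)
    for sample in samples:
        if eligible(sample):
            counts[sample["organism"]][sample["platform"]] += 1
    return counts


def biggest_platforms(samples):
    """Map each organism to the platform with the most eligible samples."""
    return {
        organism: per_platform.most_common(1)[0][0]
        for organism, per_platform in platform_counts(samples).items()
    }


samples = (
    make("GALLUS_GALLUS", "chicken", n=3)
    + make("ANAS_PLATYRHYNCHOS", "chicken", n=2)
    # RNA-seq samples are ignored by the filter, however many there are.
    + make("ANAS_PLATYRHYNCHOS", "IlluminaHiSeq2000", technology="RNA-SEQ", n=5)
    + make("HOMO_SAPIENS", "hgu133a", n=4)
    + make("HOMO_SAPIENS", "mouse4302", n=1)
)

counts = platform_counts(samples)
biggest = biggest_platforms(samples)
```

With counts like these in hand, a question such as "how many human samples are measured on `mouse4302`" becomes a single lookup, `counts["HOMO_SAPIENS"]["mouse4302"]`, which is the information needed to make the manual decisions.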
We want to update the algorithm that groups the organisms that should be merged when creating the compendiums:
- Remove the QN Targets that should not exist https://github.com/AlexsLemonade/refinebio/issues/1757
- Only consider organisms with QN Targets
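One way the grouping could work is sketched below. This is an assumption about the approach, not the code that was actually run: each organism with a QN target becomes the primary of a group, and organisms without a QN target join the group whose primary shares their biggest platform. The function name `group_organisms`, its inputs, and the platform names in the example are hypothetical.

```python
def group_organisms(biggest_platform, qn_target_organisms):
    """Return one group per QN-target organism, primary organism first.

    biggest_platform: dict mapping organism name -> biggest platform name.
    qn_target_organisms: organisms that have a QN target, in output order.
    """
    primaries = set(qn_target_organisms)
    groups = []
    for primary in qn_target_organisms:
        platform = biggest_platform.get(primary)
        members = []
        if platform is not None:
            # Organisms without their own QN target that share the platform.
            members = sorted(
                organism
                for organism, other_platform in biggest_platform.items()
                if organism not in primaries and other_platform == platform
            )
        groups.append([primary] + members)
    return groups


# Placeholder platform names; the organism names appear in the thread.
biggest_platform = {
    "BOS_TAURUS": "bovine",
    "BOS_INDICUS": "bovine",
    "BOS_GRUNNIENS": "bovine",
    "SUS_SCROFA": "porcine",
    "SUS_SCROFA_DOMESTICUS": "porcine",
    "DANIO_RERIO": "zebrafish",
}
qn_target_organisms = ["BOS_TAURUS", "SUS_SCROFA", "DANIO_RERIO"]

groups = group_organisms(biggest_platform, qn_target_organisms)
for group in groups:
    print(group)
```

Putting the primary organism first and sorting the rest alphabetically is a choice made here for readability; an organism whose biggest platform matches no QN-target organism is simply left out of every group.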
@jaclyn-taroni @jashapiro @cgreene I ran the updated algorithm and got the following organism groups for the compendiums, do they look reasonable?
['HOMO_SAPIENS']
['MUS_MUSCULUS', 'MUS_CAROLI', 'MUS_MUSCULUS_CASTANEUS', 'MUS_MUSCULUS_DOMESTICUS', 'MUS_MUSCULUS_MUSCULUS', 'MUS_MUSCULUS_MUSCULUS_X_M._M._CASTANEUS', 'MUS_MUSCULUS_MUSCULUS_X_M._M._DOMESTICUS', 'MUS_MUSCULUS_X_MUS_SPRETUS', 'MUS_SP.', 'MUS_SPRETUS']
['RATTUS_NORVEGICUS', 'RATTUS_NORVEGICUS_ALBUS', 'RATTUS_RATTUS']
['DROSOPHILA_MELANOGASTER', 'DROSOPHILA_MAURITIANA', 'DROSOPHILA_SANTOMEA', 'DROSOPHILA_SECHELLIA', 'DROSOPHILA_SIMULANS', 'DROSOPHILA_TEISSIERI', 'DROSOPHILA_YAKUBA']
['MACACA_MULATTA', 'MACACA_FASCICULARIS', 'MACACA_FUSCATA', 'MACACA_NEMESTRINA', 'MACACA_RADIATA']
['ARABIDOPSIS_THALIANA', 'ARABIDOPSIS_HALLERI', 'ARABIDOPSIS_HALLERI_SUBSP._GEMMIFERA', 'ARABIDOPSIS_LYRATA', 'ARABIDOPSIS_LYRATA_SUBSP._LYRATA', 'ARABIDOPSIS_LYRATA_SUBSP._PETRAEA', 'ARABIDOPSIS_THALIANA_X_ARABIDOPSIS_HALLERI_SUBSP._GEMMIFERA', 'ARABIDOPSIS_THALIANA_X_ARABIDOPSIS_LYRATA']
['GALLUS_GALLUS']
['DANIO_RERIO']
['SACCHAROMYCES_CEREVISIAE', 'SACCHAROMYCES_BAYANUS', 'SACCHAROMYCES_BOULARDII', 'SACCHAROMYCES_CEREVISIAE_BY4741', 'SACCHAROMYCES_CEREVISIAE_CEN.PK113-7D', 'SACCHAROMYCES_CEREVISIAE_EC1118', 'SACCHAROMYCES_CEREVISIAE_S288C', 'SACCHAROMYCES_CEREVISIAE_SK1', 'SACCHAROMYCES_CEREVISIAE_VIN13', 'SACCHAROMYCES_CEREVISIAE_X_SACCHAROMYCES_KUDRIAVZEVII', 'SACCHAROMYCES_PASTORIANUS', 'SACCHAROMYCES_PASTORIANUS_WEIHENSTEPHAN_34/70', 'SACCHAROMYCES_UVARUM']
['BOS_TAURUS', 'BOS_GRUNNIENS', 'BOS_INDICUS']
['ORYZA_SATIVA', 'ORYZA_LONGISTAMINATA', 'ORYZA_SATIVA_INDICA_GROUP', 'ORYZA_SATIVA_JAPONICA']
['ZEA_MAYS']
['SUS_SCROFA', 'SUS_SCROFA_DOMESTICUS']
['CAENORHABDITIS_ELEGANS']
['GLYCINE_MAX', 'GLYCINE_SOJA']
['OVIS_ARIES']
['EQUUS_CABALLUS']
['VITIS_VINIFERA', 'VITIS_AESTIVALIS', 'VITIS_CINEREA_VAR._HELLERI_X_VITIS_RIPARIA', 'VITIS_CINEREA_VAR._HELLERI_X_VITIS_RUPESTRIS', 'VITIS_CINEREA_VAR._HELLERI_X_VITIS_VINIFERA', 'VITIS_HYBRID_CULTIVAR', 'VITIS_RIPARIA', 'VITIS_ROTUNDIFOLIA']
['SCHIZOSACCHAROMYCES_POMBE', 'SCHIZOSACCHAROMYCES_POMBE_972H-']
['POPULUS_TRICHOCARPA', 'POPULUS_ALBA', 'POPULUS_BALSAMIFERA', 'POPULUS_DELTOIDES', 'POPULUS_EUPHRATICA', 'POPULUS_FREMONTII_X_POPULUS_ANGUSTIFOLIA', 'POPULUS_MAXIMOWICZII_X_POPULUS_NIGRA', 'POPULUS_NIGRA', 'POPULUS_SIMONII', 'POPULUS_SP.', "POPULUS_SP._CV._'OKANESE'", "POPULUS_SP._CV._'WALKER'", 'POPULUS_TOMENTOSA', 'POPULUS_TREMULA', 'POPULUS_TREMULA_X_POPULUS_ALBA', 'POPULUS_TREMULA_X_POPULUS_TREMULOIDES', 'POPULUS_TREMULOIDES', 'POPULUS_TRICHOCARPA_X_POPULUS_DELTOIDES', 'POPULUS_X_CANADENSIS']
['XENOPUS_LAEVIS', 'XENOPUS_BOREALIS', 'XENOPUS_LAEVIS_X_XENOPUS_BOREALIS', 'XENOPUS_LAEVIS_X_XENOPUS_MUELLERI', 'XENOPUS_MUELLERI']
['ANOPHELES_GAMBIAE']
['ESCHERICHIA_COLI', 'ESCHERICHIA_COLI_8624', 'ESCHERICHIA_COLI_APEC_O2', 'ESCHERICHIA_COLI_BW25113', 'ESCHERICHIA_COLI_B_STR._REL606', 'ESCHERICHIA_COLI_CFT073', 'ESCHERICHIA_COLI_K-12', 'ESCHERICHIA_COLI_O08', 'ESCHERICHIA_COLI_O157', 'ESCHERICHIA_COLI_SCI-07', 'ESCHERICHIA_COLI_STR._K-12_SUBSTR._DH10B', 'ESCHERICHIA_COLI_STR._K-12_SUBSTR._MC4100', 'ESCHERICHIA_COLI_STR._K-12_SUBSTR._MG1655', 'ESCHERICHIA_COLI_STR._K-12_SUBSTR._W3110', 'ESCHERICHIA_COLI_UTI89']
['TRITICUM_AESTIVUM', 'TRITICUM_CARTHLICUM', 'TRITICUM_MONOCOCCUM', 'TRITICUM_TURGIDUM', 'TRITICUM_TURGIDUM_SUBSP._DICOCCOIDES', 'TRITICUM_TURGIDUM_SUBSP._DURUM']
['MUSTELA_PUTORIUS_FURO']
['PSEUDOMONAS_AERUGINOSA', 'PSEUDOMONAS_AERUGINOSA_PA14', 'PSEUDOMONAS_AERUGINOSA_PAHM4', 'PSEUDOMONAS_AERUGINOSA_PAO1', 'PSEUDOMONAS_AERUGINOSA_TBCF10839', 'PSEUDOMONAS_AERUGINOSA_UCBPP-PA14', 'PSEUDOMONAS_PUTIDA']
['GOSSYPIUM_HIRSUTUM', 'GOSSYPIUM_ARBOREUM', 'GOSSYPIUM_BARBADENSE', 'GOSSYPIUM_HERBACEUM']
['HORDEUM_VULGARE', 'HORDEUM_VULGARE_SUBSP._SPONTANEUM']
['STAPHYLOCOCCUS_AUREUS', 'STAPHYLOCOCCUS_AUREUS_SUBSP._AUREUS_MU50', 'STAPHYLOCOCCUS_AUREUS_SUBSP._AUREUS_N315', 'STAPHYLOCOCCUS_AUREUS_SUBSP._AUREUS_RN4220', 'STAPHYLOCOCCUS_AUREUS_SUBSP._AUREUS_STR._NEWMAN', 'STAPHYLOCOCCUS_AUREUS_SUBSP._AUREUS_USA300']
['CITRUS_SINENSIS', 'CITRUS_CLEMENTINA', 'CITRUS_LIMON', 'CITRUS_MAXIMA', 'CITRUS_RETICULATA', 'CITRUS_RETICULATA_X_CITRUS_TRIFOLIATA', 'CITRUS_UNSHIU', 'CITRUS_X_PARADISI', 'CITRUS_X_TANGELO']
['LEPIDIUM_SATIVUM']
@arielsvn LGTM
I agree - there are some things here that some folks might think are too far, but they can always filter those samples if they feel particularly strongly. Looks good!
All of these organisms will be stored in a `ComputedFileAnnotation`.
Context
We have a compendium for `BOS_INDICUS` despite the fact that all the samples for it were collected using the platform for `BOS_TAURUS`.
Problem or idea
We should collapse organisms to a single compendium when they share a platform.
Solution or next step
So it's not immediately obvious how we should implement this. The reason is that when we build the dataset in the foreman management command `create_compendia.py`, we don't look at platforms. We just add the accession code for every sample associated with the dataset and let the compendium figure out what to do. I guess that while we're doing that we'll have to: