TheJacksonLaboratory / ExperimentalModelSchema

Experimental Model Schema
https://thejacksonlaboratory.github.io/ExperimentalModelSchema/
MIT License

MGI Strain ID available for only 40% of the strains in MPD #9

Open hansenp opened 1 year ago

hansenp commented 1 year ago

In the section "EMS building blocks - Strain", two example strain messages are shown:

{
    "strainType": {
        "id": "MGI:2670463",
        "label": "C57BL/A"
    },
    "strainAttribute": ["INBRED_STRAIN"]
}

and

{
    "strainType": {
        "id": "MGI:4839003",
        "label": "B6.Cg-Tg(Myh6-Nox4)1Ams"
    },
    "strainAttribute": ["CONGENIC", "MUTANT_STRAIN", "TRANSGENIC"]
}

In both examples, an ID from MGI is specified as the strain ID. The Mouse Phenome Database (MPD) uses its own internal strain IDs and only 40% of the strains in MPD also have a strain ID from MGI. One of the goals of this project is to represent the entire MPD through EMS packages. How do we want to handle the 60% of cases where only an internal MPD Strain ID is available?
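To make the question concrete, here is a minimal sketch of one possible fallback, not a decision the project has made. It builds a strain message in the shape of the two examples above and uses the MGI ID when one exists. The function name `build_strain_message` and the `MPD:` prefix for internal IDs are assumptions for illustration only; the schema does not define them.

```python
import math


def build_strain_message(label, mgi_id=None, mpd_strain_id=None, attributes=()):
    """Build a strain dict shaped like the EMS examples, preferring an MGI ID."""
    # Treat None and NaN (how pandas marks a missing cell) as "no MGI ID".
    is_nan = isinstance(mgi_id, float) and math.isnan(mgi_id)
    if mgi_id is not None and not is_nan:
        strain_id = str(mgi_id)
    elif mpd_strain_id is not None:
        # Hypothetical prefix for internal MPD strain IDs (an assumption).
        strain_id = f"MPD:{mpd_strain_id}"
    else:
        raise ValueError("need either an MGI ID or an MPD strain ID")
    return {
        "strainType": {"id": strain_id, "label": label},
        "strainAttribute": list(attributes),
    }


# A strain that is linked to MGI (values taken from the first example above).
linked = build_strain_message(
    "C57BL/A", mgi_id="MGI:2670463", attributes=["INBRED_STRAIN"]
)

# A strain with only an internal MPD ID (label and ID are made up).
unlinked = build_strain_message("example strain", mpd_strain_id=123)
```

Keeping the fallback in one place would make it easy to swap in a different convention later, for example a placeholder MGI ID.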

pnrobinson commented 1 year ago

We should create a list of requirements for moving this project forward. We will not be able to do this curation on our own, but we can create a list of things that will require manual curation. For now, we can put in some MGI ID and use the MPD information for the label. I wonder whether there is a general MGI ID we could use for this (a superclass of all of the others?).

sbello commented 7 months ago

It might be useful to prepare a list of MPD strains that are not in MGI and to document the method used to determine that a strain is not in MGI. The list could then be sent to MGI for review, so that the missing strains are either added to MGI or mapped to existing strains.

hansenp commented 5 months ago

The following code creates a table containing all strains from MPD that are not linked to MGI:

# Download the straininfo.csv file (the leading "!" runs a shell command in Jupyter/IPython)
!curl https://phenomedoc-prod.jax.org/MPD_downloads/straininfo.csv > straininfo.csv
# Read the file into a pandas DataFrame
import pandas as pd
df_strains = pd.read_csv('straininfo.csv')
# Keep only the columns of interest
df_strains = df_strains.loc[:, ['mpd_strainid','strainname','mginum', 'url', 'vendor', 'stocknum','straintype']]
# Get all rows that have no mginum
df_strains_wo_mgi = df_strains[df_strains['mginum'].isnull()]
# Write to a tab-separated file
df_strains_wo_mgi.to_csv('straininfo_wo_mgi.tsv', sep='\t', index=False)
# Display the resulting table (notebook output)
df_strains_wo_mgi

1600 out of a total of 4616 strains in MPD are not linked to MGI. A table with a complete list of unlinked strains is attached. straininfo_wo_mgi.csv
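For reference, the counts above can be turned into percentages with a few lines of arithmetic. This is only a sketch: the helper name `percent` is made up, and the two numbers are copied from the sentence above. The result differs from the 40% linked figure in the issue title, and the thread does not explain the difference.

```python
def percent(part, total):
    """Return part as a percentage of total."""
    return 100 * part / total


unlinked, total = 1600, 4616   # counts reported above
linked = total - unlinked      # strains that do have an MGI ID

print(f"unlinked: {percent(unlinked, total):.1f}%")
print(f"linked:   {percent(linked, total):.1f}%")
```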

hansenp commented 5 months ago

straininfo_wo_mgi.csv

sbello commented 5 months ago

Thank you! I've passed this on to Cindy to see what MGI may be able to do.

cindyJax commented 3 months ago

We have reviewed all the MPD strains and find that most are indeed in MGI. There remain questions about the matching methodology. Are you matching by IDs or by nomenclature? MPD strains have a number of nomenclature issues that may result in strains not being found unless strain synonyms are also considered. For example, one common issue is the use of F1 abbreviations instead of the complete cross. One large set of strains that is not in MGI is the "PreCC" cohorts. These are not fully inbred strains, but if there is data associated with these sets, we can probably enter them in MGI. There are also a handful of private, reserved strains in the list; these do not appear to be public in MPD.
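As a rough illustration of matching by nomenclature with synonyms, the sketch below normalizes names and looks them up in a table that holds both official names and synonyms. Everything in it is an assumption for illustration: the helper names, the record layout, the synonym assignment, and the placeholder ID `MGI:0000000` are invented. Only the `C57BL/A` entry is taken from this thread. It is not MGI's actual matching procedure.

```python
import re


def normalize(name):
    """Lowercase and drop whitespace so trivially different spellings match."""
    return re.sub(r"\s+", "", name).lower()


def build_lookup(mgi_records):
    """Map every normalized official name and synonym to its MGI ID."""
    lookup = {}
    for rec in mgi_records:
        for name in [rec["name"], *rec.get("synonyms", [])]:
            # Keep the first ID seen if two records share a normalized name.
            lookup.setdefault(normalize(name), rec["id"])
    return lookup


def match_strain(mpd_name, lookup):
    """Return the MGI ID for an MPD strain name, or None if nothing matches."""
    return lookup.get(normalize(mpd_name))


# Toy records; the second ID is a placeholder, not a real MGI accession.
MGI_RECORDS = [
    {"id": "MGI:2670463", "name": "C57BL/A"},
    {
        "id": "MGI:0000000",
        "name": "(C57BL/6J x DBA/2J)F1",
        "synonyms": ["B6D2F1"],  # F1 abbreviation treated as a synonym here
    },
]
LOOKUP = build_lookup(MGI_RECORDS)

print(match_strain("B6D2F1", LOOKUP))      # found through the synonym
print(match_strain("c57bl/a", LOOKUP))     # found despite different case
print(match_strain("PreCC-001", LOOKUP))   # made-up name with no match
```

Strains that still return `None` after synonym matching would be the ones worth sending for manual review.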