EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0
26 stars 19 forks source link

Data gathering task - COHORT field #485

Closed ljwh2 closed 2 years ago

ljwh2 commented 3 years ago

Curators need a data dump of information currently entered in the "cohort" field, for all studies including curated publications and submissions. Cohort data should be reported together with the study ID, submission ID, PMID and GCST if available.

jdhayhurst commented 2 years ago

hi @ljwh2 to help me identify the fields from the database, please could you send one example of the cohort field paired to it's study ID?

ljwh2 commented 2 years ago

@earlEBI could you provide this for James please?

earlEBI commented 2 years ago

@jdhayhurst Hi, an example: submission: https://www.ebi.ac.uk/gwas/deposition/submission/614bb4a209113d00018ac454

https://www.ebi.ac.uk/gwas/curation/studies/93017736 study id: 93017736 cohort: FinnDiane | FINRISK | FTC

jdhayhurst commented 2 years ago

There are two sources to query for this:

  1. the GOCI database:
    SELECT
    s.id,
    s.accession_id,
    p.pubmed_id,
    se.cohort
    FROM
    study s
    JOIN 
    publication p 
    ON 
    s.publication_id = p.id
    JOIN
    study_extension se
    ON
    s.id = se.study_id
    WHERE 
    s.accession_id is not NULL;
  2. The deposition db: used mongodb compass to export the entire study collection as a csv

Then, I read both exports in as pandas dataframes and merged the mongo export with the sql export on the accession ID.

python script:

import pandas as pd
# read in data
goci_df = pd.read_csv("goci_exp.csv")
depo_df = pd.read_csv("depe_exp.csv", usecols=["accession", "submissionId"])
# merge together
merged_df = pd.merge(goci_df, depo_df, how='left', left_on='ACCESSION_ID', right_on='accession')
# clean up headers and columns
merged_df = merged_df.drop(columns=["accession"])
merged_df = merged_df.rename(columns={'ID':'study_id', 'ACCESSION_ID': 'accession_id', 'PUBMED_ID': 'pmid', 'COHORT': 'cohort', 'submissionId': 'submission_id'})
# write to file
merged_df.to_csv("cohort_sub_id.tsv", index=False, sep="\t")