Data gathering task - COHORT field

ljwh2 commented 3 years ago

Curators need a data dump of information currently entered in the "cohort" field, for all studies including curated publications and submissions. Cohort data should be reported together with the study ID, submission ID, PMID and GCST if available.

jdhayhurst commented 2 years ago

hi @ljwh2 to help me identify the fields from the database, please could you send one example of the cohort field paired to it's study ID?

ljwh2 commented 2 years ago

@earlEBI could you provide this for James please?

earlEBI commented 2 years ago

@jdhayhurst Hi, an example: submission: https://www.ebi.ac.uk/gwas/deposition/submission/614bb4a209113d00018ac454

https://www.ebi.ac.uk/gwas/curation/studies/93017736 study id: 93017736 cohort: FinnDiane | FINRISK | FTC

jdhayhurst commented 2 years ago

There are two sources to query for this:

the GOCI database:

SELECT
s.id,
s.accession_id,
p.pubmed_id,
se.cohort
FROM
study s
JOIN 
publication p 
ON 
s.publication_id = p.id
JOIN
study_extension se
ON
s.id = se.study_id
WHERE 
s.accession_id is not NULL;

The deposition db: used mongodb compass to export the entire study collection as a csv

Then, I read both exports in as pandas dataframes and merged the mongo export with the sql export on the accession ID.

python script:

import pandas as pd
# read in data
goci_df = pd.read_csv("goci_exp.csv")
depo_df = pd.read_csv("depe_exp.csv", usecols=["accession", "submissionId"])
# merge together
merged_df = pd.merge(goci_df, depo_df, how='left', left_on='ACCESSION_ID', right_on='accession')
# clean up headers and columns
merged_df = merged_df.drop(columns=["accession"])
merged_df = merged_df.rename(columns={'ID':'study_id', 'ACCESSION_ID': 'accession_id', 'PUBMED_ID': 'pmid', 'COHORT': 'cohort', 'submissionId': 'submission_id'})
# write to file
merged_df.to_csv("cohort_sub_id.tsv", index=False, sep="\t")

EBISPOT / goci

Data gathering task - COHORT field #485