Closed ljwh2 closed 2 years ago
hi @ljwh2 to help me identify the fields from the database, please could you send one example of the cohort field paired to it's study ID?
@earlEBI could you provide this for James please?
@jdhayhurst Hi, an example: submission: https://www.ebi.ac.uk/gwas/deposition/submission/614bb4a209113d00018ac454
https://www.ebi.ac.uk/gwas/curation/studies/93017736 study id: 93017736 cohort: FinnDiane | FINRISK | FTC
There are two sources to query for this:
SELECT
s.id,
s.accession_id,
p.pubmed_id,
se.cohort
FROM
study s
JOIN
publication p
ON
s.publication_id = p.id
JOIN
study_extension se
ON
s.id = se.study_id
WHERE
s.accession_id is not NULL;
Then, I read both exports in as pandas dataframes and merged the mongo export with the sql export on the accession ID.
python script:
import pandas as pd
# read in data
goci_df = pd.read_csv("goci_exp.csv")
depo_df = pd.read_csv("depe_exp.csv", usecols=["accession", "submissionId"])
# merge together
merged_df = pd.merge(goci_df, depo_df, how='left', left_on='ACCESSION_ID', right_on='accession')
# clean up headers and columns
merged_df = merged_df.drop(columns=["accession"])
merged_df = merged_df.rename(columns={'ID':'study_id', 'ACCESSION_ID': 'accession_id', 'PUBMED_ID': 'pmid', 'COHORT': 'cohort', 'submissionId': 'submission_id'})
# write to file
merged_df.to_csv("cohort_sub_id.tsv", index=False, sep="\t")
Curators need a data dump of information currently entered in the "cohort" field, for all studies including curated publications and submissions. Cohort data should be reported together with the study ID, submission ID, PMID and GCST if available.