EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Data clean up prior to making cohort data available #1111

Closed ljwh2 closed 11 months ago

ljwh2 commented 1 year ago

The following data needs cleaning up in the cohort field (in both the Mongo and Oracle databases) before this data is made available to users.

  1. The following should be changed to 'UKB': UKBB, UK Biobank, UKbiobank

  2. 'multiple cohorts' should be 'multiple'

  3. The following should all be changed to 'RS': Rotterdam, Rotterdam Study I, Rotterdam Study II, Rotterdam Study III, RS_I, RS_II, RS_III, RS1, RS2, RS3, RSI, RSII, RSIII

  4. 'GenScot' should be 'GS:SFHS'

  5. Any commas within the cohort field should be changed to pipes

  6. Any forward slashes within the cohort field should be changed to pipes. The only exception would be for the study (PMID 33414549) which gives a URL containing slashes: ("see cohorts in the educational attainment (excluding 23andMe) and cognitive performance GWAS by Lee et al. 2018 https://doi.org/10.1038/s41588-018-0147-3"). It might be better to change the specific studies instead:
     (PMID 33547301 - 3 studies) - 'UKBB/BBJ'
     (GCST90179151) - 'FTC/NAG-FIN'

  7. Trailing spaces should be removed

  8. 'special characters' eg. 'Ôªø' should be removed

Note: Letter case should be disregarded when it comes to consolidating
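For illustration, the rules above could be sketched in Python roughly as follows. This is not the script that was actually run against the databases; `CANONICAL`, `RS_VARIANTS`, `URL_RE` and `clean_cohort_field` are hypothetical names. The sketch takes the blanket approach to slashes and simply skips any value containing a URL, rather than editing the specific studies. Collapsing repeated names after consolidation (e.g. 'RS|RS' to 'RS') is an assumption that is not stated in the issue.

```python
import re

# Hypothetical synonym table for the renames listed above. Keys are
# lower-cased because letter case is disregarded when consolidating.
CANONICAL = {
    "ukbb": "UKB",
    "uk biobank": "UKB",
    "ukbiobank": "UKB",
    "multiple cohorts": "multiple",
    "genscot": "GS:SFHS",
}
RS_VARIANTS = [
    "Rotterdam", "Rotterdam Study I", "Rotterdam Study II", "Rotterdam Study III",
    "RS_I", "RS_II", "RS_III", "RS1", "RS2", "RS3", "RSI", "RSII", "RSIII",
]
CANONICAL.update({v.lower(): "RS" for v in RS_VARIANTS})

URL_RE = re.compile(r"https?://\S+")


def clean_cohort_field(value: str) -> str:
    """Return a pipe-separated, consolidated version of one cohort field."""
    # Remove the byte-order mark, both as U+FEFF and in its mis-decoded form.
    value = value.replace("\ufeff", "").replace("Ôªø", "")
    # Values containing a URL keep their slashes (the PMID 33414549 case).
    if URL_RE.search(value):
        return value.strip()
    names = []
    # Commas and forward slashes are treated like the pipe separator.
    for part in re.split(r"[,/|]", value):
        name = part.strip()  # also removes trailing spaces
        if name:
            names.append(CANONICAL.get(name.lower(), name))
    # Assumption: drop repeats created by consolidation, keeping first-seen order.
    return "|".join(dict.fromkeys(names))
```

With this sketch, `clean_cohort_field("UKBB/BBJ")` gives `'UKB|BBJ'` and `clean_cohort_field("RS1|RS2,GenScot")` gives `'RS|GS:SFHS'`.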

Smaller issues, fix if possible:
'All New Diabetics In Scania (ANDIS)' should be 'ANDIS'
'Malmö Diet and Cancer (MDC)' should be 'MDC'
'Children's Hospital of Philadelphia (CHOP)' should be 'CHOP'
'Baependi Heart Study' should be 'Baependi'
The following should be changed to 'Estonia': Estonia_Chip, Estonia_WGS
'EstBB' should be changed to 'EB'
'Generation R' should be 'Generation_R'
The following should be 'GERA': GERA_EA_AFRchip, GERA_EUR_LATchip, The Resource for Genetic Epidemiology Research on Aging (GERA) Cohort
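The smaller renames could be handled with a case-insensitive lookup table along these lines. This is only a sketch: `_RAW`, `SECONDARY` and `consolidate_name` are hypothetical names, and the Unicode normalisation step is an assumption added so that 'Malmö' matches whether the 'ö' is stored as one character or as 'o' plus a combining mark.

```python
import unicodedata

# Hypothetical table of the secondary renames listed above.
_RAW = {
    "All New Diabetics In Scania (ANDIS)": "ANDIS",
    "Malmö Diet and Cancer (MDC)": "MDC",
    "Children's Hospital of Philadelphia (CHOP)": "CHOP",
    "Baependi Heart Study": "Baependi",
    "Estonia_Chip": "Estonia",
    "Estonia_WGS": "Estonia",
    "EstBB": "EB",
    "Generation R": "Generation_R",
    "GERA_EA_AFRchip": "GERA",
    "GERA_EUR_LATchip": "GERA",
    "The Resource for Genetic Epidemiology Research on Aging (GERA) Cohort": "GERA",
}


def _key(name: str) -> str:
    # Normalise to NFC and casefold so the lookup ignores letter case.
    return unicodedata.normalize("NFC", name.strip()).casefold()


SECONDARY = {_key(k): v for k, v in _RAW.items()}


def consolidate_name(name: str) -> str:
    """Map one cohort name to its short form; unknown names pass through trimmed."""
    return SECONDARY.get(_key(name), name.strip())
```

Names not in the table are returned unchanged apart from trimming, so the function can be applied to every entry of a pipe-separated cohort field.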

ljwh2 commented 1 year ago

Please refer to @earlEBI for any questions about the data

ljwh2 commented 1 year ago

Done in dev db and verified, needs re-running in prod

earlEBI commented 1 year ago

Two more secondary issues I noticed, if the other 'smaller issues' above are also going to be fixed:
'RS Study I' should be 'RS'
Special characters in 'The “European NAFLD Registry” Metacohort' should be removed
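Assuming the special characters meant here are the curly double quotes, a minimal Python sketch for removing them could look like this. `CURLY_DOUBLE_QUOTES` and `strip_curly_quotes` are hypothetical names. Curly single quotes are deliberately left alone, since they may be apostrophes in names such as 'Children's Hospital of Philadelphia'.

```python
# Translation table that deletes the left and right curly double quotes.
CURLY_DOUBLE_QUOTES = str.maketrans("", "", "\u201c\u201d")


def strip_curly_quotes(name: str) -> str:
    """Remove curly double quotes from a cohort name, leaving other text as is."""
    return name.translate(CURLY_DOUBLE_QUOTES)
```

Applied to the name above, this yields 'The European NAFLD Registry Metacohort'.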