Non-redundant cohort labels in the ancestry file

DSuveges commented 11 months ago

Hi Team, I have noticed that the COHORT(S) column in the ancestry files sometimes contains the same label multiple times. For example, for study GCST90002355 the cohort field looks like this:

Airwave|BBJ|BioME|BioME|BioME|CaPS|Estonia|Estonia|FHS|FINCAVAS|GERA|GERA|GERA|GERA|GERA|HANDLS|INTERVAL|JHS|MESA|MESA|MESA|MHIphase1|MHIphase2|SHIPNATREND|UKB|UKB|UKB|UKB|WH

To reproduce:

curl -s ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/2023/11/24/gwas-catalog-download-ancestries-v1.0.3.txt  \
    | grep GCST90002355 \
    | head -n1  \
    | cut -f1,18

It's not a blocker on our side, as I can de-duplicate them. However, I was wondering if the repeated labels represent some sort of extra information we might want to capture. Thanks.

ljwh2 commented 11 months ago

Hi Daniel, in this case the user submitted the duplicates. This happens sometimes. We don't know exactly why, but it could be that they copied them from another table where the samples were broken down by ancestry group. Normally the curator would remove the duplicates; I guess it wasn't done for these.

To directly answer your question: no, it doesn't represent any additional information, and you can safely de-duplicate them.

DSuveges commented 11 months ago

Thank you @ljwh2 for the reply. Your answer made us confident that it is fine to de-duplicate the cohort list. Thanks.
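
For anyone landing here with the same question, below is a minimal sketch of the de-duplication in shell. It assumes the labels are separated by `|`, that duplicates are exact string matches, and that the first occurrence of each label should be kept in its original order. The function name `dedup_cohorts` is hypothetical and not part of any GWAS Catalog tooling.

```shell
# Hypothetical helper: collapse repeated labels in a pipe-separated cohort
# field, keeping the first occurrence of each label in its original order.
dedup_cohorts() {
    printf '%s\n' "$1" \
        | tr '|' '\n' \
        | awk '!seen[$0]++' \
        | paste -sd '|' -
}

# Example using the start of the cohort field quoted above.
dedup_cohorts 'Airwave|BBJ|BioME|BioME|BioME|CaPS'
# prints: Airwave|BBJ|BioME|CaPS
```

The `awk '!seen[$0]++'` step prints a line only the first time it is seen, so the labels do not need to be sorted first and their order is left unchanged.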

EBISPOT / gwas-user-requests

Non-redundant cohort labels in the ancestry file #83