EBISPOT / gwas-user-requests

Repository to collect user requests and bug reports for the GWAS Catalog

Duplicate data in trait mappings file due to invisible UTF-8 character. #87

Open spiros opened 1 month ago

spiros commented 1 month ago

I was trying to load gwas_catalog_trait-mappings_r2024-07-27.tsv into a MySQL table and the uniqueness constraint kept failing.

I tracked the issue down to a duplicate entry for Complement C5 levels (http://www.ebi.ac.uk/efo/EFO_0020278): one row has a <FEFF> character (U+FEFF, inside the word 'Complement') and the other one does not:

Complement C5 levels    complement C5 measurement       http://www.ebi.ac.uk/efo/EFO_0020278    Other measurement       http://www.ebi.ac.uk/efo/EFO_0001444
C<feff>omplement C5 levels      complement C5 measurement       http://www.ebi.ac.uk/efo/EFO_0020278    Other measurement       http://www.ebi.ac.uk/efo/EFO_0001444
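
For anyone who wants to reproduce this, here is a minimal Python sketch of how such hidden duplicates could be found. The function name `find_hidden_duplicates` and the in-memory sample rows are my own illustration, not part of the GWAS Catalog tooling; the sample contains only the first three columns of the two rows above.

```python
from collections import defaultdict

BOM = "\ufeff"  # U+FEFF: zero width no-break space / byte order mark


def find_hidden_duplicates(rows):
    """Group rows that become identical once U+FEFF is removed.

    Hypothetical helper for illustration: `rows` is any iterable of
    tuples of strings, e.g. the parsed lines of the TSV file.
    """
    groups = defaultdict(list)
    for row in rows:
        cleaned = tuple(field.replace(BOM, "") for field in row)
        groups[cleaned].append(row)
    # Keep only the cleaned keys that more than one original row maps to.
    return {key: originals for key, originals in groups.items() if len(originals) > 1}


# Sample rows modelled on the two lines quoted above (first three columns only).
rows = [
    ("Complement C5 levels", "complement C5 measurement",
     "http://www.ebi.ac.uk/efo/EFO_0020278"),
    ("C\ufeffomplement C5 levels", "complement C5 measurement",
     "http://www.ebi.ac.uk/efo/EFO_0020278"),
]

duplicates = find_hidden_duplicates(rows)
for cleaned, originals in duplicates.items():
    print(f"{cleaned[0]!r} appears {len(originals)} times after cleaning")
```

The two rows compare as different strings until the character is stripped, which is presumably why they look identical on screen but still collide once the database or a cleaning step normalises them.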

Upon inspection, <FEFF> is present in other values in the file as well, e.g.:

Total cholesterol in IDL meal response (<feff>OrNLSr)
Triglycerides levels in very large VLDL meal response (<feff>OrNLSr)
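
A rough way to list every affected line is sketched below. This is only an illustration: `find_bom_lines` is a hypothetical helper, and the sample text stands in for the real file so that the snippet runs on its own.

```python
import io


def find_bom_lines(handle):
    """Return (line_number, visible_text) for each line containing U+FEFF.

    Hypothetical helper: `handle` is any iterable of text lines, such as
    an open file. The invisible character is shown as the literal text
    "<FEFF>" so that it can be spotted in the output.
    """
    hits = []
    for line_number, line in enumerate(handle, start=1):
        if "\ufeff" in line:
            visible = line.rstrip("\n").replace("\ufeff", "<FEFF>")
            hits.append((line_number, visible))
    return hits


# Stand-in for the real file, built from the values quoted in this report.
sample = io.StringIO(
    "Complement C5 levels\n"
    "Total cholesterol in IDL meal response (\ufeffOrNLSr)\n"
    "Triglycerides levels in very large VLDL meal response (\ufeffOrNLSr)\n"
)

hits = find_bom_lines(sample)
for line_number, visible in hits:
    print(line_number, visible)
```

To run it against the actual download, the `io.StringIO` sample could be replaced with `open("gwas_catalog_trait-mappings_r2024-07-27.tsv", encoding="utf-8")`, assuming the file is UTF-8 encoded.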