EBISPOT / goci

GWAS Catalog Ontology and Curation Infrastructure
Apache License 2.0

Handling Non-Standard Filename Patterns for Metadata YAML Generation #1330

Closed karatugo closed 1 month ago

karatugo commented 1 month ago

Description

Metadata YAML generation retrieves MD5 checksums for files based on the provided accession ID. The function currently assumes that filenames follow specific patterns: accession_id.tsv or accession_id.tsv.gz, or, for harmonised files, accession_id.h.tsv or accession_id.h.tsv.gz. However, some filenames carry an additional suffix, such as GCST90308682_buildGRCh37.tsv, which the function does not currently handle. This can cause mismatches and prevents the correct MD5 checksum from being retrieved.

Suggested Enhancement

Modify the function to handle additional filename patterns. One possible approach is to include regular expression matching to account for various patterns while maintaining the current functionality.

See http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90308001-GCST90309000/GCST90308682/
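A minimal sketch of the regular-expression approach, written in Python for illustration. It is not the project's actual implementation: the function names build_pattern and find_md5, the assumption that checksums are available as a filename-to-MD5 mapping, and the rule that an extra suffix is one or more underscore-separated alphanumeric tokens are all hypothetical. The checksum values in the usage lines are placeholders.

```python
import re
from typing import Mapping, Optional


def build_pattern(accession_id: str, harmonised: bool = False) -> re.Pattern:
    """Compile a filename pattern for one accession ID (hypothetical helper)."""
    # Assumption: extra parts look like "_buildGRCh37", i.e. underscore + alphanumerics.
    suffix = r"(?:_[A-Za-z0-9]+)*"
    # Harmonised files carry ".h" before ".tsv".
    harmonised_part = r"\.h" if harmonised else ""
    return re.compile(
        rf"^{re.escape(accession_id)}{suffix}{harmonised_part}\.tsv(?:\.gz)?$"
    )


def find_md5(
    accession_id: str,
    checksums: Mapping[str, str],
    harmonised: bool = False,
) -> Optional[str]:
    """Return the MD5 of the single matching data file, or None if none matches."""
    pattern = build_pattern(accession_id, harmonised)
    matches = [name for name in checksums if pattern.match(name)]
    if len(matches) > 1:
        # Refuse to guess when several files fit the pattern.
        raise ValueError(f"Ambiguous files for {accession_id}: {sorted(matches)}")
    return checksums[matches[0]] if matches else None


# Placeholder checksums, for illustration only.
example = {
    "GCST90308682_buildGRCh37.tsv": "placeholder-md5-raw",
    "GCST90308682.h.tsv.gz": "placeholder-md5-harmonised",
}
print(find_md5("GCST90308682", example))                   # placeholder-md5-raw
print(find_md5("GCST90308682", example, harmonised=True))  # placeholder-md5-harmonised
```

The original four patterns still match because the suffix group is optional. Anchoring the pattern at both ends keeps a longer accession ID or a sidecar file from matching by accident, and raising on multiple matches avoids silently picking the wrong checksum.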

karatugo commented 1 month ago

#1281

karatugo commented 1 month ago

Tested with the following.

https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90269001-GCST90270000/GCST90269497/
https://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90269001-GCST90270000/GCST90269497/harmonised/
http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST005001-GCST006000/GCST005529/
http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST005001-GCST006000/GCST005529/harmonised/
http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90308001-GCST90309000/GCST90308682/