PGScatalog / pgsc_calc

The Polygenic Score Catalog Calculator is a nextflow pipeline for polygenic score calculation
https://pgsc-calc.readthedocs.io/en/latest/
Apache License 2.0

Improving information about PGS Catalog scoring file versions #348

Open mglev1n opened 3 months ago

mglev1n commented 3 months ago

Description of feature

It would be great if there were some version control, such that the state of the PGS Catalog could be reconstructed as of a given date. At minimum, publishing a running changelog (e.g. when a change occurred, what was changed, why the change was made) would be extremely useful. Apologies if this feature already exists; if it does, making it more prominent would be great. A longer-term goal might be to allow users of pgsc_calc to request scores based on a given version/release of the PGS Catalog.

Motivation

My lab and collaborators have noticed that executing the exact same pgsc_calc command that pulls scores from the PGS Catalog has resulted in different output when run on different days. In troubleshooting, we noticed a few issues:

It's great that archived versions of each scorefile are maintained on the PGS Catalog FTP site, which eventually allowed us to troubleshoot these issues. However, tracking down these individual scorefile changes is very time-consuming, particularly as the number of scores and the number of archived versions increase. This problem also raises the potential for broader transparency/reproducibility issues. Thanks for all the hard work making this resource possible!

smlmbrt commented 3 months ago

@mglev1n, thanks for your comments and suggestions! I think there are some easy things we can do to better expose the version of the scores (putting the release date in the file header and adding that to the report), and we will discuss the feasibility of a changelog on our side.

We do provide MD5 checksums for the scores so that versions can be compared. It is possible to download all the scorefiles you want using our Python package (https://pypi.org/project/pgscatalog-utils/, https://pygscatalog.readthedocs.io/en/latest/) and use those downloads as a stable input to the pipeline (or extract them from the first run of the pipeline for re-use).
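
For anyone wanting to check whether a local set of downloaded scorefiles has drifted between runs, here is a minimal sketch of the MD5 comparison idea. It uses only the Python standard library and does not call pgscatalog-utils. The function names (`md5_of_file`, `build_manifest`, `diff_manifests`) are hypothetical, and the `*.txt.gz` glob is an assumption about how the downloaded scoring files are named, so adjust it to match your directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(scorefile_dir, pattern="*.txt.gz"):
    """Map each scoring file name in a directory to its MD5 digest.

    The default pattern is an assumption about scorefile naming.
    """
    files = sorted(Path(scorefile_dir).glob(pattern))
    return {p.name: md5_of_file(p) for p in files}


def diff_manifests(old, new):
    """Report files added, removed, or changed between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

One way to use this: call `build_manifest` on the download directory after the first run, save the result (for example with `json.dump`), and on later runs pass the saved and fresh manifests to `diff_manifests`. A non-empty `changed` list means a file with the same name now has different content, which is the situation described in the original report.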