aws-samples / data-lake-as-code

Data Lake as Code, featuring ChEMBL and OpenTargets
MIT No Attribution

Mirror Open Targets latest data release to prevent release aggregation #25

Closed mbdebian closed 1 year ago

mbdebian commented 1 year ago

The Open Targets Community recently reported an overlap in disease terms / multiple entries for disease-gene associations in the data on S3 (more details here).

At Open Targets, we have compared the data in our EBI FTP repository with the data available on S3 at

s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/

We've also reviewed the process used to get the latest Open Targets release into S3 (here).

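The Open Targets ETL output uses non-overlapping file names, so files from different releases never overwrite each other. The S3 synchronization process does clean up the local folder that receives the latest Open Targets data (see this) before syncing it to the destination bucket using

aws s3 sync

However, this sync is performed as an aggregation (see this): files already in the bucket that are not part of the new release are left in place, so data from previous releases accumulates alongside the latest one.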

We believe that may be the cause of this data discrepancy.

This PR addresses it by making sure that the data syncing is performed as a mirroring operation.
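
The difference between the two behaviours can be illustrated with a small local sketch. It does not call AWS at all: two helper functions (hypothetical names, `aggregate_sync` and `mirror_sync`) model an aggregating sync and a mirroring sync on temporary directories, and the part file names are made up. With the AWS CLI, mirroring is normally obtained by passing `--delete` to `aws s3 sync`; whether this PR uses exactly that flag is an assumption, as the diff is not shown here.

```shell
#!/usr/bin/env bash
set -eu

# Local stand-ins for the staging folder and two destinations (hypothetical names).
work="$(mktemp -d)"
mkdir -p "$work/staging" "$work/aggregated" "$work/mirrored"

# Copy SRC into DST without deleting anything already in DST.
# Models: aws s3 sync SRC DST
aggregate_sync() {
  cp -R "$1"/. "$2"/
}

# Make DST match SRC (flat folders only): drop entries missing from SRC, then copy.
# Models: aws s3 sync SRC DST --delete
mirror_sync() {
  for entry in "$2"/*; do
    [ -e "$entry" ] || continue
    [ -e "$1/$(basename "$entry")" ] || rm -rf "$entry"
  done
  cp -R "$1"/. "$2"/
}

# Release 1: the ETL writes one uniquely named part file (made-up name).
echo '{"release": 1}' > "$work/staging/part-00000-aaa.json"
aggregate_sync "$work/staging" "$work/aggregated"
mirror_sync "$work/staging" "$work/mirrored"

# Release 2: staging is cleaned, then a differently named part file arrives.
rm -f "$work/staging"/*
echo '{"release": 2}' > "$work/staging/part-00000-bbb.json"
aggregate_sync "$work/staging" "$work/aggregated"
mirror_sync "$work/staging" "$work/mirrored"

aggregated_count="$(ls "$work/aggregated" | wc -l | tr -d ' ')"
mirrored_count="$(ls "$work/mirrored" | wc -l | tr -d ' ')"
echo "aggregated: $aggregated_count file(s), mirrored: $mirrored_count file(s)"
```

Running the sketch prints `aggregated: 2 file(s), mirrored: 1 file(s)`: the aggregated destination still holds the release 1 file next to the release 2 file, which is the kind of overlap reported above, while the mirrored destination only holds the latest release.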

Kind Regards,
Manuel
Open Targets

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.