aws-samples / data-lake-as-code

Data Lake as Code, featuring ChEMBL and OpenTargets
MIT No Attribution

OpenTargets dataset update in the S3 buckets #24

Closed DSuveges closed 1 year ago

DSuveges commented 2 years ago

Hi Guys,

I'm from OpenTargets. One of our users reported a problem with the OT data fetched from S3: the data seems to contain unexplained duplication. We believe the problem might be due to how the data is synced from the EBI FTP. The datasets our pipelines generate via Spark are partitioned into smaller chunks whose filenames contain a release-specific hash. As the hash differs from release to release, the line below probably does not overwrite the content of the S3 buckets; instead, the chunks keep accumulating.

https://github.com/aws-samples/data-lake-as-code/blob/50f57f5b4b81773dfd0a67ab393fe10285899277/scripts/ssmdoc.import.opentargets.latest.json#L30

For more details, please see the issue in our tracker.
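The accumulation described above can be illustrated with a minimal sketch. It models a bucket as a Python dictionary rather than calling S3, and the file names, hashes, and the two helper functions are hypothetical; the linked import line is not reproduced here, so how it actually copies the data is an assumption. An additive copy keeps every chunk from earlier releases, while a mirroring copy (comparable in spirit to the `--delete` option of `aws s3 sync`) removes chunks that no longer exist in the source.

```python
def additive_sync(source, destination):
    """Copy every source file into the destination and never delete anything."""
    merged = dict(destination)
    merged.update(source)
    return merged


def mirror_sync(source, destination):
    """Copy every source file, then drop destination files absent from the source."""
    merged = additive_sync(source, destination)
    return {name: content for name, content in merged.items() if name in source}


# Hypothetical chunk names: same data, but the hash changes with each release.
release_a = {
    "part-00000-aaaa.parquet": "rows 1-100",
    "part-00001-aaaa.parquet": "rows 101-200",
}
release_b = {
    "part-00000-bbbb.parquet": "rows 1-100",
    "part-00001-bbbb.parquet": "rows 101-200",
}

# Additive copy: chunks from both releases end up side by side.
accumulated = additive_sync(release_b, additive_sync(release_a, {}))

# Mirroring copy: only the chunks of the latest release remain.
mirrored = mirror_sync(release_b, mirror_sync(release_a, {}))

print(len(accumulated), "files after additive sync")
print(len(mirrored), "files after mirror sync")
```

Because the chunk names never collide across releases, the additive copy has nothing to overwrite, and a reader of the whole prefix sees each row once per release.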

paulu-aws commented 1 year ago

@DSuveges ,

Apologies for never responding to this. I missed it! We just fixed an issue with our OpenTargets/Latest dataset where the source data was exceeding the way-too-small 50GB we had allocated for the FTP copy. We've updated that. Is this duplication issue still happening?

DSuveges commented 1 year ago

Hi @paulu-aws, thanks for getting back to me! My colleague opened a PR that supposedly resolved the problem. As the PR got merged and we have had no complaints from our users since, we considered the issue resolved.