Closed DSuveges closed 1 year ago
@DSuveges,
Apologies for never responding to this. I missed it! We just fixed an issue with our OpenTargets/Latest dataset where the source data was exceeding the way-too-small 50GB we had allocated for the FTP copy. We've updated that. Is this duplication issue still happening?
Hi guys,
I'm from OpenTargets. One of our users reported a problem with OT data fetched from S3: the data seems to contain unexplained duplication. We believe the problem might be due to how the data is synced from the EBI FTP. The datasets our pipelines generate via Spark are partitioned into smaller chunks, with filenames containing a release-specific hash. As the hash differs from release to release, the line below probably will not overwrite the content of the S3 buckets; instead, these chunks keep accumulating.
https://github.com/aws-samples/data-lake-as-code/blob/50f57f5b4b81773dfd0a67ab393fe10285899277/scripts/ssmdoc.import.opentargets.latest.json#L30
For more details, please see the issue in our tracker.
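To illustrate the suspected mechanism, here is a minimal Python sketch. The file names and the `find_stale_keys` helper are made up for illustration and are not taken from the import script; the sketch only assumes that each release writes chunks under new hashed names and that the copy step never removes anything from the bucket.

```python
def find_stale_keys(source_files, bucket_keys):
    """Return bucket keys that are not part of the current source release."""
    current = set(source_files)
    return sorted(key for key in bucket_keys if key not in current)


# Hypothetical chunk names: same partitions, different release hash.
old_release = ["part-00000-aaa111.json", "part-00001-aaa111.json"]
new_release = ["part-00000-bbb222.json", "part-00001-bbb222.json"]

# A copy that never deletes leaves both releases side by side in the bucket.
bucket_after_copy = old_release + new_release

stale = find_stale_keys(new_release, bucket_after_copy)
print(stale)
```

Anything returned by `find_stale_keys` is a leftover chunk from an earlier release, and a reader loading the whole prefix would see its rows as duplicates. If the import step happens to use `aws s3 sync`, the `--delete` flag removes destination files that are absent from the source; clearing the target prefix before each copy would be another option.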