The Open Targets Community recently reported overlapping disease terms and multiple entries for disease-gene associations in the data on S3; more details here.
At Open Targets, we have compared the data in our EBI FTP repository with the data that is available on S3 at
We have also reviewed the process used to get the latest Open Targets release into S3, here
Open Targets ETL output uses non-overlapping file names. The S3 synchronization process does clean up the local folder that receives the latest Open Targets data (see this) before syncing the data to the destination bucket using
aws s3 sync
However, this sync is performed as an aggregation (see this): objects already in the bucket that are no longer present in the local folder are left in place, so files from a previous release can remain alongside the new ones.
We believe this may be the cause of the data discrepancy.
This PR addresses it by making sure that the data syncing is performed as a mirroring operation.
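To illustrate the difference between aggregation and mirroring, here is a minimal Python sketch that models a one-way sync over in-memory dictionaries. It is not the code in this PR: the `sync` function and the file names are hypothetical and chosen only for illustration. With `aws s3 sync`, mirroring behaviour is typically obtained with the `--delete` option, which removes destination objects that do not exist in the source; whether the change here uses exactly that option is an assumption.

```python
def sync(local, bucket, delete=False):
    """Model a one-way sync from a local folder to a bucket.

    `local` and `bucket` map file names to contents (hypothetical data).
    With delete=False the result is an aggregation of both sides;
    with delete=True the result mirrors `local` exactly.
    """
    result = dict(bucket)
    result.update(local)  # upload new or changed files
    if delete:
        # drop objects that no longer exist in the local folder
        result = {name: data for name, data in result.items() if name in local}
    return result


# Two releases with non-overlapping file names (hypothetical names).
previous_release = {"part-0001.json": "old", "part-0002.json": "old"}
latest_release = {"part-0003.json": "new", "part-0004.json": "new"}

aggregated = sync(latest_release, previous_release)
mirrored = sync(latest_release, previous_release, delete=True)

print(sorted(aggregated))  # files from both releases
print(sorted(mirrored))    # files from the latest release only
```

Because the file names do not overlap, the aggregated result holds both releases side by side, which would be consistent with the duplicate entries reported.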
Kind Regards,
Manuel
Open Targets.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.