GeoscienceAustralia / dea-orchestration


Cleanup duplicate/locationless parent/archived parent with derived children #86

Closed santoshamohan closed 5 years ago

santoshamohan commented 5 years ago

Reason for this pull request

Duplicate files on disk were indexed by the sync tool during the weekly orchestration of Landsat scenes. In addition, when the sync tool detects that a file is no longer available on the NCI file system, it updates the location in the index. Together, these processes created a domino effect, resulting in a large number of duplicate datasets, locationless parents, and archived parents with derived children.

Datasets with such discrepancies need to be reported and/or archived or deleted from the file system, and the index needs to be updated accordingly.

Proposed solutions

The following workflow is used to clean up erroneous datasets:

1) Update the execute_coherence script to report all such erroneous datasets.
2) Automate erroneous dataset reporting via the serverless config.
3) Validate the reported datasets and archive them manually (temporary solution).
4) Update and run the execute_clean script to trash all files on disk whose index entries have been archived.
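To illustrate the kind of check the reporting step performs, here is a minimal sketch of classifying the three discrepancy types named above. It is not the execute_coherence script: `DatasetRecord`, `scene_key`, and `find_erroneous` are hypothetical names, and the records are a simplified in-memory stand-in for what the index would return.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical, simplified view of an indexed dataset (not the real index model)."""
    id: str
    locations: List[str] = field(default_factory=list)
    archived: bool = False
    parent_ids: List[str] = field(default_factory=list)
    # Assumed key identifying "the same" scene; None means "not checked for duplicates".
    scene_key: Optional[str] = None


def find_erroneous(records: List[DatasetRecord]) -> Dict[str, List[str]]:
    """Return dataset ids grouped by discrepancy type."""
    by_id = {r.id: r for r in records}

    # Map each parent id to the ids of datasets derived from it.
    children = defaultdict(list)
    for r in records:
        for parent_id in r.parent_ids:
            children[parent_id].append(r.id)

    report = {"duplicate": [], "locationless_parent": [], "archived_parent": []}

    # Duplicates: more than one active dataset sharing the same scene key.
    by_key = defaultdict(list)
    for r in records:
        if r.scene_key is not None and not r.archived:
            by_key[r.scene_key].append(r.id)
    for ids in by_key.values():
        if len(ids) > 1:
            report["duplicate"].extend(ids)

    # Parents that still have active derived children but are archived
    # or have no remaining location.
    for parent_id, child_ids in children.items():
        parent = by_id.get(parent_id)
        if parent is None:
            continue  # parent not in this batch; nothing to check
        if not any(not by_id[c].archived for c in child_ids):
            continue  # all children already archived
        if parent.archived:
            report["archived_parent"].append(parent_id)
        elif not parent.locations:
            report["locationless_parent"].append(parent_id)

    return {kind: sorted(ids) for kind, ids in report.items()}
```

The report only lists candidate ids; under the workflow above they would still be validated manually before anything is archived or trashed.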