Reason for this pull request
Duplicate files on disc were indexed by the sync tool during weekly orchestration on landsat scenes.
Also, when the sync tool detects that a file is no longer available on the NCI file system, it updates the location in the index.
Together, these processes created a domino effect, resulting in a large number of duplicate datasets, locationless parents, and archived parents with derived children.
Datasets with such discrepancies need to be reported and/or archived or deleted from the file system, and the index needs to be updated to match.
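
To make the three discrepancy classes concrete, here is a minimal sketch of how they could be detected. It is not the code in execute_coherence: the `Dataset` record, its fields (including `scene_key`), and `find_discrepancies` are hypothetical stand-ins for the real index schema, and "duplicate" is assumed to mean more than one active dataset for the same scene.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    # Hypothetical stand-in for an index record, not the real index schema.
    id: str
    scene_key: str                      # assumed key identifying "the same scene"
    locations: List[str] = field(default_factory=list)
    archived: bool = False
    parent_id: Optional[str] = None


def find_discrepancies(datasets: List[Dataset]) -> Dict[str, List[str]]:
    """Group dataset ids by the kind of discrepancy they show."""
    active_by_scene = defaultdict(list)   # scene_key -> active dataset ids
    active_children = defaultdict(list)   # parent id -> active child ids
    for ds in datasets:
        if ds.archived:
            continue
        active_by_scene[ds.scene_key].append(ds.id)
        if ds.parent_id is not None:
            active_children[ds.parent_id].append(ds.id)

    report = {"duplicate": [], "locationless_parent": [], "archived_parent": []}

    # More than one active dataset for the same scene counts as a duplicate.
    for ids in active_by_scene.values():
        if len(ids) > 1:
            report["duplicate"].extend(sorted(ids))

    # Parents that still have active derived children but are themselves
    # archived, or have lost every location in the index.
    for ds in datasets:
        if not active_children.get(ds.id):
            continue
        if ds.archived:
            report["archived_parent"].append(ds.id)
        elif not ds.locations:
            report["locationless_parent"].append(ds.id)
    return report
```

Each list in the returned report can then be reviewed separately, since the appropriate action (archive, delete, or re-index) differs per class.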
Proposed solutions
The following workflow is used to clean up erroneous datasets:
1) Update the execute_coherence script to report all such erroneous datasets.
2) Automate erroneous dataset reporting via the serverless config.
3) Validate the reported datasets and archive them manually (temporary solution).
4) Update and run the execute_clean script to trash all files on disc whose index entries were archived.
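
As an illustration of step 4, the sketch below moves files into a trash folder instead of deleting them outright, with a dry-run mode for checking the plan first. It is an assumption about how such a step could look, not the actual execute_clean script: `trash_archived_files`, `trash_root`, and the dry-run default are hypothetical, and the list of paths is assumed to come from the archived index entries.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def trash_archived_files(
    archived_paths: Iterable[Path],
    trash_root: Path,
    dry_run: bool = True,
) -> List[Path]:
    """Move files belonging to archived datasets under trash_root.

    Returns the destination paths (planned ones when dry_run is True).
    """
    destinations = []
    for src in archived_paths:
        if not src.exists():
            continue  # nothing on disc to clean up for this entry
        resolved = src.resolve()
        # Mirror the absolute path under the trash folder so that a file
        # can be restored to its original place if it was trashed by mistake.
        dest = trash_root / resolved.relative_to(resolved.anchor)
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(resolved), str(dest))
        destinations.append(dest)
    return destinations
```

Running with `dry_run=True` first gives a list that can be validated by hand, in line with the manual validation in step 3, before anything is moved.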