NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

"Granule not found" errors occur when "duplicateHandling" is set to "skip" #65

Open chuckwondo opened 2 years ago

chuckwondo commented 2 years ago

For the PSScene3Band collection, setting "duplicateHandling" to "skip" (rather than "replace") in order to avoid unnecessary ingestion (and related costs) causes the DiscoverGranules step of the DiscoverAndQueueGranules workflow to fail with "granule not found" errors. This happens for the same reason as #32. We must somehow prefix the granule IDs with PSScene3Band- before discovery checks for duplicates. This is a harder task than the fix for #32, because Cumulus provides no means of inserting custom logic between the "list granules" step and the "check for duplicates" step, so we cannot tweak the granule IDs after they are listed but before they are checked for duplicates.
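The ID mismatch described above can be pictured with a rough sketch. It assumes that previously ingested granules are recorded under prefixed IDs while discovery lists unprefixed ones. The granule ID, the `ingestedGranuleIds` set, and the `isAlreadyIngested` helper are all hypothetical. The sketch does not reproduce Cumulus internals or the actual "granule not found" error.

```typescript
// Hypothetical granule ID as discovery might list it (no collection prefix).
const discoveredGranuleId = "20200101_000000_0001";

// Assumption: already-ingested granules are recorded under prefixed IDs.
const ingestedGranuleIds = new Set<string>([
  "PSScene3Band-20200101_000000_0001",
]);

// Hypothetical stand-in for the duplicate check performed during discovery.
function isAlreadyIngested(granuleId: string): boolean {
  return ingestedGranuleIds.has(granuleId);
}

// Looking up the unprefixed ID misses the existing record...
const foundWithoutPrefix = isAlreadyIngested(discoveredGranuleId);

// ...while looking up the prefixed ID finds it.
const foundWithPrefix = isAlreadyIngested(`PSScene3Band-${discoveredGranuleId}`);

console.log({ foundWithoutPrefix, foundWithPrefix });
```

If this picture is right, the duplicate check can only match once the prefix is applied before the check runs, which is the part Cumulus offers no hook for.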

Acceptance criteria: Configuring "duplicateHandling" as "skip" on the PSScene3Band collection does not produce "granule not found" errors during discovery, and properly skips granules that have already been ingested. The logic should also work for other collections, but given that we currently have only the PSScene3Band collection available, testing against other collections is not required at this point.

chuckwondo commented 2 years ago

One approach to consider is to use proxyquire to "inject" custom logic for the list method of the "s3" protocol provider in Cumulus. This could possibly be done by reworking our existing logic that adds the collection name as a prefix to the granule IDs: rather than applying the prefix after discovery is complete, we would either "inject" the prefixing logic into a custom list method implementation, or subclass the Cumulus S3ProviderClient class and override its list method. Subclassing would also require using proxyquire to override the providerClientUtils.buildProviderClient function, so that use of the "s3" protocol is "intercepted" and our subclass is used instead.
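The subclassing idea might look roughly like the sketch below. It is self-contained and does not import Cumulus: `StubS3ProviderClient`, `ListedFile`, `prefixFileName`, and `PrefixingS3ProviderClient` are hypothetical names, and the real S3ProviderClient constructor and list signature would need to be checked against the Cumulus source. It also assumes that the granule ID is derived from the listed file name, so that prefixing the name prefixes the ID.

```typescript
// Hypothetical shape of an entry returned by a provider client's list method.
interface ListedFile {
  name: string;
  path: string;
}

// Hypothetical stand-in for the Cumulus S3 provider client (not the real class).
class StubS3ProviderClient {
  private readonly files: ListedFile[];

  constructor(files: ListedFile[]) {
    this.files = files;
  }

  async list(_path: string): Promise<ListedFile[]> {
    return this.files;
  }
}

// Prefix a listed file name with the collection name, unless already prefixed.
function prefixFileName(collectionName: string, file: ListedFile): ListedFile {
  const prefix = `${collectionName}-`;
  return file.name.startsWith(prefix)
    ? file
    : { ...file, name: `${prefix}${file.name}` };
}

// Subclass that applies the prefix inside list, before any later step sees the names.
class PrefixingS3ProviderClient extends StubS3ProviderClient {
  private readonly collectionName: string;

  constructor(files: ListedFile[], collectionName: string) {
    super(files);
    this.collectionName = collectionName;
  }

  async list(path: string): Promise<ListedFile[]> {
    const files = await super.list(path);
    return files.map((file) => prefixFileName(this.collectionName, file));
  }
}
```

The proxyquire wiring that would make buildProviderClient return such a subclass for the "s3" protocol is not shown, since it depends on module paths inside Cumulus. Whether rewriting the listed name is safe for later steps that still need the real S3 key is an open question this sketch does not answer.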