NASA-PDS / registry

PDS Registry provides the services and software applications necessary for tracking, searching, auditing, locating, and maintaining artifacts within the system. These artifacts include data files, label files, schemas, dictionary definitions for objects and elements, services, etc.
https://nasa-pds.github.io/registry
Apache License 2.0

As a user, I want to harvest and register alternate data file paths #86

Open mdrum opened 2 years ago

mdrum commented 2 years ago

💪 Motivation

...so that I can continue to maintain a reasonably sized archive, and not decompress millions of fit.gz files in order for them to be referenced.

📖 Additional Details

Our archive has an agreement with NSSDCA that allows us to host compressed versions of FITS images for certain bundles. In fact, they prefer it that way. This is a standard compression and decompression system (fpack, funpack) that can create files with these extensions:

The labels still reference the *.fit file, and thus the harvester cannot find the appropriate data object for each product. We would still like products in the registry to be referenced by their File URLs, but it would be a heavy burden to decompress the files in order to allow them to be referenced by their canonical filenames.

āš–ļø Acceptance Criteria

Given a bundle with compressed FITS files that use the above alternate file extensions, and labels that reference the uncompressed .fit files, when I run the harvest tool, then it crawls the archive, matches the compressed files with their products, and registers the products with those alternate paths.
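The matching step in this criterion could be sketched roughly as below. This is only an illustration, not how Harvest works or would implement it: `resolve_data_file` and `ALTERNATE_SUFFIXES` are hypothetical names, and the suffix list is an assumption, since the extensions are not enumerated in this issue (`.gz` comes from the motivation above; `.fz` is assumed here as an fpack-style suffix).

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed suffixes: the issue does not list the actual alternate extensions.
ALTERNATE_SUFFIXES = (".gz", ".fz")


def resolve_data_file(
    label_ref: Path,
    alternate_suffixes: Iterable[str] = ALTERNATE_SUFFIXES,
) -> Optional[Path]:
    """Return the on-disk path for a file referenced by a label.

    Prefers the canonical (uncompressed) path; otherwise tries the same
    name with each alternate suffix appended. Returns None if nothing exists.
    """
    if label_ref.exists():
        return label_ref
    for suffix in alternate_suffixes:
        # e.g. image.fit -> image.fit.gz
        candidate = label_ref.with_name(label_ref.name + suffix)
        if candidate.exists():
            return candidate
    return None
```

With a lookup like this, a product whose label references `image.fit` could be registered with the path of `image.fit.gz` when only the compressed file is present in the archive.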

āš™ļø Engineering Details

jordanpadams commented 2 years ago

@mdrum is this fpack file referenced anywhere in the labels? even if that is not the case now, shouldn't there be some designation in the label like "compressed filename", with the compressed files included as supplemental data products or something like that? we can build this into the registry, but it seems like a hack to support something I feel will become (and probably should become) more common in the future.

jordanpadams commented 2 years ago

@mdrum as an interim solution, this could be added to the registry using Product_Metadata_Supplemental and the supplementer: https://nasa-pds.github.io/registry/install/tools.html#supplementer

jordanpadams commented 2 years ago

looks like the docs for operating that software didn't make it into the transition to the latest documentation, but the raw version of those docs can be found here: https://github.com/NASA-PDS/pds-registry-app/blob/main/src/site/xdoc/operate/supplementer.xml.vm

mdrum commented 2 years ago

@jordanpadams To clarify, the issue right now is that Harvest does not detect the respective data files for each label, and thus the products can't be ingested. This is technically because the archive at-rest is invalid (the labels point to files that don't exist). So, supplementing the registry with this metadata would make sense, but support would still have to be baked into the tools in order to first be able to harvest the products (and then direct users to the compressed files via the API). I definitely wish we could just uncompress these bundles and not have to worry about them, but my boss keeps telling me hard drives cost money.

If this is an unreasonable request, which I understand it might be, we could potentially use this as a way to ask for more money :)
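
For anyone wanting to see how widespread the "labels point to files that don't exist" condition is in a bundle, a rough check might look like the sketch below. It assumes that labels store data file names in `file_name` elements and that those names are relative to the label's own directory; `missing_file_references` is a hypothetical helper, not part of Harvest or any PDS tool.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def missing_file_references(label_path: Path) -> List[str]:
    """List file_name values in a label that do not exist next to the label."""
    root = ET.parse(label_path).getroot()
    missing = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip any non-element nodes
        # Drop the XML namespace so the check does not depend on a
        # particular namespace URI.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "file_name" and element.text:
            name = element.text.strip()
            # Assumption: referenced files sit in the label's directory.
            if not (label_path.parent / name).exists():
                missing.append(name)
    return missing
```

Running this over every label in a bundle would give the list of references that only resolve once an alternate (compressed) path is taken into account.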

jordanpadams commented 2 years ago

@mdrum it is definitely not an unreasonable request. we will come up with some alternatives and discuss at a future SWG