Materials-Consortia / OPTIMADE

Specification of a common REST API for access to materials databases
https://optimade.org/specification
Creative Commons Attribution 4.0 International

Supporting indexing of archives in the `/files` endpoint #484

Open ml-evs opened 1 year ago

ml-evs commented 1 year ago

Currently, our file entry type has a 1-to-1 mapping with a file on disk. However, there are many cases where databases serve archive files that aggregate data on multiple structures. I would like to be able to run an index over the contents of an archive file and list them as separate `files` entries, e.g.,

{"id": "archive_1_bs_1", "attributes": {"url": "http://example.com/archive.tar.gz", "name": "bs_1.bands", "relpath": "bandstructures/bs_1.bands"}}
{"id": "archive_1_bs_2", "attributes": {"url": "http://example.com/archive.tar.gz", "name": "bs_2.bands", "relpath": "bandstructures/bs_2.bands"}}

where a client can be smart enough to only download the archive once. Each file can then have a relationship with the structure that the data pertains to.
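As a rough illustration of the "download once" idea, a client could group the entries by `url` and then read each `relpath` out of the archive it fetched. This is only a sketch: `group_by_archive` and `extract_members` are hypothetical helper names, `relpath` is the proposed (not yet specified) field, and the archive is assumed to be a tar file that has already been downloaded into memory.

```python
import io
import tarfile
from collections import defaultdict


def group_by_archive(entries):
    """Map each archive URL to the relative paths requested from it."""
    grouped = defaultdict(list)
    for entry in entries:
        attrs = entry["attributes"]
        # "relpath" is the proposed field, not part of the current spec.
        grouped[attrs["url"]].append(attrs["relpath"])
    return dict(grouped)


def extract_members(archive_bytes, relpaths):
    """Read the requested members from an already-downloaded tar archive."""
    contents = {}
    # mode "r:*" lets tarfile detect the compression (gzip, bz2, xz, none).
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for relpath in relpaths:
            # extractfile raises KeyError for a missing member and returns
            # None for members that are not regular files or links.
            member = tar.extractfile(relpath)
            if member is not None:
                contents[relpath] = member.read()
    return contents
```

With this shape, the number of downloads scales with the number of distinct URLs rather than with the number of `files` entries.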

`relpath` (or `relative_path`) is not part of the current spec, but I think it would be useful and easy to add. The archiving mechanism/compression should be handled by our current fields (e.g., "media_type": "application/tar+gzip" for the archive above), but we will lose some info on the size/type of the file at `relpath` after extraction (although the description seems to be the intended way to handle this anyway for files without defined MIME types).
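To make the indexing step concrete, here is a minimal sketch of how a provider might generate such entries from a tar archive. Everything here is an assumption for illustration: `index_archive` is a hypothetical helper, the `id` scheme simply mirrors the example IDs above, `relpath` is the proposed field, and the member's size is recorded in `description` since the other fields would describe the archive itself.

```python
import io
import posixpath
import tarfile


def index_archive(archive_bytes, url, id_prefix, media_type):
    """Build one files-like entry per regular file inside a tar archive."""
    entries = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other special members
            name = posixpath.basename(member.name)
            stem = posixpath.splitext(name)[0]
            entries.append({
                # Hypothetical ID scheme; stems shared across directories
                # would collide and need a different scheme.
                "id": f"{id_prefix}_{stem}",
                "type": "files",
                "attributes": {
                    "url": url,                # URL of the whole archive
                    "media_type": media_type,  # describes the archive, not the member
                    "name": name,
                    "relpath": member.name,    # proposed field, not in the spec
                    "description": f"Archive member, {member.size} bytes uncompressed",
                },
            })
    return entries
```

Keeping the member size in `description` is just one option; a dedicated field would be easier for clients to parse but would need to be added to the spec.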

If others agree this is useful then I am happy to concoct a PR.

ml-evs commented 1 year ago

cc @eimrek and @unkcpz

merkys commented 5 months ago

Good idea. I wonder whether we can re-use JSON:API relationships to describe every archive member in the same way the archive itself would be described. We would certainly need to introduce a property for the relative path, as `name` is supposed to hold only the basename.
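One possible reading of this suggestion, sketched in Python: the archive is an ordinary `files` entry, and each member is another `files` entry that points back to it through a relationship. The helper name `member_entry`, the `relative_path` property and the use of a `files`-to-`files` relationship are all assumptions here, not something the spec currently defines.

```python
def member_entry(archive_id, member_id, relative_path):
    """Describe an archive member as a files entry linked to its archive."""
    return {
        "id": member_id,
        "type": "files",
        "attributes": {
            # name holds only the basename of the member
            "name": relative_path.rsplit("/", 1)[-1],
            # hypothetical new property carrying the path inside the archive
            "relative_path": relative_path,
        },
        "relationships": {
            # JSON:API resource identifier pointing at the archive's entry
            "files": {"data": [{"type": "files", "id": archive_id}]},
        },
    }


entry = member_entry("archive_1", "archive_1_bs_1", "bandstructures/bs_1.bands")
```

In this layout the archive's URL, media type and size would live only on the archive entry, and a client would follow the relationship to find them.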