Closed doulikecookiedough closed 7 months ago
@doulikecookiedough This looks good, everything looks really solid from a Python perspective. After testing and poking around for an hour I don't see anything major that needs to change for this PR. Two observations:
For good measure I am adding myself as a reviewer and approving.
Thank you @artntek and @iannesbitt for reviewing my pull request. I believe I have addressed all the feedback and am proceeding to merge into develop. If there's anything you want me to review further, please let me know and I'll open a new issue to discuss.
This pull request represents the changes required for HashStore to integrate into Metacat, where a multipart request is used to upload an object and its respective parts (e.g. the data object, form, metadata) can arrive in a different order with each request. If the data object comes first, we need to be able to store it without providing a `pid`. Currently, this is not possible as `store_object` requires a `pid` argument.

As a result, HashStore has been refactored to allow `store_object` to be called without supplying a `pid`. Additionally, objects are stored by their content identifiers (based on the HashStore default store algorithm). This is a switch back to our original proposed design, with the primary difference being how we manage where the content identifier (`cid`) of the object is located/referenced so that it can be found. Previously, the `cid` was stored with the sysmeta (metadata document) of the object in the metadata directory. In this refactor, data objects and their respective references are managed via reference files in the `.../refs/pid/` and `.../refs/cid/` folders.

The switch to the `cid` as the permanent address was made to simplify the process of storing an object. This way, we do not need to store objects into temporary files, hold the name and then have a new commit process to move the object when it's "ready". Objects are stored once, and deleted when the client decides to do so.

A reference file for a
`pid` is stored in `.../refs/pid/` with the permanent address being the sharded (sha256) hash of the `pid`, and contains the `cid` of the object it references. A `pid` ref file can only contain one `cid`. A reference file for a `cid` is stored in `.../refs/cid/` with the permanent address being the sharded `cid` itself, and the contents being a list of pids delimited with newlines (`\n`). So to find an object, you would call `find_object(pid)`, which will return the `cid` (string). Deleting an object will delete its `pid` reference, and also remove it from its respective `cid` reference file.

`find_object`
[...] `cid` [...] to prevent accidental deletions. An object is only deleted when its `cid` reference file is empty, and likewise with the `cid` ref file itself. `delete_object(pid)` will first remove its `pid` from the `cid reference file`, delete the `cid_reference_file` if it's empty, then delete its `pid reference file` and lastly, the object itself only if the `cid_reference_file`
was successfully deleted.

In conclusion, there will be two paths to store an object:

1) Data comes first
- `store_object(pid=None, data)` with just the data, the `cid` being the permanent address
- `verify_object(object_metadata, checksum, checksum_algorithm, expected_file_size)`
- `tag_object(pid, cid)`

2) Form comes first, we know the `pid`
- `store_object(pid, data, ...)`

Calling `store_object`
with the pid (and relevant additional parameters) will not only store the object, but also tag and verify the object. This is an all-in-one method if we receive the form data before the object.

Summary:

- Objects are stored with the `cid` as the permanent address.
- `store_object` has been refactored to allow for storing data only.
- A `.../refs` directory houses the `/cid/..` and `/pid/..` references, along with the supporting methods and tests to facilitate the tagging process.
- `tag_object`, `find_object` [...] `verify_object`, but after describing this pull request, it feels like it should be added. I would like to get some feedback here to confirm its inclusion.
- `delete_object`, `retrieve_object` and `get_hex_digest` Public API methods have also been updated to reflect the recent changes.

Resulting store layout:

    /objects/
      └─ d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
    /metadata/
      └─ 15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04
    /refs/
      └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e
      └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2
    hashstore.yaml
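
For anyone following the description without the diff open, the reference-file bookkeeping can be sketched roughly as below. This is a toy illustration only, not HashStore's actual implementation: the `MiniRefStore` class, the `shard` helper, the use of raw `bytes` for `data`, the keyword-only `data` parameter, and raising on a re-tagged `pid` are all assumptions made for the example. The sharding depth of 3 and width of 2 are read off the layout above, and `verify_object` is left out for brevity.

```python
import hashlib
import tempfile
from pathlib import Path


def shard(hex_digest, depth=3, width=2):
    """Split a hex digest into `depth` directories of `width` chars plus the rest."""
    parts = [hex_digest[i * width:(i + 1) * width] for i in range(depth)]
    parts.append(hex_digest[depth * width:])
    return Path(*parts)


class MiniRefStore:
    """Hypothetical, simplified store showing the pid/cid reference files."""

    def __init__(self, root):
        self.root = Path(root)

    def _object_path(self, cid):
        return self.root / "objects" / shard(cid)

    def _pid_ref(self, pid):
        # The pid ref file lives at the sharded sha256 hash of the pid.
        pid_hash = hashlib.sha256(pid.encode("utf-8")).hexdigest()
        return self.root / "refs" / "pid" / shard(pid_hash)

    def _cid_ref(self, cid):
        # The cid ref file lives at the sharded cid itself.
        return self.root / "refs" / "cid" / shard(cid)

    def store_object(self, pid=None, *, data):
        """Store bytes under their cid; tag them too when a pid is supplied."""
        cid = hashlib.sha256(data).hexdigest()
        path = self._object_path(cid)
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.write_bytes(data)
        if pid is not None:
            self.tag_object(pid, cid)
        return cid

    def tag_object(self, pid, cid):
        """Write the pid ref file (one cid) and append the pid to the cid ref file."""
        pid_ref, cid_ref = self._pid_ref(pid), self._cid_ref(cid)
        if pid_ref.exists():
            # Assumption: a pid that is already tagged is rejected.
            raise FileExistsError(f"pid already tagged: {pid}")
        pid_ref.parent.mkdir(parents=True, exist_ok=True)
        cid_ref.parent.mkdir(parents=True, exist_ok=True)
        pid_ref.write_text(cid)
        with cid_ref.open("a") as f:
            f.write(pid + "\n")

    def find_object(self, pid):
        """Return the cid (string) recorded in the pid ref file."""
        return self._pid_ref(pid).read_text()

    def delete_object(self, pid):
        cid = self.find_object(pid)
        cid_ref = self._cid_ref(cid)
        # 1. Remove the pid from the cid ref file.
        remaining = [p for p in cid_ref.read_text().splitlines() if p != pid]
        if remaining:
            cid_ref.write_text("\n".join(remaining) + "\n")
        else:
            # 2. Delete the cid ref file once it is empty.
            cid_ref.unlink()
        # 3. Delete the pid ref file.
        self._pid_ref(pid).unlink()
        # 4. Delete the object only if the cid ref file was deleted.
        if not cid_ref.exists():
            self._object_path(cid).unlink()


with tempfile.TemporaryDirectory() as tmp:
    store = MiniRefStore(tmp)
    # Path 1: data arrives first, so store without a pid and tag later.
    cid = store.store_object(data=b"example bytes")
    store.tag_object("pid.one", cid)
    # Path 2: the pid is known up front, so store and tag in one call.
    store.store_object("pid.two", data=b"example bytes")
    store.delete_object("pid.one")  # object kept: pid.two still references it
    store.delete_object("pid.two")  # last reference removed, object deleted
```

Because both pids point at identical bytes, they share one object and one `cid` ref file, which is why the object survives the first `delete_object` call and is only removed with the second.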