DataONEorg / hashstore

HashStore, a hash-based object store for DataONE data packages
Apache License 2.0
1 stars 1 forks source link

Feature-73: `store_object` Refactor (with References) #74

Closed doulikecookiedough closed 7 months ago

doulikecookiedough commented 8 months ago

This pull request represents the changes required for HashStore to integrate into Metacat - where a multipart request is used to upload an object, and it's respective parts (ex. the data object, form, metadata, etc.) can arrive in a different order with each request. If the data object comes first - we need to be able to store it without providing a pid. Currently, this is not possible as store_object requires a pid argument.

As a result, HashStore has been refactored to allow store_object to be called without supplying a pid. Additionally, objects are stored by their content identifiers (based on the HashStore default store algorithm). This is a switch back to our original proposed design, with the primary difference being the process in which we manage where the content identifier (cid) of the object is located/referenced so that it can be found. Previously, the cid was stored with the sysmeta (metadata document) of the object in the metadata directory. In this refactor, data objects and their respective references are managed via references files in the .../refs/pid/ and .../refs/cid/ folder.

A reference file for a pid is stored in .../refs/pid/ with the permanent address being the sharded (sha256) hash of the pid, and contains the cid of the object it references. A pid ref file can only contain one cid. A reference file for a cid is stored in .../refs/cid/ with the permanent address being the sharded cid itself, and the contents being a list of pids delimited with new lines (\n). So to find an object, you would call find_object(pid) which will return the cid (string). Deleting an object will delete its pid reference, and also remove it from its respective cid reference file.

In conclusion, there will be two paths to store an object: 1) Data comes first - store_object(pid=None, data) with just the data

Summary:

/objects/ └─ d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2 /metadata/ └─ 15/8d/7e/55c36a810d7c14479c9...b20d7df66768b04 /refs/ └─ pid/0d/55/5e/d77052d7e166017f779...7230bcf7abcef65e └─ cid/d5/95/3b/d802fa74edea72eb941...00d154a727ed7c2 hashstore.yaml



@iannesbitt and @artntek - Could you guys please help review this pull request when you have scope?
@mbjones - If you have time, I would appreciate some feedback as well.
iannesbitt commented 7 months ago

@doulikecookiedough This looks good, everything looks really solid from a Python perspective. After testing and poking around for an hour I don't see anything major that needs to change for this PR. Two observations:

  1. Some of the function names get quite long, but I appreciate that they are descriptive.
  2. Completing #70 and adding type hints to your function definitions would make the code more readable, but it seems like you are working up to that point.

For good measure I am adding myself as a reviewer and approving.

doulikecookiedough commented 7 months ago

Thank you @artntek and @iannesbitt for reviewing my pull request. I believe I have addressed all the feedback and am proceeding to merge into develop. If there's anything you want me to review further, please let me know and I'll open a new issue to discuss.