DataONEorg / hashstore

HashStore, a hash-based object store for DataONE data packages
Apache License 2.0
1 stars 1 forks source link

`tag_object` fails to update `cid_refs_file` #97

Closed doulikecookiedough closed 1 month ago

doulikecookiedough commented 1 month ago

The object associated with arctic-data.6188.1 exists, along with a cid_refs_file, but the cid_refs_file is not tagged with arctic-data.6188.1. There appears to be a syncing issue when tagging objects. This occurred approximately 5 times when attempting to store the first 100000 data objects from test.arcticdata.io that are less than 1gb in size. Investigate the tag_object process.

Example Log for pid arctic-data.6188.1

2024-05-23 14:37:59 - DEBUG - FileHashStore - store_object: Request to store object for pid: arctic-data.6188.1
2024-05-23 14:37:59 - DEBUG - FileHashStore - store_object: Adding pid: arctic-data.6188.1 to object_locked_pids.
2024-05-23 14:37:59 - DEBUG - FileHashStore - store_object: Attempting to store object for pid: arctic-data.6188.1
2024-05-23 14:37:59 - DEBUG - FileHashStore - put_object: Request to put object for pid: arctic-data.6188.1
2024-05-23 14:37:59 - DEBUG - FileHashStore - _move_and_get_checksums: Creating temp file and calculating checksums for pid: arctic-data.6188.1
2024-05-23 14:37:59 - DEBUG - FileHashStore - put_object: Successfully put object for pid: arctic-data.6188.1
2024-05-23 14:37:59 - DEBUG - FileHashStore - tag_object: Tagging object cid: {a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f} with pid: {arctic-data.6188.1}.
2024-05-23 14:37:59 - DEBUG - FileHashStore - write_cid_refs_file: Writing id (arctic-data.6188.1) into file: /home/mok/testing/knbvm_python2/refs/tmp
2024-05-23 14:37:59 - ERROR - FileHashStore - _verify_hashstore_references: Cid refs file exists (/home/mok/testing/knbvm_python2/refs/cids/a8/2e/b0/67ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) but pid (arctic-data.6188.1) not found. Additional Context:  Created all refs files
2024-05-23 14:37:59 - ERROR - FileHashStore - store_object: failed to store object for pid: arctic-data.6188.1. Reference files will not be created or tagged. Unexpected error: FileHashStore - _verify_hashstore_references: Cid refs file exists (/home/mok/testing/knbvm_python2/refs/cids/a8/2e/b0/67ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) but pid (arctic-data.6188.1) not found. Additional Context:  Created all refs files
2024-05-23 14:37:59 - DEBUG - FileHashStore - store_object: Removing pid: arctic-data.6188.1 from object_locked_pids.

The cid refs file affected contains two pids, but not arctic-data.6188.1 which was properly stored/verified, but somehow dropped:

arctic-data.6186.1
arctic-data.6190.1

Example Log 2 for pid tao.14330.1

2024-05-23 14:57:48 - ERROR - FileHashStore - _verify_hashstore_references: Cid refs file exists (/home/mok/testing/knbvm_python2/refs/cids/96/bb/d5/de61f36b0e10c5771d180998d066192e8986aa34a8cb7c453f62959274) but pid (tao.14330.1) not found. Additional Context:  Created all refs files
2024-05-23 14:57:48 - ERROR - FileHashStore - store_object: failed to store object for pid: tao.14330.1. Reference files will not be created or tagged. Unexpected error: FileHashStore - _verify_hashstore_references: Cid refs file exists (/home/mok/testing/knbvm_python2/refs/cids/96/bb/d5/de61f36b0e10c5771d180998d066192e8986aa34a8cb7c453f62959274) but pid (tao.14330.1) not found. Additional Context:  Created all refs files

The cid refs file content at path /home/mok/testing/knbvm_python2/refs/cids/96/bb/d5/de61f36b0e10c5771d180998d066192e8986aa34a8cb7c453f62959274:

tao.14361.1
tao.14328.1
tao.14363.1
urn:uuid:040ecbdf-009c-4d2c-bf4a-0667e7dfa3b6
urn:uuid:07482397-7b9a-47af-b8af-5aac3aade05b
urn:uuid:100c09f8-7339-413b-afbc-b95d72859e6b
urn:uuid:113aa58f-e593-403f-8118-9ad8c51fafd2
urn:uuid:15711865-4fa0-4302-a5fd-263049cef44c
doulikecookiedough commented 1 month ago

Update:

This issue appears to be resolved with the solution of checking for the existence of a cid refs file at the time of, rather than predetermining its existence and moving forward in an elif statement that evaluates booleans. Testing has gone well with 100000 objects from knbvm/test.arcticdata.io - both pid and cid refs files match those generated from the java library.

While the current fix has resolved the problems described in this issue, it does not mean the methods are completely thread/multiprocessing safe. As observed in the debugging-level logs below, two processes can still enter a shared list to check for a cid, which will cause one cid refs file to replace another, which explains why some pids were initially missing from the cid refs file.

Investigation is ongoing:

# Three processes attempt to tag the cid
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object: Tagging object cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid: arctic-data.6186.1.
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object: Tagging object cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid: arctic-data.6188.1.
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object: Tagging object cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid: arctic-data.6190.1.

# Two processes hit the synchronization block at the same time, and both passed
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object (mp): Locking cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f to to tag pid: arctic-data.6186.1.
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object (mp): Locking cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f to to tag pid: arctic-data.6188.1.
# The other thread wasn't as fast, and is waiting
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object (mp): (cid) a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f is currently locked. Waiting.

# Temp files are created
2024-05-27 13:04:33 - DEBUG - FileHashStore - _write_refs_file: Writing id (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) into a tmp file in: /home/mok/testing/knbvm_python/refs/tmp
2024-05-27 13:04:33 - DEBUG - FileHashStore - _write_refs_file: Writing id (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) into a tmp file in: /home/mok/testing/knbvm_python/refs/tmp

# The cid refs files here overwrite one another with the one above
# The first pid is tagged
2024-05-27 13:04:33 - DEBUG - FileHashStore - _verify_hashstore_references: verifying pid (arctic-data.6186.1) and cid (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) refs files. Additional Note: Reference files have been moved to their permanent location.
2024-05-27 13:04:33 - INFO - FileHashStore - tag_object: Successfully tagged cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid arctic-data.6186.1
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object (mp): Removing cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f from reference_locked_cids.
# The second pid is tagged
2024-05-27 13:04:33 - DEBUG - FileHashStore - _verify_hashstore_references: verifying pid (arctic-data.6188.1) and cid (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) refs files. Additional Note: Reference files have been moved to their permanent location.
2024-05-27 13:04:33 - INFO - FileHashStore - tag_object: Successfully tagged cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid arctic-data.6188.1
2024-05-27 13:04:33 - DEBUG - FileHashStore - tag_object (mp): Removing cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f from reference_locked_cids.

# The third process gets in
2024-05-27 13:04:34 - DEBUG - FileHashStore - tag_object (mp): Locking cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f to to tag pid: arctic-data.6190.1.
2024-05-27 13:04:34 - DEBUG - FileHashStore - tag_object: pid refs file does not exists for pid arctic-data.6190.1 but cid refs file exists at: /home/mok/testing/knbvm_python/refs/cids/a8/2e/b0/67ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f  for cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f
# We only write the pid refs file since cid refs exists
2024-05-27 13:04:34 - DEBUG - FileHashStore - _write_refs_file: Writing id (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) into a tmp file in: /home/mok/testing/knbvm_python/refs/tmp
2024-05-27 13:04:34 - DEBUG - FileHashStore - _verify_hashstore_references: verifying pid (arctic-data.6190.1) and cid (a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f) refs files. Additional Note: Pid refs file doesn't exist, but cid refs exists.
2024-05-27 13:04:34 - INFO - FileHashStore - tag_object: Successfully updated cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f with pid: arctic-data.6190.1
2024-05-27 13:04:34 - DEBUG - FileHashStore - tag_object (mp): Removing cid: a82eb067ca0959f287d95eb4b43de10a6fd8f003859533cbcb01860005ac6a8f from reference_locked_cids.
doulikecookiedough commented 1 month ago

This issue has been resolved via BugFix-97: tag_object cid refs file missing pids.

The solution involves using a threading/multiprocessing Condition, which has a built-in lock (or you can pass one to it), and coordinates all thread/processes - similar to Java's synchronized block.

Also note - pytests now run almost 50% faster (from ~14s to ~8s locally). Testing is ongoing.