aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

Allow sealing pack files in a way that turns them into valid ZIP archives #133

Closed zhubonan closed 2 years ago

zhubonan commented 2 years ago

This PR addresses #124 and potentially also #123.

The PR is still a work in progress. My current aim is to have both issues addressed here.

Brief summary

When writing to pack files, a ZIP-style local header is written before each record. To seal a pack, its integrity is checked, the local headers are updated with checksums, and a central directory is appended to the end. This operation does not block any concurrent read access. A new table is introduced to store the status of each pack file. Technically, writing to a sealed pack works just fine, and the pack can still be resealed, but this would waste some space. Hence the table is consulted and sealed packs are not selected for writing.
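To make the idea concrete, here is a minimal sketch of writing a record preceded by a ZIP local file header (this is not the PR's actual code; the `write_record` helper and its details are illustrative):

```python
import struct
import zlib

# ZIP local file header (fixed 30-byte part, per the PKZIP appnote):
# signature, version needed, flags, method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER_FMT = "<4s5H3I2H"
LOCAL_HEADER_SIG = b"PK\x03\x04"

def write_record(pack, name: bytes, data: bytes) -> int:
    """Append one record preceded by a ZIP-style local header and
    return the header offset (to be stored in the SQLite index)."""
    offset = pack.tell()
    header = struct.pack(
        LOCAL_HEADER_FMT,
        LOCAL_HEADER_SIG,
        20,                # version needed to extract (2.0)
        0,                 # general-purpose bit flags
        0,                 # compression method 0 = stored (no compression)
        0, 0,              # DOS mod time / date (dummy values)
        zlib.crc32(data),  # CRC-32; could instead be patched in at sealing time
        len(data),         # compressed size (== raw size for "stored")
        len(data),         # uncompressed size
        len(name),         # filename length
        0,                 # extra-field length
    )
    pack.write(header + name + data)
    return offset
```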

After sealing, the pack file becomes a ZIP archive. Each record is a file named by the first 16 characters of its hash value (duplicate names are possible, but the chance is low given the finite size of each sealed pack). The raw data can be accessed by simply extracting the archive with any ZIP-compatible tool.

Todo

giovannipizzi commented 2 years ago

Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

zhubonan commented 2 years ago

> Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?

It is just for space saving, since the file name has to be included in both the central directory and the local header. Including the full hash (taking 64 bytes as ASCII) is not really necessary, since it can be computed from the content. Perhaps we could check for clashes when sealing the data.

giovannipizzi commented 2 years ago

For 10 million objects the cost would be ~600MB (times two = ~1.2GB) with the full hash. So with 16 bytes we save ~900MB. I'm tempted to say it's OK not to save this space if it gives us simpler code? (Also, 10 million objects would most probably fill many packs, so the 900MB would most probably be a small % of the total size in most cases.)
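For concreteness, the back-of-envelope arithmetic behind these numbers:

```python
# Filename overhead for 10 million objects; each name is stored twice,
# once in the local header and once in the central directory.
n = 10_000_000
full = n * 64 * 2    # full hex hash, 64 ASCII bytes
short = n * 16 * 2   # 16-character prefix
print(full / 1e9, short / 1e9, (full - short) / 1e9)
# -> 1.28 GB vs 0.32 GB: the ~1.2 GB cost and ~900 MB savings quoted above
```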

zhubonan commented 2 years ago

Hmm I see, in that case I will just go with the full hash. The part that gets simplified is that we don't have to check for duplicates when assigning file names during the sealing process. When updating the headers, the filename length still has to be verified; otherwise there is a risk of overwriting the actual data.
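A sketch of what such an in-place update could look like (illustrative only, reusing the 30-byte local header layout sketched above; `pack` is a file opened in `r+b` mode):

```python
import struct

LOCAL_HEADER_FMT = "<4s5H3I2H"
HEADER_SIZE = struct.calcsize(LOCAL_HEADER_FMT)  # 30 bytes

def update_local_header(pack, offset: int, name: bytes, crc: int,
                        csize: int, usize: int) -> None:
    """Patch the CRC/size fields of one local header in place."""
    pack.seek(offset)
    fields = list(struct.unpack(LOCAL_HEADER_FMT, pack.read(HEADER_SIZE)))
    assert fields[0] == b"PK\x03\x04", "not positioned on a local header"
    # The stored filename length must match the name we rewrite, otherwise
    # the write would spill over into the record data that follows.
    assert fields[9] == len(name), "filename length mismatch"
    fields[6:9] = [crc, csize, usize]
    pack.seek(offset)
    pack.write(struct.pack(LOCAL_HEADER_FMT, *fields) + name)
```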

Another added benefit of full-hash filenames is that a sealed pack can be "loosened" by simply decompressing it and copying the files into the loose object folder. This might be useful when one wants to move data from one repository to another.

giovannipizzi commented 2 years ago

> Another added benefit of full-hash filenames is that a sealed pack can be "loosened" by simply decompressing it and copying the files into the loose object folder. This might be useful when one wants to move data from one repository to another.

But in this case you need to store the hashkey sharded (e.g. 01/839ab302...), depending on the settings of the container. (The default is 1 level of length 2, but it's configurable).

I don't know whether in total this would increase or decrease the size of the ZIP? Filenames are 2 bytes shorter (I don't know if only the actual filename or the full path has to be put in front of every file?), but now we also have to store which folder each file is in, so this probably makes things even more expensive?

zhubonan commented 2 years ago

Oh I see, my bad. I thought that the loose files were stored in a flat directory 😓 With sharded directories, one should in principle be able to write a simple script to put each file in the correct folder with the correct name. Maybe it is a bit of a stretch as intended usage though.
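Something like the following sketch (the `loosen` helper is hypothetical; it assumes full-hash filenames and the default sharding of one level of length 2 mentioned above):

```python
import zipfile
from pathlib import Path

def loosen(pack_path: str, loose_dir: str) -> None:
    """Extract a sealed pack into the sharded loose-object layout,
    e.g. 0183... -> loose_dir/01/83..."""
    with zipfile.ZipFile(pack_path) as zf:
        for name in zf.namelist():
            target = Path(loose_dir) / name[:2] / name[2:]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(name))
```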

codecov[bot] commented 2 years ago

Codecov Report

Merging #133 (3ccd60a) into develop (16e6ff9) will decrease coverage by 3.13%. The diff coverage is 77.20%.

@@             Coverage Diff             @@
##           develop     #133      +/-   ##
===========================================
- Coverage    99.52%   96.38%   -3.14%     
===========================================
  Files            8        9       +1     
  Lines         1676     1936     +260     
===========================================
+ Hits          1668     1866     +198     
- Misses           8       70      +62     
Impacted Files                    Coverage Δ
disk_objectstore/zipsupport.py    61.06% <61.06%> (ø)
disk_objectstore/container.py     97.66% <85.59%> (-1.74%) ↓
disk_objectstore/utils.py         99.43% <96.66%> (-0.17%) ↓
disk_objectstore/database.py     100.00% <100.00%> (ø)

Continue to review full report at Codecov.


zhubonan commented 2 years ago

Just to recap what we discussed previously: sealing will be treated as a slow operation, similar to repacking. I will try to see if we can use the zipfile stdlib to directly construct a zip file and have the offset/size/length of each entry recorded in the SQLite database. This way the code can be simplified a lot and will be easier to maintain.
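For illustration, a hypothetical sketch of how the stdlib could provide those numbers (the file name and what exactly gets inserted into SQLite are assumptions, not the PR's code):

```python
import zipfile

# Build the pack with the stdlib, then read each entry's bookkeeping
# from its ZipInfo; these rows would go into the SQLite index.
with zipfile.ZipFile("pack0.zip", "w") as zf:
    zf.writestr("ab" * 32, b"object content")  # full 64-character hash as name
    for info in zf.infolist():
        # Data starts after the fixed 30-byte header, the name and the extra field.
        data_offset = info.header_offset + 30 + len(info.filename) + len(info.extra)
        row = (info.filename, data_offset, info.compress_size, info.file_size)
        # ... INSERT row into the SQLite table here
```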

I also did some investigation on my repository. It is true that most of the objects are small files; the median size is only 2 kB! But the small files also take up a relatively small portion of the total size, despite their large number. So an overhead of about 100-200 bytes per object is a lot for a small object, but the overall impact on the total size is much smaller. My repo is 80GB on disk, and the estimated overhead, based on the total number of objects, is about 100-200MB. So I think that having archived pack files would still have minimal overall impact on the total size on disk.
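A rough sanity check of these numbers (the object count is an assumption inferred from the quoted estimate, not measured):

```python
# Order-of-magnitude check of the per-object overhead on an 80 GB repo.
n_objects = 1_000_000   # implied by ~100-200 MB at 100-200 bytes/object
overhead_bytes = 150 * n_objects  # middle of the 100-200 byte range
print(overhead_bytes / 1e6, "MB;", 100 * overhead_bytes / 80e9, "% of 80 GB")
# -> 150.0 MB; ~0.19 % of 80 GB
```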

zhubonan commented 2 years ago

Closing in favour of #138.