Closed zhubonan closed 2 years ago
Thanks @zhubonan ! Is there a reason to limit the filename to 16 characters only?
It is just for space saving, since the file name has to be included in both the central directory and the local header. Including the full hash (taking 64 bytes as ASCII) is not strictly necessary, since it can always be recomputed from the content. We could check for clashes when sealing the data, perhaps.
For 10 million objects the cost would be ~600MB (times two = ~1.2GB) with the full hash. So with 16 bytes we spare ~900MB. I'm tempted to say it's OK not to spare this space and have simpler code instead? (Also, 10 million objects would most probably fill many packs, so the 900MB would most probably be a small % of the total size in most cases.)
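A back-of-the-envelope check of the numbers above (assuming 10 million objects, 64-character hex hashes, and the name stored twice per entry, once in the local header and once in the central directory):

```python
# Rough estimate of the filename overhead discussed above.
n_objects = 10_000_000

full_hash = n_objects * 64 * 2      # full hex digest as member name
truncated = n_objects * 16 * 2      # first 16 characters only

print(f"full hash: {full_hash / 1e9:.2f} GB")                 # ~1.28 GB
print(f"truncated: {truncated / 1e6:.0f} MB")                 # ~320 MB
print(f"saved:     {(full_hash - truncated) / 1e6:.0f} MB")   # ~960 MB
```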
Hmm I see, in that case I will just go with the full hash. The part that gets simplified is that we don't have to check for duplicates when assigning file names during the sealing process. When updating, the filename length still has to be verified, otherwise there is a risk of overwriting the actual data.
Another added benefit of full-hash filenames is that a sealed pack can be loosened by simply decompressing it and copying the files to the loose object folder. This might be useful when one wants to move data from one repository to another.
But in this case you need to store the hash key sharded (e.g. 01/839ab302...), depending on the settings of the container. (The default is 1 level of length 2, but it's configurable.)
I don't know whether in total this would increase or decrease the size of the ZIP? Filenames are 2 bytes shorter (I don't know if only the actual filename or the full path has to be put in front of every file?), but now we also have to store which folder each file is in, so probably this makes things even more expensive?
Oh I see, my bad. I thought that the loose files were stored in a flat directory 😓 With a sharded directory, one should in principle be able to write a simple bash script to put each file in the correct folder with the correct name. Maybe it is a bit of a stretch as intended usage though.
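The loosening step sketched above could look like the following, assuming members are named by their full hex hash and the default sharding of one level of length 2 (so hash `01839ab...` lands in `01/839ab...`); the function name is hypothetical:

```python
import os
import zipfile

def loosen_pack(pack_path: str, loose_dir: str, shard_len: int = 2) -> None:
    """Extract every member of a sealed pack (a ZIP archive) into a
    sharded loose-object folder. Sketch only: assumes member names
    are full hex hashes and one sharding level."""
    with zipfile.ZipFile(pack_path) as zf:
        for name in zf.namelist():
            shard = os.path.join(loose_dir, name[:shard_len])
            os.makedirs(shard, exist_ok=True)
            target = os.path.join(shard, name[shard_len:])
            with zf.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
```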
Merging #133 (3ccd60a) into develop (16e6ff9) will decrease coverage by 3.13%. The diff coverage is 77.20%.
@@ Coverage Diff @@
## develop #133 +/- ##
===========================================
- Coverage 99.52% 96.38% -3.14%
===========================================
Files 8 9 +1
Lines 1676 1936 +260
===========================================
+ Hits 1668 1866 +198
- Misses 8 70 +62
Impacted Files | Coverage Δ |
---|---|
disk_objectstore/zipsupport.py | 61.06% <61.06%> (ø) |
disk_objectstore/container.py | 97.66% <85.59%> (-1.74%) :arrow_down: |
disk_objectstore/utils.py | 99.43% <96.66%> (-0.17%) :arrow_down: |
disk_objectstore/database.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 16e6ff9...3ccd60a. Read the comment docs.
Just to recap what we have discussed previously: sealing would be treated as a slow operation, similar to repack. I will try to see if we can use the zipfile stdlib to directly construct a ZIP file and have the offset/size/length of each entry recorded in the sqlite database. This way the code can be simplified a lot and will be easier to maintain.
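A minimal sketch of the idea: the stdlib `zipfile` module already exposes each entry's local-header offset and sizes via `ZipInfo`, so the rows for the sqlite index can be collected while writing. The function name and row layout are assumptions, and a real sealing step would rewrite an existing pack in place rather than create a fresh archive:

```python
import zipfile

def seal_to_zip(pack_path, objects):
    """Write (name, bytes) pairs into a ZIP pack and return
    (name, header_offset, compressed_size, uncompressed_size) rows,
    ready to be stored in the sqlite index. Sketch only."""
    with zipfile.ZipFile(pack_path, "w", zipfile.ZIP_STORED) as zf:
        for name, data in objects:
            zf.writestr(name, data)
        # header_offset points at the 30-byte local file header; the
        # payload follows the header plus the filename/extra fields.
        return [
            (i.filename, i.header_offset, i.compress_size, i.file_size)
            for i in zf.infolist()
        ]
```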
I also did some investigation on my repository. It is true that most of the objects are small files: the median size is only 2 kbytes! But the small files also take a relatively small portion of the total size, despite their large number. So an overhead of about 100-200 bytes per object is a lot for the small objects, but the overall impact on the total size is much smaller. My repo is 80GB on disk, and the estimated overhead based on the total number of objects is about 100-200MB. So I think that having archived pack files would still have minimal overall impact on the total size on disk.
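A quick sanity check of the estimate above; the object count is an assumption chosen to be consistent with the 100-200MB figure quoted:

```python
# ~100-200 bytes of header overhead per object, and a total overhead of
# ~100-200 MB, implies roughly a million objects in the 80GB repository.
repo_size_mb = 80_000
overhead_per_object = 150          # bytes, midpoint of 100-200 (assumed)
n_objects = 1_000_000              # assumed, consistent with the totals

total_overhead_mb = n_objects * overhead_per_object / 1e6
print(f"overhead: {total_overhead_mb:.0f} MB "
      f"({total_overhead_mb / repo_size_mb:.2%} of the repo)")
```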
close in favour of #138
This PR addresses #124 and potentially also #123.
The PR is still a work in progress. My current aim is to have both issues addressed here.
Brief summary
When writing to pack files, a ZIP-style local header is written before each record. To seal a pack, its integrity is checked, the local headers are updated with checksums, and a central directory is appended to the end. This operation does not block any concurrent read access. A new table is introduced to store the status of each pack file. Technically, writing to a sealed pack works just fine, and the pack can be resealed, but this would waste some space. Hence the table is consulted, and sealed packs are not selected for writing.
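For reference, the local header written before each record is the fixed 30-byte structure from the ZIP specification (APPNOTE section 4.3.7) followed by the filename. A minimal sketch for an uncompressed (stored) record, with the CRC filled in as it would be at sealing time:

```python
import struct
import zlib

def local_file_header(name: bytes, data: bytes) -> bytes:
    """Build a minimal ZIP local file header for a stored record.
    Sketch only: mod time/date are left zeroed."""
    return struct.pack(
        "<IHHHHHIIIHH",
        0x04034B50,          # signature 'PK\x03\x04'
        20,                  # version needed to extract (2.0)
        0,                   # general purpose bit flags
        0,                   # compression method: 0 = stored
        0, 0,                # DOS mod time / mod date
        zlib.crc32(data),    # CRC-32, updated at sealing time
        len(data),           # compressed size (== raw size when stored)
        len(data),           # uncompressed size
        len(name),           # filename length
        0,                   # extra field length
    ) + name

header = local_file_header(b"0" * 16, b"payload")
```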
After sealing, the pack file becomes a ZIP archive. Each record is a file named by the first 16 characters of its hash value (there can be duplicate names, but the chance is low given the finite size of each sealed pack). The raw data may be accessed by simply extracting the archive with any ZIP-compatible tool.
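The naming scheme and the "low chance of duplicates" claim can be sketched as follows; sha256 and the per-pack record count are assumptions for illustration:

```python
import hashlib

def member_name(data: bytes, length: int = 16) -> str:
    """Name a record by the first `length` hex characters of its
    hash digest (sketch; sha256 is an assumed choice here)."""
    return hashlib.sha256(data).hexdigest()[:length]

# Birthday bound: 16 hex chars give 16**16 ~ 1.8e19 possible names, so
# the collision probability for n records is roughly n**2 / (2 * 16**16).
n = 100_000                       # records in one sealed pack (assumed)
p = n ** 2 / (2 * 16 ** 16)
print(member_name(b"hello"), f"collision probability ~ {p:.1e}")
```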
Todo