aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

implementation for archiving a pack file #138

Open zhubonan opened 2 years ago

zhubonan commented 2 years ago

Replaces #133

Archived pack files are essentially ZIP archives. Reading from these files is still supported through the offset/length stored in the sqlite database. Because archived packs will never be written to again, they can be stored on different file systems or networked locations. Using ZIP archives also allows the data to be recovered if the sqlite database is damaged.
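As a rough illustration of offset/length access into a ZIP archive, the sketch below builds a small in-memory ZIP with the standard library, works out where a member's compressed bytes start, and decompresses that slice directly. The member name `object-0` and the in-memory buffer are assumptions made for the example; this is not the code from this PR.

```python
import io
import struct
import zipfile
import zlib

payload = b"example object content " * 20

# Build a small in-memory ZIP with one DEFLATE-compressed member
# (the member name "object-0" is purely illustrative).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("object-0", payload)

with zipfile.ZipFile(buffer) as archive:
    info = archive.getinfo("object-0")

raw = buffer.getvalue()

# A ZIP local file header has 30 fixed bytes, followed by the file name and
# the extra field; their two lengths are stored at bytes 26-29 of the header.
header_start = info.header_offset
name_len, extra_len = struct.unpack("<HH", raw[header_start + 26 : header_start + 30])
data_offset = header_start + 30 + name_len + extra_len
data_length = info.compress_size

# With (offset, length) known, e.g. stored in a database, the stream can be
# read back without going through the zipfile module at all.
stream = raw[data_offset : data_offset + data_length]
restored = zlib.decompress(stream, -15)  # -15: raw DEFLATE, no zlib header
```

Because the central directory of the ZIP still lists every member, the same data stays reachable through `zipfile` even if the stored offsets are lost.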

The main differences between an archived pack and a normal pack are:

  1. An archived pack file is always compressed.
  2. Compression is done by DEFLATE, but the stream is slightly different from that of a normal pack file. In a normal pack file the compressed streams contain zlib's header and trailer (WBITS=15, the default), whereas in a ZIP file the streams are "raw" and contain no header or trailer (WBITS=-15).
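The difference between the two stream flavours can be seen with Python's `zlib` module alone. This is a minimal sketch with made-up sample data, not code from this PR:

```python
import zlib

data = b"disk-objectstore " * 50

# wbits=15 (the default): zlib container, i.e. a 2-byte header, the DEFLATE
# data, and a 4-byte Adler-32 checksum as trailer.
zlib_stream = zlib.compress(data)

# wbits=-15: raw DEFLATE without header or trailer, as stored in a ZIP member.
compressor = zlib.compressobj(wbits=-15)
raw_stream = compressor.compress(data) + compressor.flush()

# Each stream must be decompressed with the matching wbits value.
from_zlib = zlib.decompress(zlib_stream, 15)
from_raw = zlib.decompress(raw_stream, -15)

# Stripping the zlib header and trailer leaves a valid raw DEFLATE stream.
from_stripped = zlib.decompress(zlib_stream[2:-4], -15)

# Reading a raw stream as if it had a zlib header fails.
try:
    zlib.decompress(raw_stream, 15)
    mismatch_failed = False
except zlib.error:
    mismatch_failed = True
```

This is why a reader has to know whether a given offset/length points into a normal pack or an archived one before choosing the decompressor settings.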

Creating an archive is a slow process and should be carried out while the container is not active (similar to repacking). However, I think it should still be possible as long as the pack file being archived is not being written to at the same time.

A new table is needed in the sqlite database to store the status of the pack file, with two extra columns: state and location. The former would be changed to Archived if the pack is archived. The latter stores any explicit location of the archived pack file.
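A minimal sketch of what such a table could look like, using the standard-library `sqlite3` module. The table name `pack_status`, the `Active` default state, and the example path are assumptions made for illustration; only the `state` and `location` columns and the `Archived` value come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per pack, with the two extra columns.
conn.execute(
    """
    CREATE TABLE pack_status (
        pack_id  INTEGER PRIMARY KEY,
        state    TEXT NOT NULL DEFAULT 'Active',
        location TEXT
    )
    """
)
conn.execute("INSERT INTO pack_status (pack_id) VALUES (0)")
conn.execute("INSERT INTO pack_status (pack_id) VALUES (1)")

# Mark pack 0 as archived and record an explicit (made-up) location.
conn.execute(
    "UPDATE pack_status SET state = 'Archived', location = ? WHERE pack_id = ?",
    ("/mnt/archive/0.zip", 0),
)
conn.commit()

rows = conn.execute(
    "SELECT pack_id, state, location FROM pack_status ORDER BY pack_id"
).fetchall()
conn.close()
```

A `NULL` location is used here to mean "in the default place", which keeps rows for non-archived packs unchanged.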

A CLI is provided to list archive files and update their locations.

codecov[bot] commented 2 years ago

Codecov Report

Merging #138 (bc62f1e) into develop (16e6ff9) will decrease coverage by 2.44%. The diff coverage is 86.72%.

:exclamation: Current head bc62f1e differs from pull request most recent head c7af5af. Consider uploading reports for the commit c7af5af to get more accurate results

@@             Coverage Diff             @@
##           develop     #138      +/-   ##
===========================================
- Coverage    99.52%   97.07%   -2.45%     
===========================================
  Files            8        8              
  Lines         1676     1881     +205     
===========================================
+ Hits          1668     1826     +158     
- Misses           8       55      +47     
| Impacted Files | Coverage Δ | |
|---|---|---|
| disk_objectstore/cli.py | 83.96% <56.75%> (-14.59%) | :arrow_down: |
| disk_objectstore/utils.py | 96.51% <91.66%> (-3.09%) | :arrow_down: |
| disk_objectstore/container.py | 97.95% <92.76%> (-1.45%) | :arrow_down: |
| disk_objectstore/database.py | 100.00% <100.00%> (ø) | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 16e6ff9...c7af5af.

zhubonan commented 2 years ago

Hi @chrisjsewell @giovannipizzi, could you please take a look at this?

Some problems still to be solved:

  1. The sqlite database of existing containers needs to be migrated. I guess this should be done with alembic?
  2. The tests on Windows fail with errors when trying to delete a file that is still open (the sqlite database file). Is there any obvious thing that I can try to track it down?
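For reference on the second point: Windows refuses to delete a file while a handle to it is still open, and the `sqlite3` connection context manager only commits or rolls back; it does not close the connection. The sketch below uses the standard-library `sqlite3` module with a made-up file name, so it only illustrates the general pattern and is an assumption about the cause, not a diagnosis of the failing tests.

```python
import contextlib
import os
import sqlite3
import tempfile

tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "example.db")  # hypothetical file name

# contextlib.closing() guarantees conn.close() on exit; a bare
# `with sqlite3.connect(...)` would leave the file handle open.
with contextlib.closing(sqlite3.connect(db_path)) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()

# With every connection closed, the file can be removed on any platform.
os.remove(db_path)
os.rmdir(tmpdir)
```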