aiidateam / disk-objectstore

An implementation of an efficient "object store" (actually, a key-value store) writing files on disk and not requiring a running server
https://disk-objectstore.readthedocs.io
MIT License

suggestion: consider using record markers in the packs #124

Open zhubonan opened 2 years ago

zhubonan commented 2 years ago

I have a suggestion for improving the long-term resilience of the pack files: we should consider including record markers in the pack file (e.g. similar to Fortran sequential files).

The main advantage is that this would allow extraction of the data without the SQLite database (the index) at all, for example in the case of a disk storage failure that leaves the latter corrupted.

My understanding is that the current implementation simply concatenates the stream onto the pack file and records the offsets and lengths in the SQLite database. Instead, a slight modification can be made when writing the stream, something like this (a rough sketch follows the list):

  1. Write an N-byte integer marker whose value is the record length.
  2. Write the actual data.
  3. Write the same N-byte integer marker again, indicating the end of the record.
  4. Record the offset and the length of the object, taking the record markers into account.
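For illustration, a minimal sketch of such a write scheme (all names here are hypothetical, not the actual disk-objectstore API; an 8-byte signed marker whose sign flags compression is an assumption):

```python
import struct

MARKER_FMT = '<q'  # 8-byte signed marker; the sign could flag compression

def write_with_markers(pack_handle, data, compressed=False):
    """Append one object framed by leading and trailing record markers.

    Returns (offset, length) of the payload, to be stored in the SQLite
    index exactly as before.
    """
    length = len(data)
    marker = -length if compressed else length
    pack_handle.write(struct.pack(MARKER_FMT, marker))  # step 1: leading marker
    offset = pack_handle.tell()                         # payload starts here
    pack_handle.write(data)                             # step 2: the actual data
    pack_handle.write(struct.pack(MARKER_FMT, marker))  # step 3: trailing marker
    return offset, length                               # step 4: record in the index
```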

In Fortran, N is actually compiler dependent; with N=4 the maximum record length is about 2GB, and there is support for "sub-records", but this is mainly for backward compatibility with F77. One can also use an 8-byte marker so that the maximum size is greatly increased, but this involves some storage overhead. The sign of the integer may be used as a flag for compression, or one can include dedicated bytes, etc.

Now, I realise that a BIG difference here (compared to Fortran WRITE) is that the size of the object may be unknown before writing it, as it can just be streamed when writing data directly to the pack. But we can simply omit step 1 above, or make it a fixed value (for integrity checking), and to reconstruct the index we read the pack file backwards and find each object using the markers.

This would allow reconstruction of the SQLite index purely from the pack file. One first reads the marker, then uses the length to read the content (the object), followed by the ending marker as an (optional) verification. Then the hash can be computed for this object (uncompressing it first if needed), and the offset, size and length can be recorded into the new SQLite database.
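A hedged sketch of that recovery path, assuming the marker layout from the previous snippet and the container's default sha256 hash algorithm (`rebuild_index` is a hypothetical helper, not an existing function):

```python
import hashlib
import struct
import zlib

MARKER_FMT = '<q'
MARKER_SIZE = struct.calcsize(MARKER_FMT)

def rebuild_index(pack_path):
    """Scan a marker-framed pack, yielding (hashkey, offset, length, compressed)."""
    entries = []
    with open(pack_path, 'rb') as fh:
        while True:
            head = fh.read(MARKER_SIZE)
            if len(head) < MARKER_SIZE:
                break  # reached the end of the pack
            (marker,) = struct.unpack(MARKER_FMT, head)
            compressed = marker < 0
            length = abs(marker)
            offset = fh.tell()
            data = fh.read(length)
            tail = fh.read(MARKER_SIZE)
            if tail != head:
                raise ValueError(f'corrupted record at offset {offset}: markers do not match')
            raw = zlib.decompress(data) if compressed else data
            hashkey = hashlib.sha256(raw).hexdigest()
            entries.append((hashkey, offset, length, compressed))
    return entries
```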

I think the only part of the code that would need to be updated is to include the record marker(s) in `_write_data_to_packfile`: https://github.com/aiidateam/disk-objectstore/blob/7a09ea2a953f0b0dfa79a6688306c51a501f874b/disk_objectstore/container.py#L1245-L1251

and also the repack methods.

This approach should maintain full backward compatibility, since the precise sizes, lengths and offsets are still stored in the SQLite database; it is just that the old pack format would not support reconstruction without the SQLite database.


zhubonan commented 2 years ago

Pinging @giovannipizzi @chrisjsewell

zhubonan commented 2 years ago

Summary of the discussion:

Problem with the existing PR #125

Suggestions for the way forward:

Some notes on ZIP format:

It would be ideal if the sealed pack were actually a ZIP file - this would make it very robust and accessible through the widely available software that can read ZIP.

A ZIP file has the following format:

[Diagram of the ZIP file layout: each entry's local header and data, followed by the central directory at the end of the file]

Note that there are both a LOCAL HEADER for each entry and a central directory at the end. When writing the pack file we may include the LOCAL HEADER and a data descriptor, since the size and checksum are not known at the time the header is written. The header size may be something of a concern: the LOCAL and central directory headers for each record can approach 80 bytes each, so we should have some idea of the average size of the files usually stored in AiiDA.

We can write the packs in a ZIP-compatible format without the central directory, which is only added at the end when the pack is "sealed". The latter operation should be cheap, and there is no need to update the offsets in the SQLite database.
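As a rough illustration (not the actual implementation), a local header with zeroed CRC/size fields could be emitted per object roughly like this; `write_local_header` is a hypothetical helper and the flag/version values are assumptions:

```python
import struct

LOCAL_HEADER_SIGNATURE = 0x04034B50  # 'PK\x03\x04'

def write_local_header(pack_handle, filename):
    """Write a minimal ZIP local file header with CRC and sizes left at zero.

    Sketch only: a real implementation needs a ZIP64 extra field for objects
    larger than 4GB and agreed flag/version values. Returns the offset at
    which the object data will start.
    """
    header = struct.pack(
        '<IHHHHHIIIHH',
        LOCAL_HEADER_SIGNATURE,  # local file header signature
        45,                      # version needed to extract (4.5, for ZIP64)
        0x0008,                  # flags, bit 3: sizes/CRC follow in a data descriptor
        0,                       # compression method (0 = stored)
        0, 0,                    # dummy modification time / date
        0,                       # CRC-32, patched in at sealing time
        0,                       # compressed size, patched in at sealing time
        0,                       # uncompressed size, patched in at sealing time
        len(filename),           # filename length (part of the object's hash)
        0,                       # extra field length
    )
    pack_handle.write(header)
    pack_handle.write(filename)  # filename must be bytes
    return pack_handle.tell()
```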

zhubonan commented 2 years ago

Just want to say that I am working on this....

giovannipizzi commented 2 years ago

Cool, thanks! :-) Probably it would be good to already introduce a table of packs, which contains the state of the pack (if missing from the table = current state, editable); for this PR it would contain the state "sealed", and it could also change to "archived" when working on #123. It might be good, after sealing, to also add a column with the full pack hash (same algorithm as the container itself?) that can be used for additional validation.
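A possible shape for such a table, sketched with SQLAlchemy (column and table names here are invented for illustration, not an agreed schema):

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Pack(Base):
    """Hypothetical table of packs: a row would exist only once a pack
    leaves the default (current, editable) state."""
    __tablename__ = 'db_pack'

    id = Column(Integer, primary_key=True)
    pack_id = Column(Integer, nullable=False, unique=True, index=True)
    state = Column(String, nullable=False)    # e.g. 'sealed' or 'archived' (#123)
    hashkey = Column(String, nullable=True)   # full pack hash, filled in at sealing
```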

giovannipizzi commented 2 years ago

It would also be useful to check whether one can avoid adding the local header for each file, which will take a lot of space for many small objects, e.g. by looking into alternative formats. It's not clear to me e.g. whether 7z requires them, see https://py7zr.readthedocs.io/en/latest/archive_format.html - if it does not, it might be a good idea to use that format instead of ZIP? -> We can always add the very first header when creating a new pack, that is easy, but the local headers are a bit more annoying.

zhubonan commented 2 years ago

I had a look at the specification and I think it is a bit more complex than ZIP. 7z does not require the local header because its main advantage is being able to have a single "packed stream", e.g. solid compression to achieve better compression ratios, instead of ZIP compressing each file individually.

Anyhow, I think having the local header is a good idea, as it does add robustness in case the central directory at the end of the file is damaged. There are tools that can try to "repair" damaged ZIP files. At the moment, my rough estimate of the local header is around 82 bytes (30 bytes + 20 bytes for ZIP64 + 32 bytes for the filename, i.e. part of the hash, to be added when sealing). The same information is repeated in the central directory at the end of the file, so if we switch to a format that does not have local headers, the maximum space saving is only about 50%.
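For a sense of scale, a back-of-the-envelope estimate of the relative overhead (the 82-byte figure is the rough estimate above; the average object size is a made-up assumption):

```python
# Illustrative only: real numbers depend on the actual objects in the container.
local_header = 82          # 30 + 20 (ZIP64) + 32 (filename) bytes per object
central_entry = 82         # roughly the same information repeated in the central directory
avg_object_size = 10_000   # hypothetical average object size in bytes

overhead = (local_header + central_entry) / avg_object_size
print(f'{overhead:.1%}')   # ~1.6% for 10 kB objects, but much larger for many tiny objects
```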

zhubonan commented 2 years ago

One thing with ZIP is that the CRC checksum needs to be in the header. Since the CRC is only available after the data is written to the pack, it has to be stored somewhere and reused when the pack is "sealed" - making it a fully functioning ZIP file. One option is to keep it in the data descriptor, an optional 20-byte trailing marker. But I think we can get rid of it and save the space, since the actual marker needs to be updated anyway.

I am wondering how difficult it would be to add a column to the SQLite table Obj. Ideally the CRC could be stored there, but I remember migrating with SQLAlchemy might not be straightforward. If the CRC is not available, we can still get it by reading the data while sealing the pack file, but that can make the process much slower.

Now the header size is 30 bytes (mandatory) + 20 bytes (ZIP64 support for files larger than 4GB) + 16 bytes (filename from part of the hash) = 66 bytes.

Would this be something acceptable in exchange for having pack files compatible with ZIP archives? @giovannipizzi @chrisjsewell

giovannipizzi commented 2 years ago

I don't think we should compute the CRC on the fly or change the SQLite table. I suggest we instead leave the correct number of bytes empty (e.g. set to 0x00 0x00 0x00...) before each section (or use some other temporary marker) that is essentially ignored during normal usage; then, during the sealing procedure, we compute the CRC on the fly together with the other things that are needed, write them in the right place (so no need to rewrite the file), and seal the file. (I tested on my laptop and CRC32 on ~6GB takes less than 3 seconds; I didn't check the cost of seeking and writing a lot of short byte sequences in an existing file, but I imagine it's not impossibly slow - though probably slower than computing the CRC.) Then one just appends the ZIP final central directory and one is done?
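A sketch of that seek-and-patch sealing step for a stored (uncompressed) object, assuming the standard ZIP local header layout where the CRC-32 field sits 14 bytes into the header (`patch_crc` is a hypothetical helper):

```python
import zlib

CRC_FIELD_OFFSET = 14  # position of the CRC-32 field inside the 30-byte local header

def patch_crc(pack_handle, header_offset, data_offset, length, chunk_size=2**20):
    """Compute the CRC-32 of one stored object and write it into its local header.

    `pack_handle` is the pack opened in 'r+b' mode; for compressed objects the
    stream would have to be decompressed while computing the CRC.
    """
    pack_handle.seek(data_offset)
    crc = 0
    remaining = length
    while remaining > 0:
        chunk = pack_handle.read(min(chunk_size, remaining))
        crc = zlib.crc32(chunk, crc)
        remaining -= len(chunk)
    pack_handle.seek(header_offset + CRC_FIELD_OFFSET)
    pack_handle.write(crc.to_bytes(4, 'little'))
    return crc
```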

zhubonan commented 2 years ago

Yes, that would be a good way to do it. One thing to note is that the CRC is calculated for the uncompressed data, so if compression is used the data needs to be uncompressed first. Assuming a throughput of 100MB per second, a 6GB file would take at least one minute to seal.

zhubonan commented 2 years ago

Alternatively, we can just store the CRC computed on the fly trailing the record, e.g. taking 4 bytes of space. During sealing we write it back into the local header. The resulting file should still be ZIP compatible.

giovannipizzi commented 2 years ago

I would be tempted to say that, while sealing, it wouldn't be a bad idea to validate the content by recomputing the hashes (sha256), what do you think? (We already have a function for that.) If we agree on this, then also computing the CRC is a small additional cost? (And we could modify the validate function to also compute the CRC on the fly?)

zhubonan commented 2 years ago

Good point, I agree. Let's have the sealing operation also validate the hash and compute the CRC - both can be done at the same time. The sealing process does not bring any downtime, since the actual contents of the file remain unchanged; hence, other processes may still access it for reading.
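A sketch of what such a combined single-pass validation could look like (an illustrative standalone helper, not the existing validate function; it assumes the object stream is already uncompressed):

```python
import hashlib
import zlib

def validate_and_crc(stream, chunk_size=2**20):
    """Read an (uncompressed) object stream once, computing both the sha256
    hash key used for validation and the CRC-32 needed for the ZIP header."""
    hasher = hashlib.sha256()
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)
        crc = zlib.crc32(chunk, crc)
    return hasher.hexdigest(), crc
```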