EMS-TU-Ilmenau / chefkoch

A compute cluster cuisine for distributed scientific computing in python
Apache License 2.0

Investigate on best-practices regarding tarballs and temporary files in python #53

Open ChristophWWagner opened 4 years ago

ChristophWWagner commented 4 years ago

In chefkoch we would like to support calling shell scripts, too, and these often work on multiple files or directories. For a variety of reasons it is desirable to handle these situations with containers:

Tarballs are a de facto standard on Unix. They provide no compression (unless combined with gzip to produce the infamous .tar.gz), but retain user permissions and, with suitable tools and archive formats, even ACLs. The lack of compression is actually a feature when it comes to performance. Together with the wide support for tar, this makes it the natural choice of container format. However, feel free to suggest alternatives if you come across one.
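
To make the difference between a plain and a gzip-compressed tarball concrete, here is a minimal sketch using the standard-library tarfile module. The function name pack_tree and its parameters are made up for this example and are not part of chefkoch.

```python
import os
import tarfile


def pack_tree(directory, archive, compress=False):
    """Pack `directory` recursively into `archive` and return the archive size.

    pack_tree is a hypothetical helper, not an existing chefkoch function.
    """
    # "w" writes a plain tar file, "w:gz" pipes it through gzip.
    mode = "w:gz" if compress else "w"
    top = os.path.basename(os.path.normpath(directory))
    with tarfile.open(archive, mode) as tar:
        # add() descends into directories on its own.
        tar.add(directory, arcname=top)
    return os.path.getsize(archive)
```

Packing the same tree once with compress=False and once with compress=True gives a quick feeling for the size versus speed trade-off on real data.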

This issue shall:

Show example functions for

pegro commented 4 years ago

By containers you mean archive file formats like tar and zip, not things like Docker containers, right?

How long would these containers live? Are they meant for assembling the results for long-term storage? Or for transferring files over a network? Or are they meant for short-lived interactions between successive local procedures? In that last case, constant packing, deleting and unpacking might create a lot of overhead.

makley273 commented 4 years ago

We could easily implement those functions using the tarfile module (Lib/tarfile.py), so we would not need any dependencies beyond the standard library.

Opening or creating a tarball is possible through tarfile.open. Unfortunately, adding members is a separate step via TarFile.add, and each call accepts only one path (a directory is added recursively by default).

Extracting can be done all at once with extractall(destination).

But we could surely implement our own pack, unpack and test functions as a tiny wrapper. I've already written pack and unpack, but testing for consistency is more difficult.

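import os.path
import tarfile as tf


def pack(filename, *files):
    # Mode "a" creates the archive if it is missing and appends otherwise.
    with tf.open(filename, "a") as tar:
        for file in files:
            # Read from the given path, but store only the base name.
            tar.add(file, arcname=os.path.basename(file))


def unpack(filename, destination):
    # The with-statement closes the archive; no explicit close() is needed.
    with tf.open(filename, "r") as tar:
        tar.extractall(destination)

makley273 commented 4 years ago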

It seems testing tarballs isn't that easy. With the built-in Linux tools you can only verify a tar archive while creating it. I didn't find any Python libraries able to validate tar archives, but we could implement our own function.

My approach would be a function comparing the sizes of the archived files plus their headers with the size of the archive itself. I will try to investigate it.
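
One way such a size comparison could look is sketched below. This is an illustration of the idea, not the function mentioned in this thread; the names expected_min_size and looks_complete are invented. It assumes an uncompressed archive, uses the offset_data attribute that tarfile records for each member, and ignores sparse files.

```python
import os
import tarfile

BLOCK = tarfile.BLOCKSIZE  # tar stores everything in 512-byte blocks


def _round_up(n, multiple):
    """Round n up to the next multiple of `multiple`."""
    return -(-n // multiple) * multiple


def expected_min_size(path):
    """Smallest plausible size of the uncompressed tar archive at `path`."""
    end = 0
    with tarfile.open(path, "r:") as tar:  # "r:" refuses compressed input
        for member in tar:
            # Assumption: only regular files carry payload blocks.
            payload = member.size if member.isreg() else 0
            end = max(end, member.offset_data + _round_up(payload, BLOCK))
    # A complete archive ends with two zero-filled blocks.
    return end + 2 * BLOCK


def looks_complete(path):
    """Return False if the archive is unreadable or shorter than expected."""
    try:
        return os.path.getsize(path) >= expected_min_size(path)
    except tarfile.ReadError:
        return False
```

A check like this only catches truncation. It says nothing about flipped bits inside the payload, which would need stored checksums of the member contents.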

makley273 commented 4 years ago

I've got two functions for testing tar archives. Does anyone have some good and faulty example archives to supply, please?

ChristophWWagner commented 4 years ago

Great! The purpose of using tar files is to have a means of "packing" a full directory structure into a single blob of data, such that we can handle it just like any other data object/file within the fridge. This makes handling tarballs somewhat specific to the classes Fridge, StepShell, Resource and Item in the following ways:

Resource must be able to create a tarball from the given resource directory. The created tarball will then be treated as the data object of that item and be hashed. Based on that data hash, the Resource item will be added to the fridge.

StepShell must be able to extract a tarball from an Item object that contains one (this should only be Resource objects, if I am not forgetting something) to a temporary directory, where the given command is executed. After execution, the designated output objects shall be added to the fridge by creating Item objects. In addition, StepShell shall create a tarball from the contents of the temporary directory and add it to the fridge for this particular run, so that build leftovers and the like can be inspected later.

Fridge shall be able to verify the integrity of such a tarball. However, I would guess that this is not necessary now and is a nice-to-have feature for later.

Item shall be able to report the contents of the tarball for easy inspection from the chefkoch CLI. However, this, too, is a nice-to-have feature that may be added later on.
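
The StepShell workflow described above (unpack to a temporary directory, run the command there, archive the leftovers) could be sketched roughly as follows. The function run_in_scratch and its parameters are hypothetical, and the Fridge and Item bookkeeping is left out entirely.

```python
import subprocess
import tarfile
import tempfile


def run_in_scratch(input_tar, command, leftovers_tar):
    """Run `command` inside a scratch copy of `input_tar`.

    Hypothetical helper: unpacks the archive into a temporary directory,
    executes the command there and packs the whole directory afterwards.
    `leftovers_tar` must live outside the scratch directory.
    """
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(input_tar, "r") as tar:
            # Assumption: the archive was created by us and is trusted.
            tar.extractall(workdir)
        result = subprocess.run(command, cwd=workdir)
        with tarfile.open(leftovers_tar, "w") as tar:
            # arcname="." keeps the temporary path out of the archive.
            tar.add(workdir, arcname=".")
    # The scratch directory is removed when the with-block exits.
    return result.returncode
```

Because TemporaryDirectory cleans up after itself, the only thing that survives a run is the leftovers archive, which would then be handed to the fridge.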

From the current point of view I'd proceed with creating a TarBall class that supports these cases and add it to the module in which either the Fridge or the Items reside.
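
As a starting point for discussion, such a TarBall class might look like the sketch below. All method names are assumptions made for this example; nothing here reflects an existing chefkoch interface.

```python
import hashlib
import tarfile


class TarBall:
    """Hypothetical thin wrapper around one uncompressed tar file."""

    def __init__(self, path):
        self.path = path

    @classmethod
    def from_directory(cls, directory, path):
        """Pack `directory` recursively into a new archive at `path`."""
        with tarfile.open(path, "w") as tar:
            tar.add(directory, arcname=".")
        return cls(path)

    def extract_to(self, destination):
        """Unpack all members below `destination` (trusted archives only)."""
        with tarfile.open(self.path, "r") as tar:
            tar.extractall(destination)

    def contents(self):
        """Return (name, size) pairs for every member, e.g. for a CLI listing."""
        with tarfile.open(self.path, "r") as tar:
            return [(m.name, m.size) for m in tar.getmembers()]

    def sha256(self, chunk_size=1 << 20):
        """Hash the archive file itself in chunks."""
        digest = hashlib.sha256()
        with open(self.path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()
```

One caveat for hashing: tar headers record modification times and ownership, so packing identical file contents twice can yield different hashes. TarFile.add accepts a filter callable that could be used to normalize those fields if reproducible hashes are wanted.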

makley273 commented 4 years ago

The integrity test should be possible. I wrote a function comparing the file size of the archive with the size of all items plus their headers. This makes it possible to detect data that has gone missing because of a damaged archive. But for now I'm not 100% sure that it works with all file types and all tar versions, because I don't have enough test archives.

I would be grateful if anyone could provide some. Otherwise I will later create some archives of every encoding containing every file type.
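
Test archives of this kind can also be generated with tarfile itself. The sketch below writes one intact and one deliberately truncated archive per tar format that the module supports. The helper names are invented for this example, and resetting the ownership fields is an assumption made so that the archives come out the same on every machine.

```python
import os
import tarfile

FORMATS = {
    "ustar": tarfile.USTAR_FORMAT,
    "gnu": tarfile.GNU_FORMAT,
    "pax": tarfile.PAX_FORMAT,
}


def _anonymize(info):
    """Drop ownership so the fixtures do not depend on the local user."""
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def make_test_archives(source, out_dir):
    """Return {archive path: True if intact, False if damaged}."""
    os.makedirs(out_dir, exist_ok=True)
    expected = {}
    for label, fmt in FORMATS.items():
        good = os.path.join(out_dir, "good_%s.tar" % label)
        with tarfile.open(good, "w", format=fmt) as tar:
            tar.add(source, arcname=os.path.basename(source),
                    filter=_anonymize)
        expected[good] = True

        # Faulty variant: cut right after the first 512-byte block.
        bad = os.path.join(out_dir, "truncated_%s.tar" % label)
        with open(good, "rb") as src, open(bad, "wb") as dst:
            dst.write(src.read(tarfile.BLOCKSIZE + 1))
        expected[bad] = False
    return expected
```

The returned mapping can be fed straight into an integrity check to see which damaged archives it actually flags. Other kinds of damage, such as overwritten blocks in the middle, would need additional fixtures.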