Deadwood-ai / deadwood-api

Main FastAPI application for the deadwood backend
GNU General Public License v3.0

Hashing files takes too long to be run on the storage server. #91

Open · JesJehle opened this issue 3 days ago

JesJehle commented 3 days ago

I am currently struggling to make the computation of the hash faster. For context: we want to use the hash of the file to quickly identify duplicates and to look up (maybe already during upload) whether the file already exists in the db. Currently, the hash computation is part of the upload route and runs at the end of the chunked upload: once the last chunk is uploaded, the hash is computed. I want to simplify the upload so that I don't have to run long async background tasks at the end. The current upload mechanism uploads the file in chunks; every chunk is appended directly to the final file, with no tmp files and no resource-intensive background tasks. But I am struggling with the hash, since for large files (> 20 GB) I will hit a timeout for sure. This holds for all the optimizations I tried (faster hashing algorithms like blake3 or xxhash, computing in chunks).
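
For reference, this is roughly what the end-of-upload hashing step looks like (a minimal sketch, not the actual route code; chunk size and function name are illustrative). Even with chunked reads and a faster algorithm, it has to touch every byte of the finished file, which is where the timeout for > 20 GB files comes from:

import hashlib
from pathlib import Path

def full_file_hash(file_path: Path, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hash the complete file in chunks; runtime grows linearly with file size."""
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as f:
        # Read the finished upload back chunk by chunk and feed it to the hasher
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()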

I think we have 3 options:

@cmosig @mmaelicke what are your thoughts?

JesJehle commented 3 days ago

This would be an example of a quick file identifier:

from pathlib import Path

from xxhash import xxh64

def quick_file_id(file_path: Path, sample_size: int = 1024 * 1024) -> str:
    """Generate a quick file identifier by sampling the start and end of the file."""
    file_size = file_path.stat().st_size
    hasher = xxh64()

    with open(file_path, 'rb') as f:
        # Mix in the file size so files with identical samples but different lengths differ
        hasher.update(str(file_size).encode())

        # Hash the first sample (1 MB by default)
        hasher.update(f.read(sample_size))

        # Hash the last sample (overlaps with the first one for small files)
        f.seek(-min(sample_size, file_size), 2)
        hasher.update(f.read(sample_size))

    return hasher.hexdigest()
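
For illustration, a possible call site at the end of the upload route (the path is hypothetical):

# After the last chunk has been written to the final file:
file_id = quick_file_id(Path('/data/archive/ortho_upload.tif'))
# 16-character hex digest from xxh64, which could be stored in the db for duplicate lookups
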
mmaelicke commented 3 days ago

I think hashing only 2 MB is the better option, as it is less complicated than moving the SHA calculation to another server. There would again be a lot of things that could go wrong.

I am pretty sure that the first and last MB should identify duplicates. There should be no need to hash all 20 GB.

cmosig commented 3 days ago

You could maybe hash the metadata of the GeoTIFF for duplicate detection. It could be a dirty solution that works fine in 98% of cases.
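
A rough sketch of what that could look like, assuming rasterio is available server-side (it is not part of the current code, and the choice of header fields is illustrative):

import hashlib
from pathlib import Path

import rasterio

def geotiff_metadata_id(file_path: Path) -> str:
    """Fingerprint a GeoTIFF from its header metadata instead of the pixel data."""
    with rasterio.open(file_path) as src:
        # Only header information is read here, so this stays fast even for 20 GB files
        parts = [
            str(src.width), str(src.height), str(src.count),
            str(src.dtypes), str(src.crs), str(src.transform),
            str(sorted(src.tags().items())),
        ]
    return hashlib.sha256('|'.join(parts).encode()).hexdigest()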

JesJehle commented 3 days ago

Nice. We need to keep in mind that at some point it will be necessary to compute the hash client-side, before the upload starts, for example to inform the user that the file is already in the db. So we need to define the same method in JS and Python; hence, simplicity is key. We don't have GDAL on the client, so adding metadata to the hash could be problematic.

mmaelicke commented 3 days ago

I think as long as you are reading the exact same bytes on both sides, you should be good. SHA256 is deterministic.

I am not entirely sure how much the metadata headers change between two very, very similar images, as they are encoded in the first few kB of the tif. If the images are identical in the metadata header and have, e.g., a transparent corner, we might end up with the same SHA256. But that could easily be tested.
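
If someone wants to run that test, a throwaway check with the sampling function from above could look like this (the file paths are placeholders):

from pathlib import Path

# Two orthos that only differ in a few pixels, e.g. a transparent corner
id_a = quick_file_id(Path('ortho_original.tif'))
id_b = quick_file_id(Path('ortho_with_transparent_corner.tif'))
print(id_a, id_b, id_a == id_b)  # a collision here would mean the sampled hash is too coarse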