exasol / bucketfs-python

BucketFS utilities for the Python programming language
https://exasol.github.io/bucketfs-python
MIT License

šŸž Performance Regression in `bucketfs-python` Compared to `curl` and Previous API #92

Closed · ckunki closed this 6 months ago

ckunki commented 8 months ago

@ahsimb reports bucketfs-python to be multiple times slower than curl.

Summary

The new bucketfs-python API is significantly slower when transferring large files (multiple MBs/GBs) compared to using curl and the previous API version.
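
For context, the curl baseline amounts to a single authenticated HTTP PUT against the bucket. A minimal Python equivalent using `requests` is sketched below; host, port, bucket, file name, and the write password are placeholders, and the conventional BucketFS write user `w` is assumed.

```python
import requests

# Hedged sketch of the curl baseline: one authenticated, streaming HTTP PUT.
# Host, port, bucket, file name, and password are placeholders.
url = "http://localhost:2580/mybucket/large_file.bin"

with open("large_file.bin", "rb") as f:
    # Streams the file, analogous to:
    #   curl -X PUT -T large_file.bin http://w:pw@localhost:2580/mybucket/large_file.bin
    response = requests.put(url, data=f, auth=("w", "write_password"))
response.raise_for_status()
```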

Reproducing the Issue

Reproducibility: always

Steps to reproduce the behavior:

  1. Use the new bucketfs-python API to upload a large file (several MBs or GBs):

```python
import pickle

import exasol.bucketfs as bfs  # type: ignore

# Placeholders: supply your own BucketFS URL, credentials, bucket name,
# object to upload (obj), and target file name.
bucketfs = bfs.Service(buckfs_url, buckfs_credentials)
bucket = bucketfs[bucket_name]
bucket.upload(bfs_file_name, pickle.dumps(obj))
```

  2. Compare the upload time with that of curl and the older bucketfs-python API method (a hedged usage sketch of the old API follows this list). Old API:
     exasol_bucketfs_utils_python.bucketfs_location.BucketFSLocation.upload_fileobj_to_bucketfs
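
For illustration, a call via the old API might look like the sketch below. The construction of the `BucketFSLocation` instance is omitted, and the argument order (file object first, then the target path in the bucket) is an assumption.

```python
import pickle
from io import BytesIO

# Hedged sketch: `bucketfs_location` is assumed to be an already configured
# exasol_bucketfs_utils_python.bucketfs_location.BucketFSLocation instance;
# `obj` and `bfs_file_name` are the same placeholders as in step 1.
payload = BytesIO(pickle.dumps(obj))
bucketfs_location.upload_fileobj_to_bucketfs(payload, bfs_file_name)
```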

Expected Behaviour

The new bucketfs-python API should offer performance comparable to the old API and, ideally, to tools like curl.

Actual Behaviour

The upload process with the new API is significantly slower than using curl and the previous API version, affecting efficiency and throughput for large file transfers.

ahsimb commented 8 months ago

It appears to be slower than using `exasol_bucketfs_utils_python.bucketfs_location.BucketFSLocation.upload_fileobj_to_bucketfs`.

ahsimb commented 8 months ago

The code that seems slow in comparison is this:

```python
import pickle

import exasol.bucketfs as bfs  # type: ignore

# Same placeholders as in the reproduction steps above.
bucketfs = bfs.Service(buckfs_url, buckfs_credentials)
bucket = bucketfs[bucket_name]
bucket.upload(bfs_file_name, pickle.dumps(obj))
```

tkilias commented 8 months ago

@ahsimb I had a look at the code of both implementations. The actual upload is handled identically. However, the new implementation fetches all buckets on the service before returning the bucket. @Nicoretti do we really need to read the buckets before returning a bucket object? Or do we want to add an option to disable that?

So the question is: is it only significantly slower for small files, or also for larger files?
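
One way to settle this is to time the bucket lookup (which includes the extra listing request) separately from the transfer itself. A minimal sketch, reusing the placeholders from the reproduction steps:

```python
import pickle
import time

import exasol.bucketfs as bfs  # type: ignore

service = bfs.Service(buckfs_url, buckfs_credentials)

start = time.perf_counter()
bucket = service[bucket_name]  # triggers the bucket-listing request
lookup_seconds = time.perf_counter() - start

data = pickle.dumps(obj)
start = time.perf_counter()
bucket.upload(bfs_file_name, data)  # the actual transfer
upload_seconds = time.perf_counter() - start

print(f"lookup: {lookup_seconds:.3f}s, upload: {upload_seconds:.3f}s")
```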

Nicoretti commented 8 months ago

@tkilias what do you mean by "fetches all buckets" (do you mean the listing of which buckets are available)? @tkilias afaik @ahsimb said it is slow for large files (GBs) or multiple MBs.

tkilias commented 8 months ago

@Nicoretti yes, I mean the listing. If it is slow for large files, it is mysterious.

Nicoretti commented 8 months ago

@tkilias we should establish a "performance" regression test for the 2-3 types of access, ensuring that any variation falls within a predefined epsilon range. Any suggestion what a reasonable epsilon would be here?
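
A hedged sketch of what such a test could look like; `upload_via_old_api`, `upload_via_new_api`, and the `payload` fixture are hypothetical helpers, and the epsilon value is only an example, not a recommendation:

```python
import time

EPSILON = 0.25  # example tolerance: new API may be at most 25% slower


def measure(upload_fn, payload) -> float:
    """Return the wall-clock duration of one upload in seconds."""
    start = time.perf_counter()
    upload_fn(payload)
    return time.perf_counter() - start


def test_new_api_within_epsilon_of_old_api(payload):
    old = measure(upload_via_old_api, payload)  # hypothetical helper
    new = measure(upload_via_new_api, payload)  # hypothetical helper
    assert new <= old * (1 + EPSILON)
```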

ahsimb commented 6 months ago

I did some timing uploading files of about 1/4 GB in size to the Docker-DB and couldn't see a difference between the old and the new interface. The new interface has the overhead of getting a list of buckets from the server, which is a separate HTTP(S) request, but for large files this overhead is relatively small.
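
For reference, the experiment can be approximated along these lines; URL, credentials, and bucket name are placeholders, and the ~256 MB payload is random bytes rather than a pickled object:

```python
import os
import time

import exasol.bucketfs as bfs  # type: ignore

payload = os.urandom(256 * 1024 * 1024)  # roughly 1/4 GB of random data

service = bfs.Service(buckfs_url, buckfs_credentials)
bucket = service[bucket_name]

start = time.perf_counter()
bucket.upload("timing_test.bin", payload)
print(f"new API upload took {time.perf_counter() - start:.1f}s")
```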