MuellerConstantin / PyQvd

Utility library for reading/writing Qlik View Data (QVD) files in Python.
https://pypi.org/project/PyQvd/
MIT License

Allow reading QVD files from other sources than filesystem #2

Closed msimmoteit-neozo closed 5 months ago

msimmoteit-neozo commented 5 months ago

Hi,

I'm watching this library with great interest. I was wondering whether the API could be changed to accept QVD files as bytes objects or as a file handle supplied by the caller.

MuellerConstantin commented 5 months ago

I'm not completely sure if I understand what you mean... Do you want an option for constructing a QVD file based on a bytes object? Could you please post some minimal code example of the method/API you want?

msimmoteit-neozo commented 5 months ago

Thanks for your quick response. Yeah, that was pretty much my idea. Currently the usage looks like this:

from pyqvd import QvdDataFrame

df = QvdDataFrame.from_qvd('sample.qvd')
print(df.head(5))

But for use cases where a QVD file is not stored on disk, for example in object storage, it would be convenient not to have to write it to disk first:

from pyqvd import QvdDataFrame
from google.cloud.storage import Client

client = Client()
bucket = client.get_bucket(MYBUCKET)
blob = bucket.get_blob(MYFILE)
downloaded_file = blob.download_as_bytes()
df = QvdDataFrame.from_qvd_bytes(downloaded_file)
print(df.head(5))

But I think it can sometimes be unwieldy to work with bytes directly (bytes, because the raw data in .qvd files would cause decode errors if read as strings). For the specific use case of object storage there is a library called smart_open that implements Python's file API on top of object stores. If PyQvd had an API to read from Python file objects, it could look like this:

from pyqvd import QvdDataFrame
import smart_open

with smart_open.open("url/to/my/object", "rb") as fin:
    df = QvdDataFrame.from_qvd_file(fin)

print(df.head(5))
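
The two proposals above are closely related: wrapping a bytes object in io.BytesIO yields a binary file object, so a single stream-based entry point could serve both cases. A minimal sketch of that idea, using only the standard library; to_binary_stream is a hypothetical helper and not part of PyQvd, and the payload is a made-up stand-in rather than a valid QVD file:

```python
import io


def to_binary_stream(source):
    """Return a binary file object for raw bytes or an existing binary stream.

    Hypothetical helper for illustration only; not part of PyQvd.
    """
    if isinstance(source, (bytes, bytearray)):
        # io.BytesIO exposes in-memory bytes through the regular file API.
        return io.BytesIO(source)
    if isinstance(source, (io.RawIOBase, io.BufferedIOBase)):
        # Already a binary stream, so hand it back unchanged.
        return source
    raise TypeError(
        f"expected bytes or a binary stream, got {type(source).__name__}"
    )


# Stand-in for blob.download_as_bytes(): a text-like header followed by
# raw bytes that are not valid UTF-8.
downloaded = b"<header/>\r\n\x00\xff\xfe\x01"

stream = to_binary_stream(downloaded)
buffer = stream.read()

# Reading the same payload through a text layer fails, which is why the
# file has to be opened with mode "rb" rather than "r".
try:
    io.TextIOWrapper(io.BytesIO(downloaded), encoding="utf-8").read()
    text_mode_failed = False
except UnicodeDecodeError:
    text_mode_failed = True
```

With a helper like this, a from_qvd_bytes style method would only be a thin wrapper around the file-based one.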

I think this would be nice and generic as in this case the _read_data method could go from this:

def _read_data(self):
    """
    Reads the data of the QVD file into memory.
    """
    with open(self._path, 'rb') as file:
        self._buffer = file.read()

to this:

import io

def _read_data(self):
    """
    Reads the data of the QVD file into memory.
    """
    # io.IOBase is the common base class of the text, buffered, and raw
    # I/O classes, so a single check covers all of them.
    if not isinstance(self._file, io.IOBase):
        raise TypeError("Expected a file object, e.g. one opened with mode \"rb\"")
    try:
        self._buffer = self._file.read()
    except UnicodeDecodeError as e:
        # A text-mode file tries to decode the binary data and fails.
        raise Exception("Supply a raw file access. Use mode \"rb\" instead of mode \"r\"") from e

MuellerConstantin commented 5 months ago

I like your request. I hadn't thought about object storage, or about storage other than the local file system in general, until now... So if I understand you correctly, you suggest that it should be possible to pass an I/O stream as an alternative to a string path to QvdFileReader or QvdFileWriter, right? Sounds like a very useful extension!

msimmoteit-neozo commented 5 months ago

Exactly! Thank you so much for your consideration.

MuellerConstantin commented 5 months ago

I started working on it and added a feature branch. The commit ccb8604a40b45aa8f0a53b093a499db11f6ee85d adds a first version of an extended API that supports reading from (from_stream()) and writing to (to_stream()) binary streams as an alternative to files. I modified your suggestion and limited the supported streams to binary streams (text-based streams, e.g. TextIOBase, are not supported). The binary stream must be an instance of a subclass of RawIOBase or BufferedIOBase.

from pyqvd import QvdDataFrame

some_stream = ...
df = QvdDataFrame.from_stream(some_stream)

...

other_stream = ...
df.to_stream(other_stream)

Feel free to check it out and comment on it. If there are no changes or objections, I will include the feature in the next minor release.
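
For illustration, the restriction to binary streams described above could be checked roughly as follows. This is only a sketch under stated assumptions, not the code from the commit: require_binary_stream, read_from_stream and write_to_stream are hypothetical names, and io.BytesIO stands in for a real file or object-storage stream.

```python
import io


def require_binary_stream(stream):
    """Return the stream if it is binary, otherwise raise TypeError.

    Hypothetical check for illustration; PyQvd's actual code may differ.
    """
    if isinstance(stream, io.TextIOBase):
        raise TypeError("text streams are not supported, open with mode 'rb' or 'wb'")
    if not isinstance(stream, (io.RawIOBase, io.BufferedIOBase)):
        raise TypeError("stream must derive from RawIOBase or BufferedIOBase")
    return stream


def read_from_stream(stream):
    # Read the remaining content of a binary stream into memory.
    return require_binary_stream(stream).read()


def write_to_stream(data, stream):
    # Write raw bytes to a binary stream.
    require_binary_stream(stream).write(data)


# Round trip through an in-memory binary stream.
payload = b"\x00\x01binary payload\xff"
out = io.BytesIO()
write_to_stream(payload, out)
out.seek(0)
round_trip = read_from_stream(out)

# A text stream such as io.StringIO is rejected up front.
try:
    read_from_stream(io.StringIO("text"))
    rejected_text = False
except TypeError:
    rejected_text = True
```

Files opened with open(path, "rb") and io.BytesIO objects both pass this check, while files opened in text mode do not.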

MuellerConstantin commented 5 months ago

The requested feature of reading/writing binary streams as an alternative to local disk files is included in the next minor release v1.1.0.

msimmoteit-neozo commented 5 months ago

Thank you so much for this. It works great.