MuellerConstantin / PyQvd

Utility library for reading/writing Qlik View Data (QVD) files in Python.
https://pypi.org/project/PyQvd/
MIT License

Reading huge files #4

Open falko100 opened 1 week ago

falko100 commented 1 week ago

I have a QVD file of 900 MB and it either won't read or takes too long. Is there any way to show a progress bar while reading the file? Or at least some sort of indicator?

MuellerConstantin commented 1 week ago

Short Answer:

Of course this library has its limitations in terms of data size. It cannot deliver the same read performance as a full Qlik Sense installation. 900 MB is a lot; I assume we are talking about several tens of millions of rows, right?

Long Answer:

On the other hand, I agree that it makes sense to track progress when reading large files. One could think about a mechanism similar to pandas: pandas offers the option to read large files chunk by chunk, and between chunks you can update a progress bar, with tqdm for example.
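The pandas pattern mentioned above can be sketched like this (a minimal example using an in-memory CSV as a stand-in for a large file on disk; the chunk size is arbitrary):

```python
import io

import pandas as pd
from tqdm import tqdm

# A small in-memory CSV stands in for a large file on disk.
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

rows_read = 0
# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of a single DataFrame; tqdm wraps the iterator and updates
# the progress bar once per chunk.
for chunk in tqdm(pd.read_csv(csv_data, chunksize=4), desc="reading"):
    rows_read += len(chunk)

print(rows_read)  # 10
```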

The problem is the way QVD stores its data. In a QVD file the data isn't stored in a plain tabular format like in a CSV. The cell values are stored in a separate table (the Symbol Table), and the actual rows (the Index Table) only contain value indices, which must be decoded. So you would have to read the entire Symbol Table at the beginning in order to be able to decode the rows. The individual rows of the Index Table could then be read in chunks, similar to pandas. If the problem with a large QVD file is that it has too many rows, then chunked reading helps. However, if the size is (also) due to many distinct cell values, this approach is only of limited help, because the entire Symbol Table has to be read up front for decoding anyway.
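A toy illustration of the layout described above (this mimics the idea, not the real QVD binary format): each column's distinct values live in a symbol table, and each row stores only indices into those tables.

```python
# Per-column symbol tables: each holds the column's distinct values.
symbol_tables = {
    "Country": ["DE", "FR", "NL"],
    "Year": [2022, 2023],
}

# The index table stores rows as tuples of indices into the symbol
# tables, here as (Country index, Year index).
index_table = [(0, 1), (2, 0), (0, 0)]

# Decoding a row requires the complete symbol tables, which is why they
# must be read up front even if the index table is streamed in chunks.
columns = list(symbol_tables)
rows = [
    {col: symbol_tables[col][idx] for col, idx in zip(columns, indices)}
    for indices in index_table
]

print(rows[0])  # {'Country': 'DE', 'Year': 2023}
```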

So what's the shape of your 900 MB QVD file? Does it have that many rows? I have to check for a way to integrate chunking into the existing major version's API. Any suggestions?

falko100 commented 1 week ago

I don't have Qlik myself; these files are supplied by my client. Indeed they are very big: even Sublime Text has trouble opening the CSV after conversion. A slightly smaller .qvd managed to get read and written to a CSV. That one has 5,310,170 rows.

Functionally, I'd like to see a reader with a pointer-based system, so you can trigger reading one or many rows and handle them before the next rows are read. If you expose the total number of rows and the current position of the pointer, people are free to add their own progress bars.

MuellerConstantin commented 5 days ago

Sounds similar to the pandas approach. We would have to introduce an additional option for specifying the chunk/iteration size:

from pyqvd import QvdTable

itr = QvdTable.from_qvd("path/to/file.qvd", chunksize=100)

for chunk in itr:
    ...

To implement this, we would need a class analogous to pandas' TextFileReader, which acts as an iterator and in this case is returned instead of a QvdTable.
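A minimal sketch of what such an iterator class could look like (the class name, constructor arguments, and properties are all hypothetical, not part of the current PyQvd API; a plain list stands in for the decoded Index Table):

```python
class ChunkedReader:
    """Hypothetical chunked reader: yields rows in fixed-size chunks and
    exposes total_rows and position so callers can drive a progress bar."""

    def __init__(self, records, chunksize):
        self._records = records      # stands in for the decoded Index Table
        self._chunksize = chunksize
        self._pos = 0

    @property
    def total_rows(self):
        return len(self._records)

    @property
    def position(self):
        return self._pos

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._records):
            raise StopIteration
        chunk = self._records[self._pos : self._pos + self._chunksize]
        self._pos += len(chunk)
        return chunk


reader = ChunkedReader(list(range(10)), chunksize=4)
sizes = [len(chunk) for chunk in reader]
print(sizes, reader.position)  # [4, 4, 2] 10
```

Exposing total_rows and position as properties matches the suggestion above: callers can build their own progress indicator (e.g. with tqdm's total argument) without the library depending on any particular progress-bar package.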