ss756 opened 1 year ago
Hello, I'm currently running into issues when handling large data files. While I don't have any experience with Dask, I suspect a more straightforward solution would be to introduce a "chunksize" parameter, similar to the one in pandas, so that larger files can be processed in smaller pieces.
Would you like to work on this together?
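For reference, this is a minimal sketch of the pandas "chunksize" pattern the suggestion is modeled on (using an in-memory CSV here purely for illustration): passing `chunksize` to `pd.read_csv` returns an iterator of DataFrames instead of loading the whole file at once.

```python
import io

import pandas as pd

# 10 rows of toy CSV data standing in for a large file on disk
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# With chunksize, read_csv yields DataFrames of at most 4 rows each
chunks = []
for chunk in pd.read_csv(csv_data, chunksize=4):
    chunks.append(len(chunk))

print(chunks)  # [4, 4, 2]
```

A `chunksize` parameter on the QVD reader could expose the same iterator-style interface.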
Hi @ss756 , I forked the main branch and started working on the solution I mentioned. I introduced a new function named fetch_rows
that takes the number of rows to load as a parameter:
https://github.com/hugotallys/qvd-utils/blob/e4d089a9cf3e33c8e71a16fd359d3ba025e49bed/src/lib.rs#L38
In the Python wrapper (the qvd_reader.py file) the function can be called as:
```python
import pandas as pd

def read(file_name):
    reader = QvdReader(file_name)
    data = reader.fetch_rows(10)
    data = reader.fetch_rows(5)  # note: currently re-reads from the start of the file
    df = pd.DataFrame.from_dict(data)
    return df
```
I am still adjusting the implementation, as it only ever fetches the first N rows of a QVD file (each call of fetch_rows
reopens the file and reads the same lines). If you're interested in improving the solution for this issue, we can work on the fork as collaborators.
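One way to fix the re-reading behavior is to keep a cursor in the reader so that successive calls advance through the file. The sketch below is hypothetical: `ChunkedReader` and its in-memory row list stand in for `QvdReader` and the Rust-side parsing, which would need the same idea (a persistent offset between calls).

```python
class ChunkedReader:
    """Toy stand-in for a stateful QVD reader with a persistent cursor."""

    def __init__(self, rows):
        self._rows = rows   # stand-in for rows parsed from the QVD file
        self._offset = 0    # cursor advanced by every fetch_rows call

    def fetch_rows(self, n):
        # Return the next n rows, then move the cursor past them
        chunk = self._rows[self._offset:self._offset + n]
        self._offset += len(chunk)
        return chunk

reader = ChunkedReader(list(range(15)))
print(reader.fetch_rows(10))  # rows 0-9
print(reader.fetch_rows(5))   # rows 10-14, not rows 0-4 again
```

In the actual fork the offset would live on the Rust side, so the file does not need to be reopened and re-parsed on every call.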
Hi @SBentley , would you be interested in adding functionality to interface this repository with Dask? We can work on this together.