ss756 opened 1 year ago
Hello, I'm currently running into issues when handling large data files. While I don't have any experience with Dask, I suspect a more straightforward solution would be to introduce a "chunksize" parameter, similar to the one in pandas, so that larger files can be processed in smaller pieces.
Would you like to work on this together?
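For reference, this is a minimal sketch of the pandas "chunksize" pattern the suggestion is modeled on (using an in-memory CSV here purely for illustration): passing `chunksize` to `pd.read_csv` returns an iterator of DataFrames instead of loading the whole file at once.

```python
import io

import pandas as pd

# 10 rows of toy CSV data standing in for a large file on disk
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# With chunksize, read_csv yields DataFrames of at most 4 rows each
chunks = []
for chunk in pd.read_csv(csv_data, chunksize=4):
    chunks.append(len(chunk))

print(chunks)  # [4, 4, 2]
```

A `chunksize` parameter on the QVD reader could expose the same iterator-style interface.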
Hi @ss756 , I forked the main branch and started working on the solution I mentioned. I introduced a new function named fetch_rows
that takes the number of rows to load as a parameter:
https://github.com/hugotallys/qvd-utils/blob/e4d089a9cf3e33c8e71a16fd359d3ba025e49bed/src/lib.rs#L38
In the Python wrapper (the qvd_reader.py file) the function can be called as:
```python
import pandas as pd

def read(file_name):
    reader = QvdReader(file_name)
    data = reader.fetch_rows(10)
    data = reader.fetch_rows(5)  # note: currently re-reads from the start of the file
    df = pd.DataFrame.from_dict(data)
    return df
```
I am still adjusting the implementation, as it only ever fetches the first N rows of a QVD file (each call of fetch_rows
reopens the file and reads the same lines). If you're interested in improving the solution for this issue, we can work on the fork as collaborators.
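One way to fix the re-reading behavior is to keep a cursor in the reader so that successive calls advance through the file. The sketch below is hypothetical: `ChunkedReader` and its in-memory row list stand in for `QvdReader` and the Rust-side parsing, which would need the same idea (a persistent offset between calls).

```python
class ChunkedReader:
    """Toy stand-in for a stateful QVD reader with a persistent cursor."""

    def __init__(self, rows):
        self._rows = rows   # stand-in for rows parsed from the QVD file
        self._offset = 0    # cursor advanced by every fetch_rows call

    def fetch_rows(self, n):
        # Return the next n rows, then move the cursor past them
        chunk = self._rows[self._offset:self._offset + n]
        self._offset += len(chunk)
        return chunk

reader = ChunkedReader(list(range(15)))
print(reader.fetch_rows(10))  # rows 0-9
print(reader.fetch_rows(5))   # rows 10-14, not rows 0-4 again
```

In the actual fork the offset would live on the Rust side, so the file does not need to be reopened and re-parsed on every call.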
Hi @SBentley , would you be interested in adding functionality to interface this repository with Dask? We can work on this together.