Open jjacobson95 opened 7 months ago
This seems like a good move. It'd be nice to move all the build scripts to polars as well, but that will break a lot.
Great, I fully agree with that too.
Just to be explicit (I thought it was assumed), we should make sure that the forward facing code is still in pandas, so that users dont have to learn polars to use the package.
I thought the goal was to migrate the package from pandas to polars? Pandas is slower and has memory issues. If users have to learn one of the two, they might as well learn the one that is more optimized for our data?
Everything in the package that uses pandas is user-facing.
As this code is nearly complete, we could have two versions if we wanted? - DatasetLoader_pd() and DatasetLoader_pl()?
But having two versions would also increase time for future code development such as the upcoming DataTypeLoader.
no, that'd be a nightmare. Just add a flag called 'use_polars' and set the default to False. HOw you imagine this working when #111 is complete?
Oh yeah, thats a much better solution (using a flag). For #111, I assume we should also create a pandas and polars version as well (with the same flag option)?
I was hoping it'd speed up things on the backend, but polars is too new to really rely on at this stage. Basically check the flag and add in a to_pandas
at the end of every function. I dont think this requires 'maintaining two versions of the codebase'. We can set the flag to default to True
if polars becomes the standard that pandas is today.
@sgosline We've briefly spoken about this before and it came up after my comp bio presentation as well. I'd like to do this before addressing #111.
Currently the package is build from Python, converting it all to Polars will speed up operations and make it less susceptible to memory issues as we continue to add more data. Even now, I could see users on low memory devices potentially having issues.
This update should address the following 7 files: