PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.
Other
11 stars 3 forks source link

Convert Python Package from Pandas to Polars #159

Open jjacobson95 opened 4 months ago

jjacobson95 commented 4 months ago

@sgosline We've briefly spoken about this before and it came up after my comp bio presentation as well. I'd like to do this before addressing #111.

Currently the package is build from Python, converting it all to Polars will speed up operations and make it less susceptible to memory issues as we continue to add more data. Even now, I could see users on low memory devices potentially having issues.

This update should address the following 7 files:

sgosline commented 4 months ago

This seems like a good move. It'd be nice to move all the build scripts to polars as well, but that will break a lot.

jjacobson95 commented 4 months ago

Great, I fully agree with that too.

sgosline commented 4 months ago

Just to be explicit (I thought it was assumed), we should make sure that the forward facing code is still in pandas, so that users dont have to learn polars to use the package.

jjacobson95 commented 4 months ago

I thought the goal was to migrate the package from pandas to polars? Pandas is slower and has memory issues. If users have to learn one of the two, they might as well learn the one that is more optimized for our data?

Everything in the package that uses pandas is user-facing.

jjacobson95 commented 4 months ago

As this code is nearly complete, we could have two versions if we wanted? - DatasetLoader_pd() and DatasetLoader_pl()?

But having two versions would also increase time for future code development such as the upcoming DataTypeLoader.

sgosline commented 4 months ago

no, that'd be a nightmare. Just add a flag called 'use_polars' and set the default to False. HOw you imagine this working when #111 is complete?

jjacobson95 commented 4 months ago

Oh yeah, thats a much better solution (using a flag). For #111, I assume we should also create a pandas and polars version as well (with the same flag option)?

sgosline commented 4 months ago

I was hoping it'd speed up things on the backend, but polars is too new to really rely on at this stage. Basically check the flag and add in a to_pandas at the end of every function. I dont think this requires 'maintaining two versions of the codebase'. We can set the flag to default to True if polars becomes the standard that pandas is today.