cal-adapt / climakitae

A Python toolkit for retrieving, visualizing, and performing scientific analyses with data from the Cal-Adapt Analytics Engine.
https://climakitae.readthedocs.io
BSD 3-Clause "New" or "Revised" License
19 stars 2 forks source link

Make updates to data_export #329

Closed elehmer closed 5 months ago

elehmer commented 6 months ago

Description of PR

Making changes to data_export to remove redundant code, clean up, and give more buffer to user partition   Summary of changes and related issue

Relevant motivation and context JupyterHub user partitions have filled up and frozen users out and the large files that did it were netCDF exports. This PR should resolve this in most cases.

Dependencies required for this change?

Fixes # (issue), delete if not necessary

Type of change

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

mdkoenig commented 6 months ago

Look into Export partition issues

Tianchi-Liu commented 5 months ago

I love how elegant the code is now with the refactor!

Tianchi-Liu commented 5 months ago

While I’m still doing tests, I have a question about the option to specify the save location. I am not sure if the users need this option, or how it helps with the user partitions issue. Could you clarify the motivation for adding it?

elehmer commented 5 months ago

While I’m still doing tests, I have a question about the option to specify the save location. I am not sure if the users need this option, or how it helps with the user partitions issue. Could you clarify the motivation for adding it?

I added it because I was having an issue where I couldn't download large files from the HUB, but that has resolved itself. It is invisible to the notebooks (no changes required). I think it is a nice option to have if you know the file is too big before hand or it could be useful of us when doing scripting in the future.

Tianchi-Liu commented 5 months ago

While I’m still doing tests, I have a question about the option to specify the save location. I am not sure if the users need this option, or how it helps with the user partitions issue. Could you clarify the motivation for adding it?

I added it because I was having an issue where I couldn't download large files from the HUB, but that has resolved itself. It is invisible to the notebooks (no changes required). I think it is a nice option to have if you know the file is too big before hand or it could be useful of us when doing scripting in the future.

Got it. I can definitely see it becoming useful for developers. I think it would also be nice if the users have the option to save to s3, in case they run into issues downloading from the hub. Users, however, could use more guidance. The case when the local option cannot be honored also needs to be handled. This might be a larger discussion. For this PR, I’d suggest keeping the option in a hidden function for developers until user-facing features are fully developed.

elehmer commented 5 months ago

While I’m still doing tests, I have a question about the option to specify the save location. I am not sure if the users need this option, or how it helps with the user partitions issue. Could you clarify the motivation for adding it?

I added it because I was having an issue where I couldn't download large files from the HUB, but that has resolved itself. It is invisible to the notebooks (no changes required). I think it is a nice option to have if you know the file is too big before hand or it could be useful of us when doing scripting in the future.

Got it. I can definitely see it becoming useful for developers. I think it would also be nice if the users have the option to save to s3, in case they run into issues downloading from the hub. Users, however, could use more guidance. The case when the local option cannot be honored also needs to be handled. This might be a larger discussion. For this PR, I’d suggest keeping the option in a hidden function for developers until user-facing features are fully developed.

Ah! I see my error in logic! Thanks! There is a scenario where they could specify "local" and fill up the drive! Oops!

elehmer commented 5 months ago

So, there seems to me to be a practical limit to what can be exported to CSV. At somewhere between 0.75-1.00GB DataArray size (at least for WRF) the conversion to pandas DataFrame exceeds the available memory on the HUB and kernel crashes. So I have added logic to handle that. Estimating CSV file size before generating it is very hard as its relative size compared to the memory size of the array varies. Very small memory arrays end up taking more space (even compressed) on export. Larger arrays end up being roughly 2x the size of the compressed CSV (although the uncompress CSV is 10x the array size!). So the 0.85GB limit I have imposed seems like a good idea. I would like some testing around this to optimize this threshold.

elehmer commented 5 months ago

While I’m still doing tests, I have a question about the option to specify the save location. I am not sure if the users need this option, or how it helps with the user partitions issue. Could you clarify the motivation for adding it?

I added it because I was having an issue where I couldn't download large files from the HUB, but that has resolved itself. It is invisible to the notebooks (no changes required). I think it is a nice option to have if you know the file is too big before hand or it could be useful of us when doing scripting in the future.

Got it. I can definitely see it becoming useful for developers. I think it would also be nice if the users have the option to save to s3, in case they run into issues downloading from the hub. Users, however, could use more guidance. The case when the local option cannot be honored also needs to be handled. This might be a larger discussion. For this PR, I’d suggest keeping the option in a hidden function for developers until user-facing features are fully developed.

Ah! I see my error in logic! Thanks! There is a scenario where they could specify "local" and fill up the drive! Oops!

5a408b1 Should fix this

elehmer commented 5 months ago

Ability to estimate file sizes (CSV)

elehmer commented 4 months ago

Can't download big file