SPL-ethz / snow

A Python package describing Stochastic Nucleation of Water in vials
https://spl-ethz.github.io/snow
MIT License

Saving process for generated data is slow #38

Closed ltdeck closed 2 years ago

ltdeck commented 2 years ago

When saving the pallet files for the 3D model validation, I noticed an excessive runtime for the first set of data saved; see the figure below:

[screenshot: slowdatasaving]

The issue refers to the current version of the snowfall_validation file. However, I noticed similar behavior before as well; due to the smaller scale of the box and 2-D shelf simulations, it likely went unnoticed for quite some time.

ltdeck commented 2 years ago

Labeled as enhancement, since it does not affect the results.

ltdeck commented 2 years ago

[screenshot: pallet08_runtime]

Saving the pallet simulation (total time 6e6, storage temperature -8°C) required nearly one hour. On the one hand, this is slower than it should be; on the other hand, it is not a major issue compared to the overall computational time (12 h).

bizzinho commented 2 years ago

hey @ltdeck ; ok, will look into it.

For future reference, please paste code as text rather than screenshots, so I can just copy & paste it myself ;) Also, the minimum working example I was referencing in Slack means something I can paste into an empty notebook (or some such) and run start to finish to reproduce the error. That's what I meant ;)

bizzinho commented 2 years ago

Now, in general, this might be slow because numpy.savetxt goes row by row through the array and formats each value with the default exponential format ('%.18e'). This can be helped a bit by specifying a lighter format, such as integers (fmt='%i'), which should work just fine for these arrays. Alternatively, you can cast to a pandas DataFrame and use its to_csv method, which writes differently and is almost as fast as pre-specifying the format:

[screenshot: timing comparison]
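A minimal sketch of the comparison described above (array contents, sizes, and file names are illustrative, not taken from the package):

```python
import time
import numpy as np
import pandas as pd

# Illustrative integer-valued data, stand-in for the pallet arrays
data = np.random.randint(0, 6_000_000, size=(100_000, 3))

t0 = time.perf_counter()
np.savetxt("default.csv", data, delimiter=",")        # default '%.18e' formatting
t_default = time.perf_counter() - t0

t0 = time.perf_counter()
np.savetxt("int.csv", data, fmt="%i", delimiter=",")  # lighter integer format
t_int = time.perf_counter() - t0

t0 = time.perf_counter()
pd.DataFrame(data).to_csv("pandas.csv", index=False, header=False)
t_pandas = time.perf_counter() - t0

print(f"default: {t_default:.2f}s  %i: {t_int:.2f}s  pandas: {t_pandas:.2f}s")
```

The integer format also shrinks the files considerably, since '%.18e' spends roughly 25 characters per value.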

Of course, I wouldn't necessarily recommend CSV for such big files anyway. Pickles or the .npy format that you get with numpy.save might be better in many cases. Since these are basically vectors, there is probably not much to gain from something entirely different like Parquet or Feather.
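For reference, a minimal .npy round-trip (the array here is illustrative): binary saving skips text formatting entirely and preserves float values exactly.

```python
import numpy as np

data = np.random.rand(1_000_000)  # illustrative vector

# Binary round-trip: no text formatting or parsing involved
np.save("data.npy", data)
loaded = np.load("data.npy")
assert np.array_equal(data, loaded)  # exact, lossless storage
```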

Now, as for the slow first execution of savetxt: I'm not sure, I wasn't able to replicate this with smaller Snowflakes (40x40x1). Maybe you can share a full code snippet to reproduce it? Otherwise we can close this issue, as it is more a numpy issue than a snowflake one.

ltdeck commented 2 years ago

Thanks for looking into it. I prepared a quick sample notebook, "validation_snowfall_snow", that you can run right away; it takes about 30 s to execute for me. Saving the first time was about 30 times slower than the subsequent times.

ltdeck commented 2 years ago

I tested saving the data as integers. It was faster, similar to your case, but it did not resolve the general issue of slow first-time saving. The file sizes are indeed smaller, although we should be careful about using the integer format for the nucleation temperatures.
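The caution about nucleation temperatures can be illustrated with a small sketch (temperature values are made up): an integer format like '%i' simply truncates the fractional degrees, which are meaningful for nucleation data.

```python
import numpy as np
from io import StringIO

temps = np.array([-8.4, -9.7, -10.2])  # illustrative nucleation temperatures in °C

buf = StringIO()
np.savetxt(buf, temps, fmt="%i")  # '%i' truncates toward zero
buf.seek(0)
restored = np.loadtxt(buf)
print(restored)  # fractional degrees are lost: [ -8.  -9. -10.]
```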

bizzinho commented 2 years ago

Hey @ltdeck , I suspect the problem is actually not np.savetxt after all; see the screenshot below. Can you try this yourself and confirm?

[screenshot: timing results]

ltdeck commented 2 years ago

@bizzinho interesting point. I just tried it as well, and it is the same for me: the first call of the nucleationTemperatures function requires more time. The same happens in reverse if the solidificationTimes function is called first; whichever is called first takes more time. So it is not a saving issue after all.

bizzinho commented 2 years ago

hey @ltdeck , I made a change in a new branch (see 332d25f) that seems to cut the time it takes to compute this in half, can you check it out?

In short, the issue was that the first call to any of these functions in Snowfall builds the underlying DataFrame containing all the statistics. This was done by iteratively calling the individual to_frame() methods of the Snowflakes, which is not the most efficient approach, since there is a lot of repetition (group IDs, vial IDs).
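The general pattern behind this fix can be sketched as follows (this is a generic illustration, not the package's actual code; the column names are made up): growing a DataFrame by repeated concatenation of per-item frames is quadratic-ish, while collecting plain records and building the frame once is much faster.

```python
import time
import numpy as np
import pandas as pd

n_items = 2000

def frames_iteratively():
    # Anti-pattern: concatenate one small frame per item
    df = None
    for i in range(n_items):
        one = pd.DataFrame({"group": [i % 10], "vial": [i], "Tnuc": [np.random.rand()]})
        df = one if df is None else pd.concat([df, one], ignore_index=True)
    return df

def frame_once():
    # Collect plain records first, build the DataFrame a single time
    records = [{"group": i % 10, "vial": i, "Tnuc": np.random.rand()}
               for i in range(n_items)]
    return pd.DataFrame.from_records(records)

t0 = time.perf_counter(); df_a = frames_iteratively(); t_iter = time.perf_counter() - t0
t0 = time.perf_counter(); df_b = frame_once(); t_once = time.perf_counter() - t0
print(f"iterative: {t_iter:.2f}s  single build: {t_once:.3f}s")
```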

ltdeck commented 2 years ago

Hey @bizzinho I tested and can confirm that it is indeed faster now. Good job, let's close this issue then.

bizzinho commented 2 years ago

Requires additional unit tests before it can be marked as completed.