PyPSA / linopy

Linear optimization with N-D labeled arrays in Python
https://linopy.readthedocs.io
MIT License

Make polars frames lazy and stream into csv #294

Open coroa opened 4 months ago

coroa commented 4 months ago

Tests run fine. The extra pyarrow dependency should not hurt, since Arrow is already a requirement for polars (and soon for pandas as well); pyarrow is only the Python frontend on top of that.

We should check, for each invocation of write_lazyframe, that explain(streamable=True) shows the query can actually run through the streaming pipeline.

If you decide to merge, please squash (the history is ugly :))

codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.65%. Comparing base (f0c8457) to head (d82e97e).

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   89.69%   89.65%   -0.05%
==========================================
  Files          16       16
  Lines        4019     4021       +2
  Branches      939      941       +2
==========================================
  Hits         3605     3605
- Misses        281      284       +3
+ Partials      133      132       -1
```


FabianHofmann commented 4 months ago

Hey @coroa, thanks for your PR. According to the profiler, the lazy operation takes very long.

Original pandas-based: (memory profile figure: mem-pandas)

Polars-based (non-lazy): (memory profile figure: mem-polars-non-lazy)

Polars-based (lazy): (memory profile figure: mem-lazy)

Code for running the benchmark:

```python
import threading
import time

import matplotlib.pyplot as plt
import psutil
import pypsa
import seaborn as sns

sns.set_style("whitegrid")

# Flag to control the monitoring loop
stop_monitoring = False

# List to store memory usage samples (in MB)
memory_values = []

def monitor_memory_usage(interval=0.1):
    """Sample this process's resident memory every `interval` seconds."""
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024**2)
        time.sleep(interval)

# Start monitoring memory usage in a daemonized background thread
monitor_thread = threading.Thread(target=monitor_memory_usage, daemon=True)
monitor_thread.start()

# The workload under test
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()

m.to_file("test.lp", io_api="lp-polars")

# Stop monitoring and wait for the sampler to finish
stop_monitoring = True
monitor_thread.join()

# Plot the memory usage over time
plt.plot(memory_values)
plt.xlabel("Time (in 0.1 s intervals)")
plt.ylabel("Memory Usage (MB)")
plt.title("Memory Usage Over Time")
plt.savefig("mem-polars-non-lazy.png")
print(max(memory_values))
```

fneum commented 4 months ago

Interesting that there are no memory savings in either case compared to the other two.

coroa commented 4 months ago

Thanks for the profiling. Very disappointing.

coroa commented 4 months ago

It's possible that .values.reshape(-1) is not zero-copy.
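That is easy to check in isolation with numpy (a minimal sketch, not linopy's actual data path): `reshape(-1)` returns a zero-copy view only when the array is contiguous, and silently copies otherwise.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# C-contiguous array: flattening is a zero-copy view
flat = a.reshape(-1)
print(np.shares_memory(a, flat))  # True

# Non-contiguous array (e.g. a transpose): flattening must copy
b = a.T
flat_b = b.reshape(-1)
print(np.shares_memory(b, flat_b))  # False
```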

coroa commented 4 months ago

The lazy version has to do everything at least twice, since the check_nulls step already needs to evaluate everything (that could be improved). I don't know where the factor of 4 comes from, though.

coroa commented 4 months ago

I'll try to debug a bit around to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann

FabianHofmann commented 4 months ago

> I'll try to debug a bit around to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann

Cool, but no rush; it seems to be stable for the moment. I think it should be independent of the xarray version.