amakelov / mandala

A simple & elegant experiment tracking framework that integrates persistence logic & best practices directly into Python
Apache License 2.0

`BLOB longer than INT_MAX bytes` with 100s of MB of data from data frames #16

Open etep opened 1 month ago

etep commented 1 month ago
File some-path/.venv/lib/python3.12/site-packages/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

OverflowError: BLOB longer than INT_MAX bytes

Hi @amakelov,

My setup is all Python and roughly the following: functions are imported into a notebook from around the code base, and the notebook invokes those functions to:

  1. Read data frames (each 100s of MB in size).
  2. Transform the value columns and create X and y (these transformations are likely to evolve).
  3. Run ML fits and out-of-sample evaluations on the results.

I was excited to try @op and wanted it to memoize the results of (1) and (2) above. I think the error comes from results that are "too big", i.e. they exceed the capacity of a single row in the database (when I ran the notebook with @op on a small X matrix, it appeared to work).

etep commented 1 month ago

Hi @amakelov,

To follow up, I tried this experiment:

  1. Loaded the data frames (100s of MBs).
  2. Transformed the values, and selected 5000 rows (so X is something like 5000x10).
  3. Fit the model.

I have @op decorating some functions: generate_features, get_xs_and_ys, and fit. Importantly, load_data_frames is not @op-decorated, although it is called from inside an @op-decorated function, roughly as sketched below.
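
The structure, roughly (function bodies simplified and the file path hypothetical; only the decoration pattern matters):

from mandala.imports import *
import numpy as np
import pandas as pd

def load_data_frames(paths):
    # NOT @op-decorated: reads the large raw frames (100s of MB each)
    return [pd.read_parquet(p) for p in paths]

@op
def generate_features(paths):
    # @op-decorated, but calls the undecorated loader internally
    dfs = load_data_frames(paths)
    return pd.concat(dfs)  # stand-in for the real value-column transformations

@op
def get_xs_and_ys(features, n_rows=5000):
    sample = features.iloc[:n_rows]
    return {'X': sample.to_numpy(),      # roughly 5000 x 10
            'y': np.zeros(len(sample))}  # placeholder target

@op
def fit(data):
    # placeholder for the real ML fit / out-of-sample evaluation
    return data['X'].mean(axis=0)

storage = Storage()
with storage:
    features = generate_features(['prices.parquet'])  # hypothetical path
    data = get_xs_and_ys(features)
    model = fit(data)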

I can run this code on a small/toy data frame and it all works (e.g. re-running after a full restart skips all the work - loading the data frames, scanning the rows, transforming values - and returns the same results). When I point it back at the data I actually want to work with - even though I stop at 5000 rows - I get OverflowError: BLOB longer than INT_MAX bytes. That seemed strange, since there should not be that much data to memoize.

etep commented 1 month ago

So far, I've found two issues:

  1. At some point, I wanted to unwrap something inside of an @op-decorated function, so I passed the storage object in as a function argument. This may be the root cause of some other strange behavior, but it definitely caused a large blob insertion into the db.
  2. I had @op-decorated a function that accepted the data frames as input arguments. Arguably, this one is in scope (i.e. as compared to issue (1) above), but the data frames were very big, so the db insert failed. This was easy enough to fix -- but perhaps there is a way to guard against, or notify users about, caching something that is very large? Both issues are sketched below.
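
To make both concrete, the shape of the offending code was roughly this (simplified; the unwrap call and the function names are just illustrative):

from mandala.imports import *
import pandas as pd

storage = Storage()

# Issue 1: passing the Storage object itself into an @op so a value can be
# unwrapped inside the call; the storage is then treated as a regular input
# and serialized into the db as one large blob.
@op
def transform_values(df_ref, storage):
    df = storage.unwrap(df_ref)  # or however the value was being unwrapped
    return df.select_dtypes('number') + 1

# Issue 2: @op-decorating a function that takes the full data frames
# (100s of MB each) as inputs, so the whole frames are serialized into
# single SQLite rows.
@op
def generate_features(df: pd.DataFrame):
    return df.select_dtypes('number') + 1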

Catching the failure at its inception would be helpful on both points above.

And... why do we need to store the content of the input arguments? Is the hash/uid/cid/hid not enough to uniquify the memoized output?
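
Just to make the question concrete: a toy memoizer like the one below keys results purely on a content hash of the inputs and never persists the inputs themselves (this is not how mandala is implemented, only an illustration of what I mean):

import functools
import hashlib
import pickle

def content_hash(obj) -> str:
    # toy content id: hash of the pickled value
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()

memo_table = {}  # (function name, input hashes) -> output

def memoize_by_hash(f):
    @functools.wraps(f)
    def wrapper(*args):
        key = (f.__name__, tuple(content_hash(a) for a in args))
        if key not in memo_table:
            memo_table[key] = f(*args)  # only the output is stored
        return memo_table[key]
    return wrapper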

amakelov commented 1 month ago

Thanks @etep for reporting this. I see several separate things discussed here that I'll respond to individually:

Storing large objects

I've been able to kind-of reproduce with the code below, though I get a different kind of error:

from mandala.imports import *
import pandas as pd
import numpy as np

works = np.random.uniform(size=(10000, 10000))
fails = np.random.uniform(size=(20000, 10000))

@op
def inc(x):
    return x + 1

storage = Storage()

with storage:
    inc(works) # this works fine it seems

with storage:
    inc(fails) # too big, fails

which fails with:

File [...]/mandala/storage_utils.py:140, in SQLiteDictStorage.set(self, key, value, conn)
    136 @transaction
    137 def set(
    138     self, key: str, value: Any, conn: Optional[sqlite3.Connection] = None
    139 ) -> None:
--> 140     conn.execute(
    141         f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
    142         (key, serialize(value)),
    143     )

InterfaceError: Error binding parameter 1 - probably unsupported type.

I think we might be getting a different error message because INT_MAX is set at compile-time for SQLite; or due to some other difference between versions/compile-time configs (not very confident though).

At any rate, it is impractical to save large objects in SQLite, so I propose automatically storing large objects in an "overflow directory" as .joblib files. This will be much more efficient for reads/writes too!
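
A minimal sketch of the idea (the class, names, and threshold here are placeholders, not the actual implementation):

import joblib
from pathlib import Path

class OverflowingDictStorage:
    # Toy sketch: small serialized values live in an in-memory dict (standing
    # in for the SQLite table); values above a size threshold are spilled to
    # .joblib files in an overflow directory.
    def __init__(self, overflow_dir: str, threshold_mb: float = 10.0):
        self.small = {}
        self.overflow_dir = Path(overflow_dir)
        self.overflow_dir.mkdir(parents=True, exist_ok=True)
        self.threshold_bytes = int(threshold_mb * 1024 * 1024)

    def set(self, key: str, serialized: bytes) -> None:
        if len(serialized) > self.threshold_bytes:
            joblib.dump(serialized, self.overflow_dir / f"{key}.joblib")
        else:
            self.small[key] = serialized

    def get(self, key: str) -> bytes:
        path = self.overflow_dir / f"{key}.joblib"
        if path.exists():
            return joblib.load(path)
        return self.small[key]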

Add the ability not to save @op inputs

There are several ways to enable this, in decreasing order of flexibility:

I'm in favor of the 1st bullet point, though I can see how the 2nd and 3rd can be useful complements. Let me know if you have any thoughts!

etep commented 1 month ago

Hi @amakelov,

Yes, the issue I had reported was initially caused by some amount of learning curve on my side, but I agree we can split it into its individual components. I will take a look at your proposals and comments and get back to you. Thank you!

amakelov commented 1 week ago

e6422b1 added preliminary functionality to store large values in joblib files instead of SQLite.

Here's a small example:

from mandala.imports import *
import numpy as np

storage = Storage(db_path='test.db', overflow_dir='overflow/', overflow_threshold_MB=10)

@op
def make_array(n: int) -> np.ndarray:
    return np.random.rand(n)

with storage:
    for i in (10, 100, 1000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000,):
        make_array(i)

The two largest arrays will be stored in the overflow/ dir (at least on my machine): an array of n float64 values takes about 8 * n bytes, so only the n = 10_000_000 (~80 MB) and n = 100_000_000 (~800 MB) arrays exceed the 10 MB threshold.
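
To double-check which values spilled over, listing the overflow directory and its file sizes is enough (standard library only):

import os

# with the 10 MB threshold above, only the two largest arrays
# (~80 MB and ~800 MB of float64 data) should show up here
for name in sorted(os.listdir('overflow/')):
    size_mb = os.path.getsize(os.path.join('overflow', name)) / 1e6
    print(f'{name}: {size_mb:.1f} MB')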