It seems that the process gets slower as I deal with larger datasets. This is becoming urgent for me, so I am currently analyzing it. One idea I have is to optimize the `__hash__` function.
Hi! Can you be more specific about the problem you're encountering?
Assuming it's about the tracking of large global variables - there is a fix now (though it could be documented better). The idea is to simply assign a hash to large global variables at definition time (by wrapping them in a `Ref` object). This way, they get hashed only once at startup. The solution is explained and illustrated in the tutorials.
If this is still too much overhead, consider simply not using global variables. You can replace a global variable with a call to an `@op` that returns the value, e.g.:
```python
from typing import Any
from mandala.imports import op  # assuming the usual mandala import path

@op
def get_my_global() -> Any:
    # some logic to read a file or whatever
    GLOBAL_VAL = ...
    return GLOBAL_VAL
```
This way, you only serialize and hash the value once, and then only pass the wrapped object to other `@op`s. A recent issue regarding storage of large values may be of interest: #16. One thing to note, though, is that this approach will store a copy of the global value in the `mandala` storage. Currently `mandala` doesn't have good support for storing large objects in custom formats (it's either a SQLite blob or a `joblib` dump) - but issues/proposals are welcome!
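To illustrate the pattern, here is a rough usage sketch; the `mandala.imports` path, the in-memory `Storage()` default, and the `process` op are my own assumptions, so adapt it to whatever the tutorials show:

```python
from typing import Any
from mandala.imports import Storage, op  # assumed import path

storage = Storage()  # assuming this defaults to an in-memory storage

@op
def process(data) -> Any:
    # hypothetical downstream op that consumes the wrapped value
    return data

with storage:
    val = get_my_global()   # serialized/hashed once; returns a wrapped Ref
    result = process(val)   # pass the Ref to other @ops instead of the raw global
```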
Finally, keep in mind that the versioning part of `mandala` still has a lot of kinks to iron out, and I appreciate you opening more issues about it!
Thank you! I will look into it more. Currently I am not familiar with the concept of CF.
I thought the hashing part was the problem, but when I ran the profiler, persistence I/O was the biggest bottleneck.
I am using ArcticDB (on another project) as my pandas backend for massively big data. (ArcticDB is a solid, battle-tested library - it is very good.)
I'm thinking it would probably be faster to build the backend with ArcticDB, so I'm implementing that.
I am reviewing this project's code because I like it very much and want to contribute a lot. Thank you so much for creating such a project.
Could you please review whether https://github.com/man-group/ArcticDB could be adopted as a persistence layer (at the same layer as SQLite)? I think it seems quite possible (performance-wise), and their LMDB backend is really fast.
Additional information: I'm dealing with 15 TB of market micro data (130 TB uncompressed) with ArcticDB.
I'd like to know what you think before I do a full-scale implementation.
P.S. Another candidate is https://github.com/ibis-project/ibis.
Never mind, I'll just give it a try.
Hi - sorry for missing this the first time!
I don't have time to do a deep dive on this, but I can point out some things that might help.
All the implementation details of the storage backends are confined to the file https://github.com/amakelov/mandala/blob/master/mandala/storage_utils.py, and it should hopefully stay this way in the future, so add any new implementations there and make sure they follow the appropriate interfaces.
The storages are divided along two main dimensions: `Call` objects vs everything else (this being: "atoms" - which are serialized `Ref` values; "shapes" - which are `Ref`s with the `.obj` field removed; "ops" - which are function objects decorated with `@op` and their metadata; and "sources" - which optionally tracks versioning information).

I think the thing you're interested in is creating another implementation of `SQLiteDictStorage` (https://github.com/amakelov/mandala/blob/master/mandala/storage_utils.py#L148) that is based on ArcticDB. This class is essentially a key-value store, so it doesn't matter if you use a full relational database or something simpler, and it should be easy to implement over ArcticDB.
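For what it's worth, here is a rough sketch of what such a key-value store might look like over ArcticDB. The interface methods (`exists` / `save` / `load` / `delete` / `keys`) are placeholders - mirror whatever `SQLiteDictStorage` in `storage_utils.py` actually exposes - and the ArcticDB calls should be double-checked against the version you use:

```python
from typing import Any, Iterable
import arcticdb as adb  # pip install arcticdb


class ArcticDictStorage:
    """Hypothetical key-value store over ArcticDB; method names are placeholders
    and should mirror the actual SQLiteDictStorage interface in storage_utils.py."""

    def __init__(self, uri: str = "lmdb://./arctic_storage", library: str = "mandala"):
        self.ac = adb.Arctic(uri)
        self.lib = self.ac.get_library(library, create_if_missing=True)

    def exists(self, key: str) -> bool:
        return self.lib.has_symbol(key)

    def save(self, key: str, value: Any) -> None:
        # write_pickle handles arbitrary Python objects; plain write() is meant
        # for pandas/numpy data and should be much faster for DataFrames
        self.lib.write_pickle(key, value)

    def load(self, key: str) -> Any:
        return self.lib.read(key).data

    def delete(self, key: str) -> None:
        self.lib.delete(key)

    def keys(self) -> Iterable[str]:
        return self.lib.list_symbols()
```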
I also recommend looking at the recently added `JoblibDictStorage`, which can now be optionally used for values that serialize to something above a threshold. Ideally, you will be able to benchmark ArcticDB against this simple alternative.
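A quick-and-dirty way to get a first comparison, before wiring anything into mandala itself, might be to time round-trips of a representative value through joblib and through an ArcticDB library (the paths, payload sizes, and ArcticDB calls below are assumptions for illustration):

```python
import time
import joblib
import numpy as np
import pandas as pd
import arcticdb as adb

# representative payload; replace with something that looks like your real data
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

# joblib round-trip (roughly what a joblib-based storage does under the hood)
t0 = time.perf_counter()
joblib.dump(df, "/tmp/bench_value.joblib")
loaded = joblib.load("/tmp/bench_value.joblib")
print("joblib round-trip:", time.perf_counter() - t0, "s")

# ArcticDB round-trip over LMDB
ac = adb.Arctic("lmdb:///tmp/arctic_bench")
lib = ac.get_library("bench", create_if_missing=True)
t0 = time.perf_counter()
lib.write("value", df)
loaded = lib.read("value").data
print("arcticdb round-trip:", time.perf_counter() - t0, "s")
```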
Good luck!
Related to #18: if we ever tracked a global variable that's very large, all subsequent `with storage:` contexts will try to check this variable for changes (and thus waste time hashing it) even if this variable is no longer used anywhere. More generally, there should be an easy way to forget dependencies.
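A minimal sketch of the situation being described - the import path and the `deps_path` argument for enabling dependency tracking are assumptions on my part, so treat this as pseudocode for the failure mode rather than a verified reproduction:

```python
import numpy as np
from mandala.imports import Storage, op  # assumed import path

BIG_GLOBAL = np.random.rand(50_000, 1_000)  # ~400 MB array

# assumed: versioning / dependency tracking enabled on the storage
storage = Storage(deps_path="__main__")

@op
def uses_big_global() -> float:
    return float(BIG_GLOBAL.sum())

with storage:
    uses_big_global()  # BIG_GLOBAL becomes a tracked dependency

# Even if no later code ever touches BIG_GLOBAL again, every subsequent
# `with storage:` block re-checks (and thus re-hashes) it to detect changes.
with storage:
    pass
```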