It seems that the process gets slower as I deal with larger datasets. This is becoming urgent for me, so I am currently analyzing it. One idea I have is to optimize the `__hash__` function.
Hi! Can you be more specific about the problem you're encountering?
Assuming it's about the tracking of large global variables - there is a fix now (though it could be documented better). The idea is to simply assign a hash to large global variables at definition time (by wrapping them in a `Ref` object). This way, they get hashed only once at startup. The solution is explained and illustrated in the tutorials.
If this is still too much overhead, consider simply not using global variables. You can replace a global variable with a call to an `@op` that returns the value, e.g.:
```python
from typing import Any
from mandala.imports import op  # assuming the usual mandala import path

@op
def get_my_global() -> Any:
    # some logic to read a file or whatever
    GLOBAL_VAL = ...
    return GLOBAL_VAL
```
This way, you only serialize and hash the value once, and then only pass the wrapped object to other `@op`s. A recent issue regarding storage of large values may be of interest: #16. One thing to note, though, is that this approach will store a copy of the global value in the `mandala` storage. Currently `mandala` doesn't have good support for storing large objects in custom formats (it's either a SQLite blob or a `joblib` dump) - but issues/proposals are welcome!
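To illustrate the pattern, here is a rough usage sketch; the `mandala.imports` path, the in-memory `Storage()` default, and the `process` op are my own assumptions, so adapt it to whatever the tutorials show:

```python
from typing import Any
from mandala.imports import Storage, op  # assumed import path

storage = Storage()  # assuming this defaults to an in-memory storage

@op
def process(data) -> Any:
    # hypothetical downstream op that consumes the wrapped value
    return data

with storage:
    val = get_my_global()   # serialized/hashed once; returns a wrapped Ref
    result = process(val)   # pass the Ref to other @ops instead of the raw global
```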
Finally, keep in mind that the versioning part of `mandala` still has a lot of kinks to iron out, and I appreciate you opening more issues about it!
Thank you! I will look into it more. Currently I am not familiar with the concept of CF.
I thought the hashing part was the problem, but when I ran the profiler, persistence I/O was the biggest bottleneck.
I am using ArcticDB (on another project) as my pandas backend for massively big data. (ArcticDB is a solid, battle-tested library - it is very good.)
I'm thinking it would probably be faster to build the backend with ArcticDB, so I'm implementing that.
I am reviewing this project's code because I like it very much and want to contribute a lot. Thank you so much for creating such a project.
Could you please review whether https://github.com/man-group/ArcticDB could be adopted as a persistence layer (at the same layer as SQLite)? I think it seems quite possible (performance-wise), and their LMDB backend is really fast.
Additional information: I'm dealing with 15 TB of market micro data (130 TB uncompressed) with ArcticDB.
I'd like to know what you think before I do a full-scale implementation.
P.S. Another candidate is https://github.com/ibis-project/ibis.
Never mind, I'll just give it a try.
Hi - sorry for missing this the first time!
I don't have time to do a deep dive on this, but I can point out some things that might help.
All the implementation details of the storage backends are confined to the file https://github.com/amakelov/mandala/blob/master/mandala/storage_utils.py, and it should hopefully stay this way in the future, so add any new implementations there and make sure they follow the appropriate interfaces.
The storages are divided along two main dimensions: `Call` objects vs everything else (this being: "atoms" - which are serialized `Ref` values; "shapes" - which are `Ref`s with the `.obj` field removed; "ops" - which are function objects decorated with `@op` and their metadata; and "sources" - which optionally tracks versioning information).

I think the thing you're interested in is creating another implementation of `SQLiteDictStorage` (https://github.com/amakelov/mandala/blob/master/mandala/storage_utils.py#L148) that is based on ArcticDB. This class is essentially a key-value store, so it doesn't matter if you use a full relational database or something simpler, and it should be easy to implement over ArcticDB.
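For what it's worth, here is a rough sketch of what such a key-value store might look like over ArcticDB. The interface methods (`exists` / `save` / `load` / `delete` / `keys`) are placeholders - mirror whatever `SQLiteDictStorage` in `storage_utils.py` actually exposes - and the ArcticDB calls should be double-checked against the version you use:

```python
from typing import Any, Iterable
import arcticdb as adb  # pip install arcticdb


class ArcticDictStorage:
    """Hypothetical key-value store over ArcticDB; method names are placeholders
    and should mirror the actual SQLiteDictStorage interface in storage_utils.py."""

    def __init__(self, uri: str = "lmdb://./arctic_storage", library: str = "mandala"):
        self.ac = adb.Arctic(uri)
        self.lib = self.ac.get_library(library, create_if_missing=True)

    def exists(self, key: str) -> bool:
        return self.lib.has_symbol(key)

    def save(self, key: str, value: Any) -> None:
        # write_pickle handles arbitrary Python objects; plain write() is meant
        # for pandas/numpy data and should be much faster for DataFrames
        self.lib.write_pickle(key, value)

    def load(self, key: str) -> Any:
        return self.lib.read(key).data

    def delete(self, key: str) -> None:
        self.lib.delete(key)

    def keys(self) -> Iterable[str]:
        return self.lib.list_symbols()
```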
I also recommend looking at the recently added `JoblibDictStorage`, which can now be optionally used for values that serialize to something above a threshold. Ideally, you will be able to benchmark ArcticDB against this simple alternative.
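A quick-and-dirty way to get a first comparison, before wiring anything into mandala itself, might be to time round-trips of a representative value through joblib and through an ArcticDB library (the paths, payload sizes, and ArcticDB calls below are assumptions for illustration):

```python
import time
import joblib
import numpy as np
import pandas as pd
import arcticdb as adb

# representative payload; replace with something that looks like your real data
df = pd.DataFrame(np.random.rand(1_000_000, 10), columns=[f"c{i}" for i in range(10)])

# joblib round-trip (roughly what a joblib-based storage does under the hood)
t0 = time.perf_counter()
joblib.dump(df, "/tmp/bench_value.joblib")
loaded = joblib.load("/tmp/bench_value.joblib")
print("joblib round-trip:", time.perf_counter() - t0, "s")

# ArcticDB round-trip over LMDB
ac = adb.Arctic("lmdb:///tmp/arctic_bench")
lib = ac.get_library("bench", create_if_missing=True)
t0 = time.perf_counter()
lib.write("value", df)
loaded = lib.read("value").data
print("arcticdb round-trip:", time.perf_counter() - t0, "s")
```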
Good luck!
Related to #18: if we ever tracked a global variable that's very large, all subsequent `with storage:` contexts will try to check this variable for changes (and thus waste time hashing it) even if this variable is no longer used anywhere. More generally, there should be an easy way to forget dependencies.
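A minimal sketch of the situation being described - the import path and the `deps_path` argument for enabling dependency tracking are assumptions on my part, so treat this as pseudocode for the failure mode rather than a verified reproduction:

```python
import numpy as np
from mandala.imports import Storage, op  # assumed import path

BIG_GLOBAL = np.random.rand(50_000, 1_000)  # ~400 MB array

# assumed: versioning / dependency tracking enabled on the storage
storage = Storage(deps_path="__main__")

@op
def uses_big_global() -> float:
    return float(BIG_GLOBAL.sum())

with storage:
    uses_big_global()  # BIG_GLOBAL becomes a tracked dependency

# Even if no later code ever touches BIG_GLOBAL again, every subsequent
# `with storage:` block re-checks (and thus re-hashes) it to detect changes.
with storage:
    pass
```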