Open Benny-Nottonson opened 5 months ago
Hi @josiahls, I think data pipelining is quite a complex, but important, topic that Basalt might not focus on in the near future. From what I have read, torchdata suffers from a lack of lower-level control over things like multiprocessing, and even though that should be possible in Mojo, other than algorithm.parallelize it doesn't have something like a threading API (yet!).
For sure, type safety and safely passing references to the data without copies should, and must, be possible. As a first rework of the current dataloader (which simply loads all data into memory), I think an ultra-simple pipeline that 'chunk-loads' the data into memory and passes it to the model should be the goal. Additionally, Mojo might have an edge here with its very convenient, easy-to-use compile-time features. Are you perhaps interested in trying this out?
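To make the 'chunk-load' idea concrete, here is a minimal sketch in Python (not Mojo, and not Basalt's actual dataloader). The names `chunk_loader` and `train_step` are hypothetical, and slicing an in-memory sequence stands in for reading one chunk from disk.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_loader(source: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive chunks so only one chunk is materialised at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(source), chunk_size):
        # Slicing is a stand-in for reading one chunk from disk or remote storage.
        yield list(source[start:start + chunk_size])


def train_step(batch: List[int]) -> int:
    # Hypothetical stand-in for a model forward/backward pass.
    return sum(batch)


dataset = range(10)  # placeholder for a dataset that does not fit in memory
results = [train_step(chunk) for chunk in chunk_loader(dataset, chunk_size=4)]
```

Because the loader is a generator, the consumer pulls one chunk at a time and the previous chunk can be freed before the next one is read.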
Thinking long term, I can see cloud storage integration and distributed computing being massively important here as well. I wonder if that was one of the considerations behind the torchdata re-design.
I don't want to lick the cookie, but one of the things I'm excited about in Mojo is the type safety / memory management.
What are everyone's thoughts on torchdata? I experimented with building an RL framework, fastrl, on top of it.
Pros
Cons
How do we know that A -> B -> C is valid in Python? E.g., how do we know those pipes plug into each other correctly? Python doesn't have type safety, so unless we somehow check the signatures in Python / use pydantic, this doesn't appear possible.
If you have A -> B -> C, and there is an exception in A, you will get a long stack trace all the way up the pipeline. I feel like this might be the Achilles' heel of a lot of pipeline dataloader frameworks.
Some things I'm seeing that would be needed from Mojo:
Major blockers
Iterable / Iterator / Gettable traits that pipes can implement.
Minor needs
yield / coroutines. I think a working pipeline can hack around this for now.
I'm curious what other frameworks / libs people have used, liked, disliked.
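On the question in the cons of whether A -> B -> C can be validated in Python: a rough sketch of checking signatures at pipeline construction time, using only the standard library `inspect` module. The helper `check_pipeline` and the stages `a`, `b`, `c` are hypothetical, and the comparison is deliberately naive (exact annotation equality, no subtyping or generics), so this illustrates the idea rather than replacing real type safety.

```python
import inspect
from typing import Callable, Sequence


def check_pipeline(stages: Sequence[Callable]) -> None:
    """Raise TypeError if a stage's return annotation differs from the next stage's first parameter."""
    for upstream, downstream in zip(stages, stages[1:]):
        produced = inspect.signature(upstream).return_annotation
        params = list(inspect.signature(downstream).parameters.values())
        if not params:
            raise TypeError(f"{downstream.__name__} accepts no input")
        expected = params[0].annotation
        link = f"{upstream.__name__} -> {downstream.__name__}"
        # Unannotated stages cannot be checked, so treat them as errors.
        if inspect.Parameter.empty in (produced, expected):
            raise TypeError(f"{link}: missing type annotation")
        if produced != expected:
            raise TypeError(f"{link}: produces {produced!r}, expects {expected!r}")


def a(seed: int) -> str:
    return str(seed)


def b(text: str) -> float:
    return float(text)


def c(value: float) -> bool:
    return value > 0


check_pipeline([a, b, c])  # passes silently: str feeds str, float feeds float

try:
    check_pipeline([a, c])  # a produces str, c expects float
except TypeError as err:
    mismatch = str(err)
```

The check runs once when the pipeline is assembled, so a mismatch surfaces as a single short error instead of a deep stack trace at iteration time. A compiled language with traits could reject the same mistake at compile time.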