Redo Dataset and Dataloader

I don't want to lick the cookie, but one of the things I'm excited about in Mojo is the type safety / memory management.

What are everyone's thoughts on torchdata? I experimented with building an RL framework on top of it, fastrl.

Pros

Linking pipelines was very cool and making transforms was easy. I could build some very complex pipelines with it, both personally and for work.
Very horizontal inheritance. Learning how to do custom stuff / tear apart torchdata was very easy since the hierarchy was basically flat. All pipes inherited from IterDataPipe or MapDataPipe. I think onboarding new users is a lot easier because of this.
- I was surprised how important this was. An issue I've seen in a lot of data-loading frameworks is that they turn into OOP hell and thus become very hard to extend. From talking to research friends who tried using / extending it, my understanding is that Ray Data has this issue.
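To make the "flat hierarchy" point concrete, here is a minimal sketch in plain Python. It does not use torchdata; `Pipe`, `Source`, `Mapper` and `Batcher` are made-up names that only mimic the idea of every pipe inheriting directly from one base class and wrapping its upstream pipe.

```python
from typing import Callable, Iterable, Iterator


class Pipe:
    """Hypothetical single base class; every pipe inherits directly from it."""

    def __iter__(self) -> Iterator:
        raise NotImplementedError


class Source(Pipe):
    """Wraps any iterable so it can start a pipeline."""

    def __init__(self, items: Iterable):
        self.items = items

    def __iter__(self) -> Iterator:
        return iter(self.items)


class Mapper(Pipe):
    """Applies fn to every item coming from the upstream pipe."""

    def __init__(self, upstream: Pipe, fn: Callable):
        self.upstream = upstream
        self.fn = fn

    def __iter__(self) -> Iterator:
        for item in self.upstream:
            yield self.fn(item)


class Batcher(Pipe):
    """Groups upstream items into lists of at most batch_size."""

    def __init__(self, upstream: Pipe, batch_size: int):
        self.upstream = upstream
        self.batch_size = batch_size

    def __iter__(self) -> Iterator:
        batch = []
        for item in self.upstream:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch
            yield batch


# Linking pipes is just nesting constructors.
pipeline = Batcher(Mapper(Source(range(5)), lambda x: x * 2), batch_size=2)
result = list(pipeline)
```

Writing a new pipe here means subclassing `Pipe` and implementing `__iter__`, nothing else, which is roughly why I found the real thing easy to tear apart.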

Cons

The future of torchdata is hazy: the maintainers noted, vaguely and unhelpfully, that they need to redesign some things. Below are my guesses as to why.
Limitations related to Python:
- How do you verify pipeline A -> B -> C is valid in Python? e.g. how do we know those pipes plug into each other correctly? Python doesn't enforce type annotations, so unless we somehow check the signatures ourselves / use pydantic, this doesn't appear possible.
- How do you pass values / references between pipes reliably? e.g. you want to cache data at certain points in the pipeline, but don't want to duplicate the data from earlier in the pipeline.
- If you have a pipeline and want to do multiprocessing, how do you nicely get around the Python GIL?
  - torchdata was recently testing a DataLoader2 that uses pub/sub messaging, but it doesn't look like that got anywhere?
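For the "check the signatures in Python" idea, a rough sketch of what that could look like with only the standard library. `validate_pipeline` and the three stage functions are hypothetical, and the check is deliberately naive: it only compares annotations for exact equality, so it knows nothing about subclasses or generics and cannot catch a stage that lies about its annotations.

```python
import inspect
from typing import Callable, Sequence, get_type_hints


def validate_pipeline(stages: Sequence[Callable]) -> None:
    """Raise TypeError if a stage's return annotation does not equal
    the annotation of the next stage's first parameter."""
    for upstream, downstream in zip(stages, stages[1:]):
        link = f"{upstream.__name__} -> {downstream.__name__}"
        params = list(inspect.signature(downstream).parameters)
        if not params:
            raise TypeError(f"{link}: downstream stage takes no input")
        out_type = get_type_hints(upstream).get("return")
        in_type = get_type_hints(downstream).get(params[0])
        if out_type is None or in_type is None:
            raise TypeError(f"{link}: missing type annotations")
        if out_type != in_type:
            raise TypeError(f"{link}: produces {out_type}, expects {in_type}")


# Hypothetical stages, for illustration only.
def load(path: str) -> bytes:
    return path.encode()


def decode(raw: bytes) -> str:
    return raw.decode()


def tokenize(text: str) -> list:
    return text.split()


validate_pipeline([load, decode, tokenize])  # passes silently

try:
    validate_pipeline([load, tokenize])  # bytes does not plug into str
    mismatch_message = None
except TypeError as err:
    mismatch_message = str(err)
```

This runs when the pipeline is built rather than when data flows through it, which is about as close as plain Python gets to a compile-time check.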
Limitations not related to Python:
- Exception messages in pipelines (torchdata or not) are simply awful. If you have a pipeline A -> B -> C, and there is an exception in A, you will get a long stack trace all the way up the pipeline. I feel like this might be the Achilles' heel of a lot of pipeline dataloader frameworks.
  - I think Mojo has inlining / nodebug capabilities that can make this not so bad (skip internal functions), which would otherwise not be possible in Python (?)
  - This probably needs an innovation: modify the exception / stack trace when using the pipelines so the stack traces are easier to read.
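One cheap version of "modify the exception" in plain Python, as a sketch: wrap each stage so a failure is re-raised as a single error that names the stage and the offending item. `PipelineError` and `run_stage` are made-up names, not anything from torchdata. This does not shorten the traceback itself (the original error is kept as the cause); it only makes the final message point at the right stage.

```python
from typing import Any, Callable, Iterable, Iterator


class PipelineError(Exception):
    """Hypothetical wrapper that records which stage failed and on what item."""

    def __init__(self, stage: str, item: Any, cause: BaseException):
        super().__init__(
            f"stage '{stage}' failed on item {item!r}: "
            f"{type(cause).__name__}: {cause}"
        )
        self.stage = stage
        self.item = item


def run_stage(name: str, fn: Callable, upstream: Iterable) -> Iterator:
    """Apply fn to each upstream item, tagging any failure with the stage name."""
    for item in upstream:  # upstream errors propagate untouched
        try:
            result = fn(item)
        except Exception as exc:
            # Chain the original exception so no information is lost.
            raise PipelineError(name, item, exc) from exc
        yield result


def parse(x: str) -> int:
    return int(x)


def double(x: int) -> int:
    return x * 2


# Pipeline: source -> parse -> double, with one bad record in the source.
stages = run_stage("double", double, run_stage("parse", parse, ["1", "2", "oops"]))

failed = None
try:
    list(stages)
except PipelineError as err:
    failed = (err.stage, err.item)
```

Because only the `fn(item)` call sits inside the `try`, an error raised in an earlier stage passes through later stages without being re-wrapped, so the stage name in the message is always the one that actually failed.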

Some things I'm seeing that would be needed from Mojo:

Major blockers

Iterable / Iterator / Gettable traits that pipes can implement.
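For anyone who hasn't worked with traits, the closest Python analogue I know of is `typing.Protocol`. The sketch below is Python, not Mojo, and `IterablePipe`, `GettablePipe`, `Squares` and `Countdown` are made-up names. It only shows the shape of the idea: a pipe declares what it can do, and consumers check for that capability rather than for a base class.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class IterablePipe(Protocol):
    """Anything that can be iterated over (streaming access)."""

    def __iter__(self) -> Iterator: ...


@runtime_checkable
class GettablePipe(Protocol):
    """Anything that supports random access by index."""

    def __getitem__(self, index: int): ...

    def __len__(self) -> int: ...


class Squares:
    """Satisfies both protocols structurally, without inheriting from them."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, index: int) -> int:
        if not 0 <= index < self.n:
            raise IndexError(index)
        return index * index

    def __iter__(self) -> Iterator[int]:
        return (self[i] for i in range(self.n))


class Countdown:
    """Streaming only: iterable, but no random access."""

    def __init__(self, start: int):
        self.start = start

    def __iter__(self) -> Iterator[int]:
        return iter(range(self.start, 0, -1))


squares_is_gettable = isinstance(Squares(3), GettablePipe)
countdown_is_gettable = isinstance(Countdown(3), GettablePipe)
countdown_is_iterable = isinstance(Countdown(3), IterablePipe)
```

Note that `isinstance` against a runtime-checkable protocol only checks that the methods exist, not their signatures, which is exactly the gap I'd hope Mojo traits close at compile time.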

Minor needs

yield / coroutines. I think a working pipeline can hack around this for now.

I'm curious what other frameworks / libs people have used, liked, disliked.

Hi @josiahls, I think data pipelining is quite a complex but important topic that Basalt probably won't be focusing on in the near future. From what I've read, torchdata suffers from a lack of lower-level control over things like multiprocessing. Even though that should be possible in Mojo, other than algorithm.parallelize it doesn't have anything like a threading API (yet!).

For sure, type safety and safely passing references to the data without copies will and must be possible. As a first rework of the current dataloader (which simply loads all data into memory), I think the goal should be an ultra-simple pipeline that 'chunk-loads' the data into memory and passes it to the model that way. Additionally, Mojo might have an edge here with its very convenient and easy-to-use compile-time features. Are you perhaps interested in trying this out?
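To illustrate what I mean by 'chunk-loading', here is a small sketch in plain Python (not Mojo, and not Basalt's actual dataloader). `chunk_load`, `sample_stream` and `fake_model` are hypothetical names; the point is only that at most one chunk is held in memory at a time.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunk_load(samples: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size samples.

    Only the current chunk is materialised; the rest of the data
    stays in the underlying (lazy) source.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(samples)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def sample_stream(n: int) -> Iterator[int]:
    """Stand-in for reading samples lazily from disk."""
    for i in range(n):
        yield i


def fake_model(batch: List[int]) -> int:
    """Stand-in for a forward pass; just sums the batch."""
    return sum(batch)


outputs = [fake_model(chunk) for chunk in chunk_load(sample_stream(10), chunk_size=4)]
```

The last chunk may be smaller than `chunk_size`; whether to pad or drop it would be a design decision for the real dataloader.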

Thinking long term, I can see cloud storage integration and distributed computing being massively important here as well. I wonder if that was one of the considerations behind torchdata's redesign.

basalt-org / basalt

Redo Dataset and Dataloader #90