basalt-org / basalt

A Machine Learning framework from scratch in Pure Mojo 🔥
https://basalt-docs.vercel.app/

Redo Dataset and Dataloader #90

Open Benny-Nottonson opened 5 months ago

josiahls commented 4 months ago

I don't want to lick the cookie, but one of the things I'm excited about in Mojo is the type safety / memory management.

What are everyone's thoughts on torchdata? I experimented with building an RL framework on top of it (fastrl).

Pros

Cons

Some things I'm seeing that would be needed from mojo:

Major blockers

Minor needs

I'm curious what other frameworks / libs people have used, liked, disliked.

StijnWoestenborghs commented 4 months ago

Hi @josiahls, I think data pipelining is a complex but important topic that Basalt is unlikely to focus on in the near future. From what I've read, torchdata suffers from a lack of lower-level control over things like multiprocessing, and even though that should be possible in Mojo, other than algorithm.parallelize it doesn't have something like a threading API (yet!).

Type safety and safely passing references to the data without copies will and must be possible. As a first rework of the current dataloader (which simply loads all data into memory), I think the goal should be an ultra-simple pipeline that 'chunk-loads' the data into memory and passes it to the model chunk by chunk. Additionally, Mojo might have an edge here with its convenient, easy-to-use compile-time features. Are you perhaps interested in trying this out?
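
To make the 'chunk-load' idea concrete, here is a rough sketch in Python rather than Mojo, and not Basalt's actual API. The names `chunked_batches` and `fake_load_range` are hypothetical, and the loader callback stands in for whatever would read a slice of the dataset from disk. Only one chunk is held in memory at a time, and batches are cut from within that chunk.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# A sample is (features, label); this shape is an assumption for illustration.
Sample = Tuple[List[float], int]


def chunked_batches(
    num_samples: int,
    load_range: Callable[[int, int], Sequence[Sample]],
    chunk_size: int,
    batch_size: int,
) -> Iterator[Sequence[Sample]]:
    """Yield batches while keeping at most one chunk of samples in memory."""
    for chunk_start in range(0, num_samples, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_samples)
        # Only samples [chunk_start, chunk_end) are loaded at this point.
        chunk = load_range(chunk_start, chunk_end)
        # Simplification: batches never span two chunks, so the last batch
        # of each chunk may be smaller than batch_size.
        for batch_start in range(0, len(chunk), batch_size):
            yield chunk[batch_start:batch_start + batch_size]


def fake_load_range(start: int, end: int) -> List[Sample]:
    """Hypothetical stand-in for reading samples [start, end) from storage."""
    return [([float(i)], i) for i in range(start, end)]


batches = list(chunked_batches(10, fake_load_range, chunk_size=4, batch_size=3))
print([len(b) for b in batches])  # [3, 1, 3, 1, 2]
```

Choosing a chunk size that is a multiple of the batch size would avoid the short batches at chunk boundaries; carrying leftover samples over into the next chunk is another option, at the cost of a little extra bookkeeping.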

Thinking long term, I can see cloud storage integration and distributed computing being massively important here as well. I wonder if that was one of the factors in torchdata's redesign evaluation.