ChawDoe opened this issue 1 year ago
@davidbuniat Thanks. It's really urgent for me.
Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.
Thanks! I hope that I have explained my use case clearly. Maybe I need functions like these:

```python
ds = deeplake.distributed_dataset('xxx')
ds.distributed_append(xxx)
ds.distributed_commit(xxx)
ds.distributed_append_auto_commit(xxx)
```

where the auto-commit variant would find the best memory-time trade-off inside the for loop.
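For reference, the best I can come up with today is to gather everything to the main process and let rank 0 do all the deeplake writes. This is only a rough sketch (the dataset path, `compute_features`, and the dataloader are placeholders, and I have not verified the exact deeplake calls):

```python
import deeplake
from accelerate import Accelerator

accelerator = Accelerator()

# Only the main process creates and writes the dataset (path is a placeholder).
ds = None
if accelerator.is_main_process:
    ds = deeplake.empty("s3://my-bucket/precomputed-tensors", overwrite=True)
    ds.create_tensor("feature", dtype="float32")

for batch in dataloader:                 # dataloader prepared by accelerate (placeholder)
    feats = compute_features(batch)      # the expensive per-rank computation (placeholder)
    feats = accelerator.gather(feats)    # collect results from all 4 nodes onto every rank
    if accelerator.is_main_process:
        ds["feature"].extend(feats.cpu().numpy())   # append the gathered batch

accelerator.wait_for_everyone()
if accelerator.is_main_process:
    ds.commit("store precomputed features")
```

This works, but every write goes through a single process, which is exactly the memory-time trade-off I would like `distributed_append` / `distributed_append_auto_commit` to handle for me.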
@FayazRahman Hi, do you have any updates on this?
Sorry @ChawDoe, I haven't been able to work on this yet; I will update here as soon as I make any headway.
Description
Here is my use case: I have 4 GPU nodes on AWS for training (which also computes the tensors). I want to save the pre-computed tensors to deeplake (Dataset/database/vectorstore) so that the next training run can reuse them and save a lot of time. I use accelerate as my distributed parallel framework, so my framework works like this:
Note that after the deeplake dataset is constructed, I can read the tensors I need from deeplake in the next training run instead of computing them again. The problems include:
So is there any feature to transform a custom dataset into a deeplake dataset? If we had a function which works like this:
or could you give me a standard workflow to solve this? I don't know which method is best for this scenario. The documentation does not cover this problem, and #2596 also points to it.
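To make the question concrete, the only multi-node approach I can think of myself is to let every rank write its own shard dataset and then concatenate the shards on rank 0 in a second pass. A rough sketch of that idea (all paths and names are placeholders, and I am not sure this is the intended workflow) looks like this:

```python
import deeplake
from accelerate import Accelerator

accelerator = Accelerator()

# Each rank writes its own shard dataset (paths are placeholders).
shard_path = f"s3://my-bucket/shards/rank_{accelerator.process_index}"
shard = deeplake.empty(shard_path, overwrite=True)
shard.create_tensor("feature", dtype="float32")
with shard:
    for batch in dataloader:              # per-rank dataloader (placeholder)
        feats = compute_features(batch)   # the expensive computation (placeholder)
        shard["feature"].extend(feats.cpu().numpy())

accelerator.wait_for_everyone()

# Rank 0 concatenates the shards into the final dataset in a second pass.
if accelerator.is_main_process:
    final = deeplake.empty("s3://my-bucket/precomputed-tensors", overwrite=True)
    final.create_tensor("feature", dtype="float32")
    with final:
        for rank in range(accelerator.num_processes):
            part = deeplake.load(f"s3://my-bucket/shards/rank_{rank}")
            final["feature"].extend(part["feature"].numpy())
    final.commit("merge per-rank shards")
```

This doubles the I/O and loses a single linear version history, so I would still prefer an officially supported distributed append/commit workflow.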
Use Cases
Distributed parallel computing and saving to deeplake.