activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

[FEATURE] Transform custom dataset to deeplake dataset/database/vectorstore conveniently using DDP #2602

Open ChawDoe opened 11 months ago

ChawDoe commented 11 months ago

Description

Here is my use case: I have 4 GPU nodes on AWS for training (which includes computing tensors). I want to save the pre-computed tensors to Deep Lake (Dataset/database/vector store) so that the next training run can reuse them and save a lot of time. I use accelerate as my distributed parallel framework, so my workflow looks like this:

import deeplake
import torch

# One Deep Lake dataset per process / GPU rank.
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)

for index, data_dict in enumerate(my_pytorch_dataloader):
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
    # loss = net_c_training(a, b)  # the loss is only used in training
    save_dict = {'data_dict': data_dict, 'a': a.detach().cpu().numpy(), 'b': b.detach().cpu().numpy()}
    append_to_deeplake(deeplake_path, save_dict)  # my helper that wraps ds.append
    if index % 100 == 0:
        commit_to_deeplake(deeplake_path)  # my helper that wraps ds.commit

Note that once the Deep Lake dataset has been constructed, I can read the pre-computed tensors from it in the next training run instead of computing them again. The problems include:

  1. I have to assign a different Deep Lake dataset to each process, and afterwards I need to merge them into a single dataset (see the sketch after this list).
  2. I need to design a proper for-loop/parallel workflow for constructing the Deep Lake dataset.
  3. The frequent append and commit calls take a lot of time.
  4. The detach() and .cpu() calls take a lot of time.
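
For problem 1, this is roughly the merge step I would have to write by hand today (a minimal sketch, assuming per-process shards named 'dataset_0' .. 'dataset_3' with identical tensor layouts; 'dataset_merged' is a placeholder path):

import deeplake

shard_paths = ['dataset_{}'.format(i) for i in range(4)]
merged = deeplake.empty('dataset_merged', overwrite=True)

first = deeplake.load(shard_paths[0])
with merged:
    # Recreate the tensor layout of the shards in the merged dataset.
    for name, tensor in first.tensors.items():
        merged.create_tensor(name, htype=tensor.htype)
    # Copy samples shard by shard; this is sequential and slow, which is the point.
    for path in shard_paths:
        shard = deeplake.load(path)
        for i in range(len(shard)):
            merged.append({name: shard[name][i].numpy() for name in shard.tensors})
merged.commit('merged per-process shards')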

So, is there any feature for transforming a custom dataset into a Deep Lake dataset? It would help to have functions that work like this:

ds.distributed_append_gpu_tensor_and_auto_commit(data_tensor)
ds.auto_transform_pytorch_dataset(my_pytorch_dataloader)

Or could you give me a standard workflow to solve this? I don't know which method is best for this scenario. The documentation does not cover this problem, and #2596 points to the same issue.
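
For reference, the closest thing I can imagine with the current API is the deeplake.compute transform (a rough sketch only; net_a_frozen, net_b_frozen, and list_of_data_dicts are placeholders from the snippet above, and I am not sure this maps onto multi-GPU DDP at all):

import deeplake
import torch

@deeplake.compute
def precompute(sample_in, sample_out):
    # sample_in is one input item; sample_out is the corresponding output row.
    with torch.no_grad():
        a = net_a_frozen(sample_in['a'])
        b = net_b_frozen(sample_in['b'])
    sample_out.append({'a': a.detach().cpu().numpy(), 'b': b.detach().cpu().numpy()})
    return sample_out

ds = deeplake.empty('precomputed_ds', overwrite=True)
with ds:
    ds.create_tensor('a')
    ds.create_tensor('b')
precompute().eval(list_of_data_dicts, ds, num_workers=4)  # CPU workers, not GPU ranks
ds.commit('pre-computed a and b')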

Use Cases

Distributed parallel computing and saving to deeplake.

ChawDoe commented 11 months ago

@davidbuniat Thanks. It's really urgent for me.

FayazRahman commented 11 months ago

Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.

ChawDoe commented 11 months ago

> Hey @ChawDoe! Thanks for opening the issue. Let us look into whether any of our current workflows will satisfy your use case and we'll get back to you in a few days.

Thanks! I hope I have explained my use case clearly. Maybe I need functions like these:

ds = deeplake.distributed_dataset('xxx')
ds.distributed_append(xxx)
ds.distributed_commit(xxx)
ds.distributed_append_auto_commit(xxx)

where the auto-commit would find the best memory-time trade-off inside the for loop.
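
For now, the best workaround I can think of is batching appends inside the dataset context manager and committing every N samples (a rough sketch; COMMIT_EVERY and precomputed_samples are placeholders, and picking COMMIT_EVERY by hand is exactly the trade-off I would like the library to handle):

import deeplake

COMMIT_EVERY = 1000
ds = deeplake.load('dataset_0')
for start in range(0, len(precomputed_samples), COMMIT_EVERY):
    # The context manager buffers writes instead of flushing on every append.
    with ds:
        for save_dict in precomputed_samples[start:start + COMMIT_EVERY]:
            ds.append(save_dict)
    ds.commit('checkpoint through sample {}'.format(min(start + COMMIT_EVERY, len(precomputed_samples))))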

ChawDoe commented 11 months ago

@FayazRahman Hi, do you have any updates on this?

FayazRahman commented 11 months ago

Sorry @ChawDoe, I haven't been able to work on this yet. I will update here as soon as I make any headway on it.