DagsHub / streaming-client

MIT License

Added Dataloaders #27

Closed jinensetpal closed 1 year ago

jinensetpal commented 1 year ago

Frameworks:

Features:

TODO:

deanp70 commented 1 year ago

@jinensetpal chiming in here since I'm trying to use your branch now - getting an error I wasn't getting last week when I tried:

torch_dl = train_set.all().as_dataloader(flavor='torch')

Colab returns:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-25-5ff08eff9061> in <cell line: 1>()
----> 1 torch_dl = train_set.all().as_dataloader(flavor='torch')

/usr/local/lib/python3.10/dist-packages/dagshub/data_engine/client/models.py in as_dataloader(self, flavor, **kwargs)
    135         elif flavor == 'torch':
    136             dataset_kwargs = set(list(inspect.signature(PyTorchDataset).parameters.keys())[1:])
--> 137             return DataLoader(self.as_dataset(flavor, **dict(map(lambda key: (key, kwargs[key]), set(kwargs.keys()).intersection(dataset_kwargs)))),
    138                               **dict(map(lambda key: (key, kwargs[key]), kwargs.keys() - dataset_kwargs)))
    139         elif isinstance(flavor, tensorflow.data.Dataset): return TensorFlowDataLoader(flavor, **kwargs)

NameError: name 'DataLoader' is not defined

EDIT: I see a couple commits ago you removed a snippet that seems necessary for this to work:

import inspect
import tensorflow
from torch.utils.data import DataLoader
from .loaders import PyTorchDataset, TensorFlowDataLoader, TensorFlowDataset

deanp70 commented 1 year ago

After fixing the above, I'm also getting:

WARNING:dagshub.data_engine.client.loaders:`tensorizer` set to 'auto'; guessing the datatype
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-5ff08eff9061> in <cell line: 1>()
----> 1 torch_dl = train_set.all().as_dataloader(flavor='torch')

5 frames
/usr/local/lib/python3.10/dist-packages/dagshub/data_engine/client/loaders.py in _download(self, entry)
    151 
    152         if not (self.savedir / entry.path).is_file():
--> 153             data = self.repo.get_file(f'{self.datasource_root}/{entry.path}')
    154             with open(self.savedir / entry.path, 'wb') as file:
    155                 file.write(data)

AttributeError: 'DataSetDownloader' object has no attribute 'datasource_root'

Edit: Solved it by passing datasource_root from the PyTorchDataset object down to the DataSetDownloader, so that _download can read self.datasource_root.
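
For anyone hitting the same AttributeError, here is a minimal sketch of that kind of fix. The constructor signatures and the remote_path helper are assumptions made for illustration, not the actual code in loaders.py; only the attribute name datasource_root and the path format come from the traceback above.

```python
from pathlib import Path


class DataSetDownloader:
    """Hypothetical stand-in for the downloader class in loaders.py."""

    def __init__(self, repo, savedir, datasource_root):
        self.repo = repo
        self.savedir = Path(savedir)
        # Store the root on the downloader itself, so methods that build
        # remote paths no longer hit a missing attribute.
        self.datasource_root = datasource_root

    def remote_path(self, entry_path):
        # Same path format as the f-string shown in the traceback.
        return f'{self.datasource_root}/{entry_path}'


class PyTorchDataset:
    """Hypothetical stand-in for the dataset; the real class wraps more state."""

    def __init__(self, repo, savedir, datasource_root):
        self.datasource_root = datasource_root
        # Pass the root down explicitly when constructing the downloader.
        self.downloader = DataSetDownloader(repo, savedir, datasource_root)
```

The point is only that the downloader receives the value at construction time instead of expecting to find it on itself later.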

jinensetpal commented 1 year ago

@deanp70 thanks, patched!!

I'll set up some unit tests to ensure updates don't break functionality elsewhere.

deanp70 commented 1 year ago

@jinensetpal I'm trying to get my SavtaDepth to work, and PyTorch is complaining that it expects a float tensor but is getting a byte tensor. I found this thread and will fix it in my version, but it's probably good to solve this in your code too:

https://stackoverflow.com/questions/64635630/pytorch-runtimeerror-expected-scalar-type-float-but-found-byte

jinensetpal commented 1 year ago

@kbolashev done! I think it's ready for a final review!

kbolashev commented 1 year ago

@jinensetpal OK, we actually have a problem: https://github.com/DagsHub/streaming-client/blob/2f28ab06b79fc332a35e640a263392a298b086dc/dagshub/data_engine/client/loaders.py#L28

Since the DataLoaders inherit from the PyTorch dataloaders, we now have a hard dependency on PyTorch. Can you see if you can either:

1) Delete the inheritance and see if it still works
2) If 1 doesn't work, then hide the class initialization behind a module guard, maybe? But then you also need to make sure that the TensorFlow dataloader doesn't break

I merged the PR in and then it broke on the first script I ran, because I didn't have PyTorch installed.
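
A rough sketch of what a module guard could look like, assuming the torch-dependent classes live in their own module that is imported only inside the 'torch' branch. The lazy_import helper and this simplified as_dataloader signature are hypothetical and are not the client's actual API; importlib.import_module and torch.utils.data.DataLoader are the only real library calls used.

```python
import importlib


def lazy_import(module_name):
    """Import a module on first use, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this flavor but is not installed"
        ) from exc


def as_dataloader(dataset, flavor, **kwargs):
    """Hypothetical dispatcher: nothing from torch is imported at module load."""
    if flavor == 'torch':
        # Only users who ask for the torch flavor need PyTorch installed.
        torch_data = lazy_import('torch.utils.data')
        return torch_data.DataLoader(dataset, **kwargs)
    raise ValueError(f'unsupported flavor: {flavor!r}')
```

With this shape, importing the client never touches PyTorch, so a script that uses neither flavor keeps working without it. The same pattern would apply to the tensorflow import.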