Closed GxjGit closed 1 year ago
Hi @GxjGit , thanks for reaching out. This is not expected and contrary to some third party benchmarks for Deep Lake.
I'm tagging @AbhinavTuli to look into your example benchmark and follow up with troubleshooting!
Thanks a lot for bringing this to our attention.
@mikayelh, thank you for your reply. I look forward to your results. If any information would help troubleshoot the problem, I am happy to provide it here.
Thanks for raising the issue. @GxjGit, can you please share your dataloader code and tell us which cloud provider you have been using to store the data, including for webdataset?
Thanks @davidbuniat. (1). The use of native dataloader is: https://github.com/pytorch/examples/blob/main/imagenet/main.py
Running cmd: python main.py -a resnet18 --workers 16 --b 256 dir
dir is the path of imagenet dataset,
which can be downloaded from https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
(2). The deeplake dataloader usage follows the tutorial: https://datasets.activeloop.ai/docs/ml/datasets/imagenet-dataset/
import deeplake
from torchvision import transforms

# ds = deeplake.load('hub://activeloop/cifar10-train')
ds = deeplake.load('/home/data/dataset_cifar/')
keys = ds.tensors.keys()  # dict_keys(['images', 'labels'])
label_num = ds.labels[0].numpy()  # array([6], dtype=uint32)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
tform = transforms.Compose([
    transforms.ToPILImage(),  # Must convert to PIL image for subsequent operations to run
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomRotation(20),  # Image augmentation
    transforms.ToTensor(),  # Must convert to pytorch tensor for subsequent operations to run
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)),  # Some images are grayscale, so we need to add channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_loader = ds.pytorch(num_workers=16, batch_size=256, transform={
    'images': tform, 'labels': None}, shuffle=True)
Replace train_loader with the original train_loader in https://github.com/pytorch/examples/blob/main/imagenet/main.py
(3) All my datasets are stored in a local object store called chubaofs. chubaofs may not be available to you, so you can test this code on local storage. I have also tested deeplake on local storage and got similar performance.
@mikayelh Hello, have you looked into my example? Is there any progress on troubleshooting?
hi @GxjGit, I know @AbhinavTuli has been working on this. @AbhinavTuli can you provide us with an interim update? cc: @tatevikh
Thanks for your patience @GxjGit, we are still looking into the issue. The root cause of the slowdown has not been identified yet, but we have some insights that we are looking into on our side.
Thanks for your update @davidbuniat, the following conclusions are different from yours. (1) Cifar 10 roughly ~580 on my side (with transform), with new experimental dataloader. (2) In the new experimental dataloader streaming from hub remote storage (hub://activeloop/imagenet-train) getting 350 images/s (both with transform and without transform).
without transform:
Hello, I'm also interested in the high performance of Deep Lake shown in the experiments in the paper Deep Lake: a Lakehouse for Deep Learning. Is there experiment code in this repository that I can use to reproduce experiment C as shown in the figure in the paper? Thanks very much!
hey @kisseternity , i'm not sure it's public, tagging @levongh to share more on this subject. There's also a wide array of benchmarks in this third-party review by Yale researchers, but I couldn't find the code in arxiv. https://arxiv.org/abs/2209.13705 (most of these are from there).
@GxjGit currently we are working to identify the root cause of the problem
hey @kisseternity, unfortunately we do not have an open benchmarks repository right now, but we will come up with one in the near future.
As @mikayelh already mentioned the only open benchmarks are done by Yale researchers and the whole code base is here https://github.com/smartnets/dataloader-benchmarks
I'm also facing poor performance when using Deep Lake, compared to Webdataset (which is really fast). My use-case is around MP3 audio files. I'll try to create a reproducible setup for this.
Yes, a reproducible script would help us a lot, and we will make sure that each use case is optimized. Thanks for your help :)
@GxjGit Can you please also provide your system information, so that we can identify and resolve the performance issues you are seeing?
Hi, @levongh The information is as follows:
Hi, @levongh can this problem be fixed in the near future? Whether we adopt this solution depends on your plan for resolving this problem.
Hey @GxjGit! We have a solution for the slowdown, it will be included in the next release before the end of this week.
Hey @GxjGit! We just released deeplake==3.0.12 which greatly improves performance when dealing with jpeg and png data. My numbers while streaming imagenet data from activeloop storage(hub://activeloop/imagenet-train) to a c5.4xlarge ec2 instance in the same region (us-east-1) were close to 1700-1800 images per second using the experimental dataloader. One thing to keep in mind is that now the data sent to transform for jpeg and png tensors will be a PIL image, so ToPILImage transform should be removed.
Looking forward to hearing your results after the update!
Hi @AbhinavTuli, thank you for the update. I have upgraded the deeplake version to 3.0.11, but I encountered an error:
My code is as follows:
import deeplake
from deeplake.experimental import dataloader
from torchvision import transforms

ds = deeplake.load('/home/code/deeplake/dataset_imaget')
keys = ds.tensors.keys()  # dict_keys(['images', 'labels'])
label_num = ds.labels[0].numpy()  # array([6], dtype=uint32)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
tform = transforms.Compose([
    # transforms.ToPILImage(),  # Must convert to PIL image for subsequent operations to run
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomRotation(20),  # Image augmentation
    transforms.ToTensor(),  # Must convert to pytorch tensor for subsequent operations to run
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)),  # Some images are grayscale, so we need to add channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_loader = dataloader(ds)\
    .transform(tform)\
    .batch(256)\
    .shuffle()\
    .pytorch(tensors=['images', 'labels'], num_workers=8)
Hey, thanks for the feedback. This should be a quick fix. There's an issue with our shuffle buffer + handling the new implementation of jpeg+png. We should have a minor release for this today.
In the meantime, you can modify the existing code to the following. Two changes were made: the transform syntax was fixed and shuffle was removed.
train_loader = dataloader(ds)\
    .transform({"images": tform, "labels": None})\
    .batch(256)\
    .pytorch(num_workers=8)
Will ping you shortly once the new release with the shuffle fix is out. Thanks for your patience.
Also, please use 3.0.12, not 3.0.11
@AbhinavTuli, OK, I have changed to version 3.0.12 and modified my code as well. The code now seems to run successfully, but I encountered another error:
@GxjGit can you please confirm if you are running the training on "imagenet-train", is the transform function the same that you provided before?
Hi @levongh,
I ran the training on my local dataset, which was built following https://docs.activeloop.ai/tutorials/creating-datasets/creating-object-detection-datasets . I will check whether something is wrong with my dataset.
The transform function is the same one I provided before (with ToPILImage removed):
tform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomRotation(20),  # Image augmentation
    transforms.ToTensor(),  # Must convert to pytorch tensor for subsequent operations to run
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)),  # Some images are grayscale, so we need to add channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
One guess I have is that perhaps one of your images is RGBA and has 4 channels, which makes transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)) return a tensor of shape (0, 224, 224), because int(3/4) is 0. But that should raise an error in the normalize function call itself, so we're not sure yet.
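To make the guess above concrete, here is a minimal sketch in plain Python (no torch required) of the arithmetic inside that Lambda. The helper name channels_after_repeat is hypothetical; it only mirrors the expression int(3 / x.shape[0]) and the fact that repeat multiplies the first dimension, and it does not touch any image data.

```python
def channels_after_repeat(num_channels: int) -> int:
    """Channel count after x.repeat(int(3 / C), 1, 1) on a (C, H, W) tensor."""
    factor = int(3 / num_channels)  # same expression as in the Lambda transform
    return num_channels * factor    # repeat multiplies the size of dim 0 by factor


# 1 = grayscale, 3 = RGB, 4 = RGBA
for c in (1, 3, 4):
    print(c, "->", channels_after_repeat(c))
```

Under these assumptions, grayscale and RGB inputs both end up with 3 channels, while a 4-channel input ends up with 0, which matches the empty tensor described in the guess.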
@GxjGit we have released deeplake==3.0.13 which fixes the issue with shuffling. We're yet to reproduce the error you reported where you ended up with a tensor with 0 channels, do let us know if you find any more information to help us reproduce it.
@davidbuniat I suspected there was something wrong with my local dataset, so I rebuilt it yesterday, but it ended up with this error:
I am not sure whether this error also occurred while building the previous dataset. The files in imagenet all seem to be in jpeg format, so they should all be RGB, not RGBA, but I am not sure about it. I built this dataset last night with version 3.0.5. I am updating to 3.0.14 and trying to build it again. The script is as follows:
import deeplake
import numpy as np
import os

base_path = "/home//imagenet-pytorch/"
ds = deeplake.empty('/home/dataset_imaget')  # Create the dataset locally
base_path1 = "/home/imagenet-pytorch/train"

# Read "<relative image path> <integer label>" pairs from the label file
images = []
labels = []
with open(os.path.join(base_path, 'meta/train_labeled.txt'), 'r') as fin:
    for line in fin:
        image, label = line.strip().split(' ', 1)
        images.append(image)
        labels.append(int(label))

with ds:
    ds.create_tensor('images', htype='image', sample_compression='jpeg')
    ds.create_tensor('labels', htype='class_label', class_names=labels)
    # Iterate over images and labels together (the original indexed labels with an undefined i)
    for image, label in zip(images, labels):
        # Append data to tensors
        ds.append({'images': deeplake.read(os.path.join(base_path1, image)),
                   'labels': label})
print("Build dataset success.")
@GxjGit It seems you might have one or more png files that have been renamed to jpeg. One simple way to check for or skip these files is to do this:
for image in images:
    image_sample = deeplake.read(os.path.join(base_path1, image))
    if image_sample.shape[2] == 4:
        compression = image_sample.compression
        print(f"Skipping {image} with compression {compression} because it has 4 channels.")
        continue
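As a complementary check that does not depend on deeplake at all, renamed files can also be spotted from their leading bytes. The sketch below uses only the standard library; the function names sniff_image_format and find_mislabeled are hypothetical, and it only recognizes the standard JPEG and PNG signatures, so treat it as a rough pre-filter rather than a full validator.

```python
from pathlib import Path

JPEG_MAGIC = b"\xff\xd8\xff"          # start-of-image marker of a JPEG file
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"      # fixed 8-byte PNG signature


def sniff_image_format(path):
    """Guess the real format from the file signature, ignoring the extension."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(JPEG_MAGIC):
        return "jpeg"
    if header.startswith(PNG_MAGIC):
        return "png"
    return "unknown"


def find_mislabeled(paths):
    """Return (path, detected_format) for files named like JPEG whose content is not JPEG."""
    mislabeled = []
    for p in map(Path, paths):
        if p.suffix.lower() in {".jpg", ".jpeg"}:
            detected = sniff_image_format(p)
            if detected != "jpeg":
                mislabeled.append((str(p), detected))
    return mislabeled
```

Running find_mislabeled over the image paths before building the dataset would list the files to skip or convert; whether to skip or re-encode them is a separate choice.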
@AbhinavTuli Thanks a lot. With your tips, we rebuilt the deeplake dataset of imagenet and it ran successfully. Compared to the previous version, the speed rose from 350 fps to 1230 fps. It performs well when it only loads and transforms the data, but when I integrate it into the training process, its performance drops a lot. Here is the comparison with pytorch and webdataset:
deeplake version: 3.0.13 batch_size = 256. num_worker: 16
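The gap described above is between loading-only throughput and end-to-end throughput. A minimal, framework-agnostic sketch of how both can be measured with the same loop follows. The function measure_throughput and its parameters are hypothetical and are not part of deeplake, PyTorch, or webdataset; step_fn stands in for the per-batch training work, and count_fn says how many samples a batch holds.

```python
import time


def measure_throughput(loader, step_fn=None, max_batches=100, count_fn=len):
    """Return (images_seen, images_per_second) over up to max_batches batches.

    loader   -- any iterable of batches
    step_fn  -- optional callable run on every batch (e.g. a training step)
    count_fn -- returns the number of samples in a batch (len by default)
    """
    images = 0
    start = time.perf_counter()
    for i, batch in enumerate(loader):
        if i >= max_batches:
            break
        if step_fn is not None:
            step_fn(batch)          # included in the timing on purpose
        images += count_fn(batch)
    elapsed = time.perf_counter() - start
    rate = images / elapsed if elapsed > 0 else float("inf")
    return images, rate
```

Calling it once without step_fn and once with the training step gives the two numbers being compared. For loaders that yield dict batches, something like count_fn=lambda b: len(b["images"]) would be needed, since len of a dict counts keys rather than samples.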
@davidbuniat I reported my deeplake test results in our department last week. Everyone is very interested in it, because we saw the potential of deeplake in data reading and processing, and we are integrating it into our project. But we are still concerned about its performance when integrated into the end-to-end training process, as mentioned above. Have you reproduced the issue, and do you plan to fix it?
Hi @GxjGit, sorry for the delayed response. We are currently working on a couple of improvements and will have preliminary results by tomorrow. We will keep you updated.
We examined a few avenues of improvement but they did not materialize into meaningful speed increases. We're going to perform more detailed profiling within the next two weeks and will get back to you with the results as soon as they're done.
@levongh ok, Thanks for your reply.
Closing for now. Will reopen once we improve the benchmarks.
@davidbuniat @levongh @tatevikh Hi, we are still following your project. I updated deeplake to version 3.2.13 and found a few problems. (1) Have you removed deeplake.experimental from the new version? (2) We found that the performance in 3.2.13 is worse than in 3.0.13 (with deeplake.experimental); it seems to be back to the original level (284 images/s -> 160 images/s).
Please help me check whether the above performance data is correct. And when will you improve the benchmarks?
Hey @GxjGit
Re (1) - deeplake.experimental was moved to the enterprise API, and query and the dataloader are accessed via ds.query() and ds.dataloader().pytorch()...
Re (2) - Which API did you use to measure the performance? If you used ds.pytorch(), that's using the pure Python dataloader. To access the C++ dataloader (previously the experimental loader), you'll want to use ds.dataloader().pytorch()...
The enterprise API is only available to users on Growth Plan or higher, which comes with a 14-day free trial. The easiest way to try it out is to sign up in our APP, and install deeplake via pip install deeplake[enterprise].
Hi @istranic
I want to try your enterprise API in my local env to evaluate performance. How can I get access to it? Also, does deeplake surpass the pytorch native dataloader in end-to-end training of resnet50? Can you show me your test data in a form like the following?
The env of our test:
num_worker: 16
batch_size: 256
dataset: imagenet
storage: object storage / Local storage
GPU: 1 V100
cpu: 7 Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Hey @GxjGit
The enterprise API is only available to users on Growth Plan or higher, which comes with a 14-day free trial. The easiest way to try it out is to sign up in our APP, and install deeplake via pip install deeplake[enterprise].
We will also run the benchmark in red above over the next week, and I expect our enterprise dataloader to be similar to the pytorch dataloader in terms of performance for a local dataset. Please note that Deep Lake is not explicitly optimized to be faster for local data. Most of our optimizations are for streaming applications, so that's where you'd expect to see significant improvements over non-Deep Lake approaches.
Hi @istranic Thanks for the reply. I am looking forward to your test results.
Compared to the pytorch native dataloader, I find that the deeplake dataloader reaches 165 images/s, while the native loader reaches 819.15 images/s and the webdataset loader reaches 1404 images/s. The deeplake dataloader is roughly 5 times slower than the native loader and about 8.5 times slower than webdataset.
The information of our test:
num_worker: 16
batch_size: 256
dataset: imagenet
storage: object storage / Local storage
Our local dataset was built following: https://docs.activeloop.ai/tutorials/creating-datasets/creating-object-detection-datasets
Training test script: https://github.com/pytorch/examples/blob/main/imagenet/main.py
We also tested the cifar10 dataset and drew a similar conclusion. Does this conclusion meet expectations? If not, what should it be like? Is there a benchmark example?
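For reference, the relative speeds implied by the three rates quoted in this post can be worked out directly. This is only arithmetic on the numbers reported above; the helper relative_speed is a hypothetical name.

```python
# images/s as reported in the post above
rates = {"deeplake": 165.0, "pytorch native": 819.15, "webdataset": 1404.0}


def relative_speed(rate: float, baseline: float) -> float:
    """How many times faster `rate` is than `baseline`, rounded to 2 decimals."""
    return round(rate / baseline, 2)


for name, rate in rates.items():
    print(f"{name}: {rate:.2f} images/s, {relative_speed(rate, rates['deeplake'])}x the deeplake rate")
```

With these figures the native loader comes out at about 4.96x and webdataset at about 8.51x the deeplake rate.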