activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Poor performance #1931

Closed GxjGit closed 1 year ago

GxjGit commented 1 year ago

Compared to the pytorch native dataloader, I find that the deeplake dataloader reaches 165 images/s, while the native loader reaches up to 819.15 images/s and the webdataset loader reaches 1404 images/s. The deeplake dataloader is more than 5 times slower.

The information of our test:
num_workers: 16
batch_size: 256
dataset: imagenet
storage: object storage / local storage

Our local dataset building follows: https://docs.activeloop.ai/tutorials/creating-datasets/creating-object-detection-datasets

[screenshot]

Training test script: https://github.com/pytorch/examples/blob/main/imagenet/main.py

We also tested the cifar10 dataset and reached a similar conclusion. Does this result meet expectations? If not, what should it be? Is there a benchmark example?
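For reference, a minimal sketch of the kind of timing loop behind the images/s numbers (the helper name measure_throughput and the warm-up step are illustrative; it assumes the loader yields (images, labels) batches as in main.py):

import time

def measure_throughput(loader, num_batches=50):
    # Time a fixed number of batches and report images/s.
    it = iter(loader)
    next(it)  # warm-up batch, excluded from timing
    start = time.time()
    total = 0
    for _ in range(num_batches):
        images, _labels = next(it)
        total += images.shape[0]
    elapsed = time.time() - start
    print(f"{total / elapsed:.1f} images/s")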

mikayelh commented 1 year ago

Hi @GxjGit, thanks for reaching out. This is not expected and is contrary to some third-party benchmarks for Deep Lake.

I'm tagging @AbhinavTuli to look into your example benchmark and follow up with troubleshooting!

Thanks a lot for bringing this to our attention.

GxjGit commented 1 year ago

@mikayelh, thank you for your reply. Looking forward to your results; if there is any information that would help troubleshoot the problem, I am happy to provide it here.

davidbuniat commented 1 year ago

Thanks for raising the issue. @GxjGit can you please share your dataloader code, and which cloud provider you have been using to store the data (including for webdataset)?

GxjGit commented 1 year ago

> Thanks for raising the issue. @GxjGit can you please share your dataloader code, and which cloud provider you have been using to store the data (including for webdataset)?

Thanks @davidbuniat.

(1) The native dataloader usage is: https://github.com/pytorch/examples/blob/main/imagenet/main.py

[screenshot]

Running cmd: python main.py -a resnet18 --workers 16 --b 256 dir, where dir is the path of the imagenet dataset, which can be downloaded from https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar

(2) The deeplake dataloader usage follows this tutorial: https://datasets.activeloop.ai/docs/ml/datasets/imagenet-dataset/

import deeplake
from torchvision import transforms

#ds = deeplake.load('hub://activeloop/cifar10-train')
ds = deeplake.load('/home/data/dataset_cifar/')
keys = ds.tensors.keys()    # dict_keys(['images', 'labels'])

label_num = ds.labels[0].numpy() # array([6], dtype=uint32)

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
tform = transforms.Compose([
    transforms.ToPILImage(), # Must convert to PIL image for subsequent operations to run
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomRotation(20), # Image augmentation
    transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)), # Some images are grayscale, so we need to add channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),

])
train_loader = ds.pytorch(num_workers=16, batch_size=256, transform={
                    'images': tform, 'labels': None}, shuffle=True)

Replace the original train_loader in https://github.com/pytorch/examples/blob/main/imagenet/main.py with this train_loader.

(3) All my data is stored in a local object store called chubaofs. chubaofs may not be available to you, so you can test this code on local storage. I have tested deeplake on local storage and get similar performance.

GxjGit commented 1 year ago

@mikayelh Hello, have you looked into my example? Any progress on troubleshooting?

mikayelh commented 1 year ago

hi @GxjGit, I know @AbhinavTuli has been working on this. @AbhinavTuli can you provide us with an interim update? cc: @tatevikh

davidbuniat commented 1 year ago

Thanks for your patience @GxjGit, we are still looking into the issue. The root cause of the slowdown has not been identified yet, but there are some leads that we are pursuing on our side.

GxjGit commented 1 year ago

Thanks for your update @davidbuniat. The following results are different from yours: (1) cifar10 reaches roughly ~580 images/s on my side (with transform), using the new experimental dataloader. (2) With the new experimental dataloader streaming from hub remote storage (hub://activeloop/imagenet-train), I get 350 images/s (both with and without transform).

[screenshot]

without transform:

[screenshot]

kisseternity commented 1 year ago

Hello, I'm also interested in the high performance of Deep Lake shown in the experiments of the paper Deep Lake: a Lakehouse for Deep Learning. I wonder if there is experiment code in this repository so that I can reproduce experiment C, shown in the figure from the paper? Thanks very much!

[screenshot]

mikayelh commented 1 year ago

hey @kisseternity, I'm not sure it's public; tagging @levongh to share more on this subject. There's also a wide array of benchmarks in this third-party review by Yale researchers, but I couldn't find the code on arxiv: https://arxiv.org/abs/2209.13705 (most of these numbers are from there).

levongh commented 1 year ago

@GxjGit we are currently working to identify the root cause of the problem.

levongh commented 1 year ago

> Hello, I'm also interested in the high performance of Deep Lake shown in the experiments of the paper Deep Lake: a Lakehouse for Deep Learning. I wonder if there is experiment code in this repository so that I can reproduce experiment C, shown in the figure from the paper? Thanks very much! [screenshot]

hey @kisseternity, unfortunately we do not have an open benchmarks repository at the moment; we will publish one in the near future.

As @mikayelh already mentioned, the only open benchmarks were done by Yale researchers, and the whole code base is here: https://github.com/smartnets/dataloader-benchmarks

thecooltechguy commented 1 year ago

I'm also facing poor performance when using Deep Lake, compared to Webdataset (which is really fast). My use-case is around MP3 audio files. I'll try to create a reproducible setup for this.

davidbuniat commented 1 year ago

> I'm also facing poor performance when using Deep Lake, compared to Webdataset (which is really fast). My use-case is around MP3 audio files. I'll try to create a reproducible setup for this.

Yes, a reproducible script would help us a lot, and we will make sure that each use case is optimized. Thanks for your help :)

levongh commented 1 year ago

@GxjGit Can you please also provide your system information, so we can identify and resolve the performance issues that you are seeing?

GxjGit commented 1 year ago

> @GxjGit Can you please also provide your system information, so we can identify and resolve the performance issues that you are seeing?

Hi @levongh, the information is as follows:

[screenshot]

GxjGit commented 1 year ago

Hi @levongh, can this problem be fixed in the near future? Whether we adopt this solution depends on the plan for solving it.

AbhinavTuli commented 1 year ago

Hey @GxjGit! We have a solution for the slowdown, it will be included in the next release before the end of this week.

AbhinavTuli commented 1 year ago

Hey @GxjGit! We just released deeplake==3.0.12, which greatly improves performance when dealing with jpeg and png data. My numbers while streaming imagenet data from activeloop storage (hub://activeloop/imagenet-train) to a c5.4xlarge ec2 instance in the same region (us-east-1) were close to 1700-1800 images per second using the experimental dataloader. One thing to keep in mind is that the data sent to the transform for jpeg and png tensors will now be a PIL image, so the ToPILImage transform should be removed.

Looking forward to hearing your results after the update!
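For example, under this change the transform posted earlier in this thread would become (a sketch; only the ToPILImage line goes away):

from torchvision import transforms

tform = transforms.Compose([
    # No ToPILImage needed: jpeg/png samples already arrive as PIL images in 3.0.12+
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)), # grayscale -> 3 channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])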

GxjGit commented 1 year ago

Hi @AbhinavTuli, thank you for the update. I have upgraded deeplake to version 3.0.11, but encountered an error:

[screenshot]

My code is as follows:

import deeplake
from torchvision import transforms
from deeplake.experimental import dataloader

ds = deeplake.load('/home/code/deeplake/dataset_imaget')
keys = ds.tensors.keys()    # dict_keys(['images', 'labels'])
label_num = ds.labels[0].numpy() # array([6], dtype=uint32)
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])
tform = transforms.Compose([
    #transforms.ToPILImage(), # Must convert to PIL image for subsequent operations to run
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    # transforms.RandomRotation(20), # Image augmentation
    transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
    transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)), # Some images are grayscale, so we need to add channels
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

train_loader = dataloader(ds)\
            .transform(tform)\
            .batch(256)\
            .shuffle()\
            .pytorch(tensors=['images', 'labels'], num_workers=8)

AbhinavTuli commented 1 year ago

Hey, thanks for the feedback. This should be a quick fix. There's an issue with our shuffle buffer + handling the new implementation of jpeg+png. We should have a minor release for this today.

In the meantime, you can modify the existing code to the following (two changes were made: the transform syntax was fixed and shuffle was removed):


train_loader = dataloader(ds)\
            .transform({"images":tform, "labels": None})\
            .batch(256)\
            .pytorch(num_workers = 8)

Will ping you shortly once the new release with the shuffle fix is out. Thanks for your patience.

AbhinavTuli commented 1 year ago

Also, please use 3.0.12, not 3.0.11

GxjGit commented 1 year ago

> Also, please use 3.0.12, not 3.0.11

@AbhinavTuli, OK, I have changed to version 3.0.12 and modified my code as well. The code seemed to run successfully, but another error was encountered:

[screenshot]

levongh commented 1 year ago

@GxjGit can you please confirm whether you are running the training on "imagenet-train", and whether the transform function is the same as the one you provided before?

GxjGit commented 1 year ago

Hi @levongh,

  1. I ran the training on my local dataset, built following https://docs.activeloop.ai/tutorials/creating-datasets/creating-object-detection-datasets. I will check if there is something wrong with my dataset.

  2. The transform function is the same as the one I provided before, with ToPILImage removed:

    tform = transforms.Compose([
        #transforms.ToPILImage(), # Removed: jpeg/png samples already arrive as PIL images in 3.0.12+
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        # transforms.RandomRotation(20), # Image augmentation
        transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
        transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)), # Some images are grayscale, so we need to add channels
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

AbhinavTuli commented 1 year ago

One guess I have is that perhaps one of your images is RGBA and has 4 channels, which makes transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)) return a tensor of shape (0, 224, 224), since int(3/4) == 0. But that should raise an error in the normalize call itself, so we're not sure yet.
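A standalone sketch of that suspicion (illustrative, not from your code):

import torch

x = torch.rand(4, 224, 224)     # what an RGBA image looks like after ToTensor
factor = int(3 / x.shape[0])    # int(3/4) == 0
y = x.repeat(factor, 1, 1)
print(y.shape)                  # torch.Size([0, 224, 224]) -- zero channels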

AbhinavTuli commented 1 year ago

@GxjGit we have released deeplake==3.0.13, which fixes the issue with shuffling. We're yet to reproduce the error you reported where you ended up with a tensor with 0 channels; do let us know if you find any more information to help us reproduce it.
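With the fix in, .shuffle() should be safe to add back; a sketch, assuming the same builder chain used earlier in this thread (ds and tform as in your snippets):

from deeplake.experimental import dataloader

train_loader = dataloader(ds)\
            .transform({"images": tform, "labels": None})\
            .batch(256)\
            .shuffle()\
            .pytorch(num_workers=8)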

GxjGit commented 1 year ago

> One guess I have is that perhaps one of your images is RGBA and has 4 channels, which makes transforms.Lambda(lambda x: x.repeat(int(3/x.shape[0]), 1, 1)) return a tensor of shape (0, 224, 224), since int(3/4) == 0. But that should raise an error in the normalize call itself, so we're not sure yet.

@davidbuniat I suspect there is something wrong with my local dataset, so I rebuilt it yesterday, but it ended up with this error:

[screenshot]

[screenshot]

I am not sure if this error also occurred while building the previous dataset. The files in imagenet all seem to be jpeg format, so they should all be RGB, not RGBA, but I am not sure about it. I built this dataset last night with version 3.0.5; I am updating to 3.0.14 and trying to build it again. The script is as follows:


import deeplake
import numpy as np
import os

base_path = "/home//imagenet-pytorch/"
base_path1 = "/home/imagenet-pytorch/train"
ds = deeplake.empty('/home/dataset_imaget') # Create the dataset locally
images = []
labels = []
with open(os.path.join(base_path, 'meta/train_labeled.txt'), 'r') as fin:
    for line in fin:
        image, label = line.strip().split(' ', 1)
        images.append(image)
        labels.append(int(label))
with ds:
    ds.create_tensor('images', htype='image', sample_compression='jpeg')
    ds.create_tensor('labels', htype='class_label') # class_names omitted; numeric labels are appended directly
    for image, label in zip(images, labels):
        # Append data to tensors
        ds.append({'images': deeplake.read(os.path.join(base_path1, image)),
                   'labels': label})
print("build dataset success.")

AbhinavTuli commented 1 year ago

@GxjGit Seems like you might have 1 or more png files that have been renamed to jpeg. One simple way to check for and skip these files is the following:

# Using `images` and `base_path1` from your build script above:
for image in images:
    image_sample = deeplake.read(os.path.join(base_path1, image))
    if image_sample.shape[2] == 4:
        compression = image_sample.compression
        print(f"Skipping {image} with compression {compression} because it has 4 channels.")
        continue

GxjGit commented 1 year ago

@AbhinavTuli Thanks a lot. With your tips, we rebuilt the deeplake imagenet dataset and it ran successfully. Compared to the previous version, the speed rose from 350 images/s to 1230 images/s. It performs well when only loading and transforming the data, but when I integrate it into the training process its performance drops a lot compared with pytorch and webdataset.

[screenshot]

deeplake version: 3.0.13, batch_size: 256, num_workers: 16

GxjGit commented 1 year ago

@davidbuniat I reported my deeplake test results in our department last week. Everyone is very interested in it, because we see the potential of deeplake for data reading and processing, and we are integrating it into our project. But we are still concerned about the end-to-end training performance mentioned above. Have you reproduced the issue, and do you plan to fix it?

levongh commented 1 year ago

> @davidbuniat I reported my deeplake test results in our department last week. Everyone is very interested in it, because we see the potential of deeplake for data reading and processing, and we are integrating it into our project. But we are still concerned about the end-to-end training performance mentioned above. Have you reproduced the issue, and do you plan to fix it?

Hi @GxjGit, sorry for the delayed response. We are currently working on a couple of improvements and will have preliminary results by tomorrow. We will keep you posted.

levongh commented 1 year ago

> @davidbuniat I reported my deeplake test results in our department last week. Everyone is very interested in it, because we see the potential of deeplake for data reading and processing, and we are integrating it into our project. But we are still concerned about the end-to-end training performance mentioned above. Have you reproduced the issue, and do you plan to fix it?

> Hi @GxjGit, sorry for the delayed response. We are currently working on a couple of improvements and will have preliminary results by tomorrow. We will keep you posted.

We examined a few avenues of improvement, but they did not materialize into meaningful speed increases. We're going to perform more detailed profiling within the next two weeks and will get back to you with the results as soon as they're done.

GxjGit commented 1 year ago

@levongh OK, thanks for your reply.

tatevikh commented 1 year ago

Closing for now. Will reopen once we improve the benchmarks.

GxjGit commented 1 year ago

@davidbuniat @levongh @tatevikh Hi, we are still following your project. I updated deeplake to version 3.2.13 and found a few problems. (1) Have you removed deeplake.experimental in the new version? (2) We found that the performance in 3.2.13 is worse than 3.0.13 (with deeplake.experimental); the performance seems to be back to the original level (284 images/s -> 160 images/s).

Please help me check whether the above performance data is correct. And when will you improve the benchmarks?

istranic commented 1 year ago

Hey @GxjGit

Re (1) - deeplake.experimental was moved to the enterprise API, and the query and dataloader are accessed via ds.query() and ds.dataloader().pytorch().

Re (2) - Which API did you use to measure the performance? If you used ds.pytorch(), that's the pure-python dataloader. To access the C++ dataloader (previously the experimental loader), you'll want to use ds.dataloader().pytorch().

The enterprise API is only available to users on the Growth Plan or higher, which comes with a 14-day free trial. The easiest way to try it out is to sign up in our app and install deeplake via pip install deeplake[enterprise].
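A minimal sketch of the switch, assuming the enterprise builder chains the same way as the experimental dataloader used earlier in this thread (ds and tform as in your earlier snippets):

import deeplake

ds = deeplake.load('hub://activeloop/imagenet-train')
train_loader = ds.dataloader()\
            .transform({'images': tform, 'labels': None})\
            .batch(256)\
            .shuffle()\
            .pytorch(num_workers=16)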

GxjGit commented 1 year ago

Hi @istranic

I want to try your enterprise API in my local environment to evaluate performance. How can I get access to it? Also, does deeplake surpass the pytorch native dataloader in end-to-end resnet50 training? Can you share your test data with me, like the following:

[screenshot]

The environment of our test:
num_workers: 16
batch_size: 256
dataset: imagenet
storage: object storage / local storage
GPU: 1 V100
CPU: 7 Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz

istranic commented 1 year ago

Hey @GxjGit

The enterprise API is only available to users on the Growth Plan or higher, which comes with a 14-day free trial. The easiest way to try it out is to sign up in our app and install deeplake via pip install deeplake[enterprise].

We will also run the benchmark highlighted in red above over the next week, and I expect our enterprise dataloader to be similar to the pytorch dataloader in terms of performance on local data. Please note that Deep Lake is not explicitly optimized to be faster for local data; most of our optimizations are for streaming applications, so that's where you'd expect to see significant improvements over non-Deep-Lake approaches.

GxjGit commented 1 year ago

Hi @istranic, thanks for the reply. I am looking forward to your test results.