activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

[BUG] ImageNet loader returns weird S3GetError when using memory_cache_size=12000 and local_cache_size=120000 #1648

Closed AntreasAntoniou closed 1 month ago

AntreasAntoniou commented 2 years ago

🐛🐛 Bug Report

I was trying to increase the performance of the dataloader by giving it more cache, both in memory and on disk, and it backfired: I now keep getting the following error even when I remove the cache arguments.

Note: I am using the skip/agreement branch because I am running things at scale and do not want to accept all agreements every time. Also, the way hub currently handles agreements does not play well with Hydra, and the skip/agreement branch was a quick workaround by @davidbuniat to help me out.

gate/datamodules/imagenet.py:96: in setup
    self.train_set = ImageNetClassificationDataset(
        self       = <gate.datamodules.imagenet.ImageNetDataModule object at 0x7f5d4253a0d0>
        stage      = 'fit'
gate/datasets/imagenet.py:46: in __init__
    self.dataset = hub.load(
        __class__  = <class 'gate.datasets.imagenet.ImageNetClassificationDataset'>
        dataset_root = 'datasets/imagenet'
        download   = False
        self       = <gate.datasets.imagenet.ImageNetClassificationDataset object at 0x7f5d4253ac10>
        set_name   = 'train'
../conda/envs/gate-env/lib/python3.8/site-packages/hub/api/dataset.py:276: in load
    return dataset_factory(
        cache_chain = <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>
        creds      = {}
        local_cache_size = 0
        memory_cache_size = 256
        path       = 'hub://activeloop/imagenet-train'
        read_only  = True
        skip_agreement = True
        storage    = <hub.core.storage.s3.S3Provider object at 0x7f5d4253aca0>
        token      = 'eyJhbGciOiJIUzUxMiIsImlhdCI6MTY0OTk5NDA2NCwiZXhwIjo0ODAzNTk0MDY0fQ.eyJpZCI6ImFudHJlYXNhbnRvbmlvdSJ9.PHUquycfAdeWpOKVaQ0t9-2tXi0gYQfBAG2kdjU2xUNqpYJLDyBYYzyPs1oZl_60z0E7hqAzuarTrgQIjar7mA'
        verbose    = True
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/dataset/__init__.py:22: in dataset_factory
    ds = clz(path=path, *args, **kwargs)
        args       = ()
        clz        = <class 'hub.core.dataset.hub_cloud_dataset.HubCloudDataset'>
        kwargs     = {'read_only': True, 'skip_agreement': True, 'storage': <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>,...6ImFudHJlYXNhbnRvbmlvdSJ9.PHUquycfAdeWpOKVaQ0t9-2tXi0gYQfBAG2kdjU2xUNqpYJLDyBYYzyPs1oZl_60z0E7hqAzuarTrgQIjar7mA', ...}
        path       = 'hub://activeloop/imagenet-train'
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/dataset/dataset.py:177: in __init__
    self._set_derived_attributes()
        d          = {'_client': None, '_ds_diff': None, '_info': None, '_locked_out': False, ...}
        group_index = ''
        index      = None
        is_iteration = False
        kwargs     = {}
        path       = 'hub://activeloop/imagenet-train'
        public     = False
        read_only  = True
        self       = Dataset(path='hub://activeloop/imagenet-train', read_only=True, tensors=['images'])
        skip_agreement = True
        storage    = <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>
        token      = 'eyJhbGciOiJIUzUxMiIsImlhdCI6MTY0OTk5NDA2NCwiZXhwIjo0ODAzNTk0MDY0fQ.eyJpZCI6ImFudHJlYXNhbnRvbmlvdSJ9.PHUquycfAdeWpOKVaQ0t9-2tXi0gYQfBAG2kdjU2xUNqpYJLDyBYYzyPs1oZl_60z0E7hqAzuarTrgQIjar7mA'
        verbose    = True
        version_state = None
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/dataset/dataset.py:1190: in _set_derived_attributes
    self._populate_meta()  # TODO: use the same scheme as `load_info`
        self       = Dataset(path='hub://activeloop/imagenet-train', read_only=True, tensors=['images'])
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/dataset/dataset.py:976: in _populate_meta
    load_meta(self)
        self       = Dataset(path='hub://activeloop/imagenet-train', read_only=True, tensors=['images'])
../conda/envs/gate-env/lib/python3.8/site-packages/hub/util/version_control.py:506: in load_meta
    _tensors[tensor_name] = Tensor(tensor_name, dataset)
        Tensor     = <class 'hub.core.tensor.Tensor'>
        _tensors   = {'images': Tensor(key='images')}
        dataset    = Dataset(path='hub://activeloop/imagenet-train', read_only=True, tensors=['images'])
        meta       = <hub.core.meta.dataset_meta.DatasetMeta object at 0x7f5d41b2eeb0>
        meta_key   = 'versions/896199c043b49f598410896c5d2621459acbd7b5/dataset_meta.json'
        storage    = <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>
        tensor_name = 'labels'
        version_state = {'branch': 'main', 'branch_commit_map': {'main': '896199c043b49f598410896c5d2621459acbd7b5'}, 'commit_id': '896199c043...', 'commit_node': Commit : 896199c043b49f598410896c5d2621459acbd7b5 (main)
Author : None
Time   :
Message: None, ...}
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/tensor.py:216: in __init__
    if not self.is_iteration and not tensor_exists(
        chunk_engine = None
        dataset    = Dataset(path='hub://activeloop/imagenet-train', read_only=True, tensors=['images'])
        index      = None
        is_iteration = False
        key        = 'labels'
        self       = Tensor(key='labels')
../conda/envs/gate-env/lib/python3.8/site-packages/hub/util/keys.py:157: in tensor_exists
    storage[get_tensor_meta_key(key, commit_id)]
        commit_id  = '896199c043b49f598410896c5d2621459acbd7b5'
        key        = 'labels'
        storage    = <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>
../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/storage/lru_cache.py:167: in __getitem__
    result = self.next_storage[path]
        path       = 'versions/896199c043b49f598410896c5d2621459acbd7b5/labels/tensor_meta.json'
        self       = <hub.core.storage.lru_cache.LRUCache object at 0x7f5d41b2e340>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <hub.core.storage.s3.S3Provider object at 0x7f5d4253aca0>
path = 'protected/activeloop/imagenet-train/versions/896199c043b49f598410896c5d2621459acbd7b5/labels/tensor_meta.json'

    def __getitem__(self, path):
        """Gets the object present at the path.

        Args:
            path (str): the path relative to the root of the S3Provider.

        Returns:
            bytes: The bytes of the object present at the path.

        Raises:
            KeyError: If an object is not found at the path.
            S3GetError: Any other error other than KeyError while retrieving the object.
        """
        self._check_update_creds()
        path = "".join((self.path, path))
        try:
            return self._get(path)
        except botocore.exceptions.ClientError as err:
            if err.response["Error"]["Code"] == "NoSuchKey":
                raise KeyError(err) from err
            reload = self.need_to_reload_creds(err)
            manager = S3ReloadCredentialsManager if reload else S3ResetClientManager
            with manager(self, S3GetError):
                return self._get(path)
        except CONNECTION_ERRORS as err:
            tries = self.num_tries
            for i in range(1, tries + 1):
                warnings.warn(f"Encountered connection error, retry {i} out of {tries}")
                try:
                    return self._get(path)
                except Exception:
                    pass
>           raise S3GetError(err) from err
E           hub.util.exceptions.S3GetError: An error occurred while reading from response stream: ('Connection broken: IncompleteRead(0 bytes read, 331 more expected)', IncompleteRead(0 bytes read, 331 more expected))

i          = 1
path       = 'protected/activeloop/imagenet-train/versions/896199c043b49f598410896c5d2621459acbd7b5/labels/tensor_meta.json'
self       = <hub.core.storage.s3.S3Provider object at 0x7f5d4253aca0>
tries      = 1

../conda/envs/gate-env/lib/python3.8/site-packages/hub/core/storage/s3.py:224: S3GetError

⚗️ Current Behavior

It seems that using large cache sizes breaks the dataloader.

Input Code

set_name = 'train'
self.dataset_path = f"hub://activeloop/imagenet-{set_name}"
self.dataset = hub.load(
    self.dataset_path,
    token=os.environ.get("HUB_AUTH_TOKEN"),
    skip_agreement=True,
    memory_cache_size=12000,
    local_cache_size=120000,
)

Expected behavior/code: The dataloader should work and use the cache to speed things up in later epochs.
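In the meantime, a minimal retry wrapper around hub.load would at least tell me whether the S3GetError is just a transient connection issue. This is only a sketch; load_with_retries and its arguments are hypothetical helpers, not part of hub:

    # Hypothetical workaround sketch, not hub API: retry hub.load a few times,
    # since the traceback suggests a broken connection (IncompleteRead) rather
    # than a deterministic failure tied to the cache arguments.
    import os
    import time

    import hub
    from hub.util.exceptions import S3GetError


    def load_with_retries(path, retries=3, delay=5.0, **kwargs):
        """Call hub.load, retrying on transient S3 read errors."""
        for attempt in range(1, retries + 1):
            try:
                return hub.load(path, **kwargs)
            except S3GetError:
                if attempt == retries:
                    raise
                time.sleep(delay)


    dataset = load_with_retries(
        "hub://activeloop/imagenet-train",
        token=os.environ.get("HUB_AUTH_TOKEN"),
        skip_agreement=True,
        memory_cache_size=12000,
        local_cache_size=120000,
    )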


⚙️ Environment

🖼 Additional context/Screenshots (optional)

Add any other context about the problem here. If applicable, add screenshots to help explain.

AbhinavTuli commented 2 years ago

Hey @AntreasAntoniou, I'm currently looking into the issue, but the behavior is indeed very strange. Could you try other datasets, maybe different splits of ImageNet, and let me know if the behavior persists?

There could also be a memory leak that is preventing S3 from reading data; could you share the output of htop after running this? Something like the sketch below would help narrow it down.
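This is just a sketch (the split names are examples, and the cache sizes are copied from your report), meant to show whether the error follows the cache arguments or the specific dataset:

    # Hypothetical reproduction sketch: same cache arguments, different datasets,
    # to check whether the S3GetError follows the cache sizes or imagenet-train.
    import os
    import hub

    for split in ("val", "test"):  # example split names, adjust to what you have access to
        ds = hub.load(
            f"hub://activeloop/imagenet-{split}",
            token=os.environ.get("HUB_AUTH_TOKEN"),
            skip_agreement=True,
            memory_cache_size=12000,
            local_cache_size=120000,
        )
        print(split, len(ds))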

activesoull commented 1 month ago

Closing: the original issue has been fixed, and the conflicting-arguments issue was fixed in PR https://github.com/activeloopai/deeplake/pull/2954.