MadryLab / datamodels

error running example script "AttributeError: 'str' object has no attribute 'type'" #2

Closed paullintilhac closed 2 months ago

paullintilhac commented 3 months ago

I have tried to run the example in this repo both on my university's Slurm cluster and on Google Colab, and I keep ending up with the same error:

AttributeError: 'str' object has no attribute 'type'

I was able to get rid of this error by editing one of the Python package source files directly, but then, predictably, I got a different error: RuntimeError: No HIP GPUs are available. I'm not sure how to solve that one, and it still raises the question of why the first error happens at all.

Steps to reproduce (the file system shown here is what I used for colab, but you can replace the paths with whatever download directory you use on whatever system you have):

#install dependencies (shell)
git clone https://github.com/MadryLab/datamodels.git
cd datamodels
pip install fastargs
pip install terminaltables
wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
tar xjf parallel-latest.tar.bz2
cd /content/datamodels/parallel-20240622
./configure && make
make install
parallel --version
cd /content/datamodels
apt update && apt install -y --no-install-recommends libopencv-dev libturbojpeg-dev
cp -f /usr/lib/x86_64-linux-gnu/pkgconfig/opencv.pc /usr/lib/x86_64-linux-gnu/pkgconfig/opencv4.pc
pip install mosaicml ffcv numba opencv-python
pip install cupy-cuda12x

#download dataset (Python)
from typing import List

import torch
import torch as ch
import torchvision

from ffcv.fields import IntField, RGBImageField
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import RandomHorizontalFlip, Cutout, \
    RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze
from ffcv.writer import DatasetWriter
datasets = {
    'train': torchvision.datasets.CIFAR10('/content', train=True, download=True),
    'test': torchvision.datasets.CIFAR10('/content', train=False, download=True)
}

for (name, ds) in datasets.items():
    writer = DatasetWriter(f'/content/cifar_{name}.beton', {
        'image': RGBImageField(),
        'label': IntField()
    })
    writer.from_indexed_dataset(ds)

#run the example (shell)
bash examples/cifar10/example.sh

I have tried many different conda environments, including Python 3.8 (as the repo suggests) and 3.9, CUDA 12.1 and 12.2, and ROCm 6 and 5.4. All of them give me one of the two errors above.

Any idea how I can get around this? Full stack trace:


+--------------------------+-------------------------------------+
| Parameter                | Value                               |
+--------------------------+-------------------------------------+
| worker.index             | 1                                   |
| worker.main_import       | examples.cifar10.train_cifar        |
| worker.logdir            | /tmp/10921                          |
| worker.do_if_complete    | False                               |
| worker.job_timeout       | 99999999                            |
| training.lr              | 0.5                                 |
| training.epochs          | 24                                  |
| training.lr_peak_epoch   | 5                                   |
| training.batch_size      | 512                                 |
| training.momentum        | 0.9                                 |
| training.weight_decay    | 0.0005                              |
| training.label_smoothing | 0.1                                 |
| training.num_workers     | 1                                   |
| training.lr_tta          | True                                |
| data.train_dataset       | /content/cifar_train.beton          |
| data.val_dataset         | /content/cifar-ffcv/cifar_val.beton |
+--------------------------+-------------------------------------+
logging in /tmp/10921
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/content/datamodels/datamodels/training/worker.py", line 109, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/content/datamodels/datamodels/training/worker.py", line 105, in main
    status = do_index(routine=routine)
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/content/datamodels/datamodels/training/worker.py", line 84, in do_index
    to_log = routine(index=index, logdir=str(worker_logs))
  File "/content/datamodels/examples/cifar10/train_cifar.py", line 185, in main
    loaders = make_dataloaders(mask=np.nonzero(mask)[0])
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 63, in result
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/fastargs/decorators.py", line 35, in __call__
    return self.func(*args, **filled_args)
  File "/content/datamodels/examples/cifar10/train_cifar.py", line 76, in make_dataloaders
    loaders[name] = Loader(paths[name], indices=(mask if name == 'train' else None),
  File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/loader.py", line 210, in __init__
    self.generate_code()
  File "/usr/local/lib/python3.10/dist-packages/ffcv/loader/loader.py", line 275, in generate_code
    queries, code = self.graph.collect_requirements()
  File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
    self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
  File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
    self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
  File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 352, in collect_requirements
    self.collect_requirements(next_state, node, allocations, code, source_field=source_field)
  [Previous line repeated 3 more times]
  File "/usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py", line 333, in collect_requirements
    if next_state.device.type != 'cuda' and isinstance(operation,
AttributeError: 'str' object has no attribute 'type'

paullintilhac commented 3 months ago

Quick follow-up on this. I can get it to run on Google Colab by editing line 333 of the file referenced above, /usr/local/lib/python3.10/dist-packages/ffcv/pipeline/graph.py, changing

if next_state.device.type != 'cuda'

to

if next_state.device != 'cuda:0'
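
For anyone reading along, here is a minimal sketch of why comparing against the literal 'cuda:0' only works for that exact device string, together with a more tolerant check. The helper name device_kind is hypothetical and is not part of ffcv; it assumes the device is either a plain string or an object that exposes the device family as a .type attribute, as torch.device does.

```python
def device_kind(device):
    """Return the device family ('cuda', 'cpu', ...) for a str or a torch.device-like object."""
    if isinstance(device, str):
        # A plain string such as 'cuda:0' has no .type attribute,
        # so take the part before the optional ':index' suffix.
        return device.split(":", 1)[0]
    # torch.device objects expose the family as .type
    return device.type


# A string comparison against 'cuda:0' would treat 'cuda:1' as a non-CUDA device;
# comparing the family does not.
print(device_kind("cuda:0"), device_kind("cuda:1"), device_kind("cpu"))  # cuda cuda cpu
```

This only illustrates the type mismatch; patching an installed package in place is lost on the next reinstall or upgrade.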

Is anyone else experiencing this error with ffcv when trying to run the training code?

paullintilhac commented 3 months ago

I created a corresponding issue in the ffcv repository: https://github.com/libffcv/ffcv/issues/380

kristian-georgiev commented 2 months ago

You need to replace "cuda:0" with ch.device("cuda:0") here https://github.com/MadryLab/datamodels/blob/61e590a6d857b31b6b11be10800f7c9bba6b400e/examples/cifar10/train_cifar.py#L58 and here https://github.com/MadryLab/datamodels/blob/61e590a6d857b31b6b11be10800f7c9bba6b400e/examples/cifar10/train_cifar.py#L68.
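
A minimal sketch of that replacement, assuming only that PyTorch is installed. The helper as_device is hypothetical (it is not in the repo or in ffcv); it shows that wrapping the string in ch.device gives an object with the .type attribute the traceback is looking for. Constructing the device object does not itself require a GPU.

```python
import torch as ch


def as_device(device):
    """Accept 'cuda:0' or a torch.device and always return a torch.device."""
    return device if isinstance(device, ch.device) else ch.device(device)


dev = as_device("cuda:0")
print(dev.type, dev.index)  # cuda 0

# In the pipeline definitions, the object would then be passed where the
# string was passed before, e.g. (ffcv is not imported in this sketch):
#   ToDevice(as_device("cuda:0"), non_blocking=True)
```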

paullintilhac commented 2 months ago

Thank you! That works. Do you know what the underlying cause of the issue was?

kristian-georgiev commented 2 months ago

I believe it was an update in ffcv.
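
If it helps to pin down which release changed the behavior, a small sketch like the following records the installed versions so they can be included in a report. The function name installed_versions and the package list are illustrative assumptions; only the standard-library importlib.metadata module is used.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions(["ffcv", "torch", "numba"]))
```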