conv_forward: TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not list

Tagar commented 4 years ago

Spin-off from https://github.com/pytorch/pytorch/issues/44628

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

$ conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90"

import fastbook
fastbook.setup_book()
from fastbook import *

from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)

learn.fine_tune(1)

produces:

Traceback (most recent call last):
  File "<command-8165228>", line 4, in <module>
    learn.fine_tune(1)
  File "../fastcore/utils.py", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/callback/schedule.py", line 161, in fine_tune
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
  File "../fastcore/utils.py", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "../fastcore/utils.py", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/learner.py", line 207, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "../fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/learner.py", line 197, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "../fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/learner.py", line 191, in _do_epoch
    self._do_epoch_train()
  File "../fastai/learner.py", line 183, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "../fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/learner.py", line 161, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "../fastai/learner.py", line 179, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "../fastai/learner.py", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/learner.py", line 164, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "../torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "../torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "../torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "../torch/nn/modules/conv.py", line 416, in _conv_forward
    self.padding, self.dilation, self.groups)
TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not list

Additional debugging showed the following local variables inside _conv_forward when it failed:

{'self': Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), 
            padding=(3, 3), bias=False), 
'input': [0], 
'weight': Parameter containing:
tensor([[[[ 5.4109e-03, -6.9092e-03,  7.8839e-03,  ...,  4.9072e-02,  3.0660e-02,  2.5398e-02],
          [ 4.1081e-02,  3.1296e-02,  3.2265e-02,  ...,  3.3145e-02,  2.9754e-02,  4.1735e-02],
          [ 4.9519e-03, -3.1705e-02, -6.1310e-02,  ..., -9.7493e-02, -1.1601e-01, -1.2191e-01],
          ...,

Expected behavior

No errors expected.

Environment

PyTorch Version (e.g., 1.0): 1.6.0

List of some of the conda/pip packages -

...
fastai                    2.0.10                     py_0    fastai
fastbook                  0.0.11             pyh39e3cac_0    fastai
fastcore                  1.0.9                      py_0    fastai
fastprogress              1.0.0              pyh39e3cac_0    fastai
fastscript                1.0.0                         0    fastai
nbdev                     1.0.18                     py_0    fastai
...
pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
torchvision               0.7.0                py37_cu101    pytorch
...
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow                2.3.0                    pypi_0    pypi
tensorflow-base           2.2.0           mkl_py37hd506778_0  
tensorflow-estimator      2.3.0                    pypi_0    pypi
...

OS (e.g., Linux): Ubuntu 18.04.5 LTS
How you installed PyTorch (conda, pip, source): conda, see above
Build command you used (if compiling from source): didn't build manually
Python version: 3.7.6
CUDA/cuDNN version: CUDA 10.1 Update 2, cuDNN 7.6.5, NCCL 2.7.3, TensorRT 6.0.1
GPU models and configuration: ec2 instance g4dn.4xlarge
Any other relevant information: Databricks ML Runtime 7.3

Tagar commented 4 years ago

Copying response from https://github.com/pytorch/pytorch/issues/44628

conv2d is called internally by the fastai library.

Notice that when _conv_forward fails, weight is already a Tensor. Only input is wrong: it is a list holding the single value 0, literally [0], as seen in the debug dump above.

I understand that input must somehow have arrived as [0] (a single-element list containing zero) directly from fastai. I tried to follow the logic in both libraries, but couldn't fully trace how input passes through all of these functions.
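One way to narrow down where the [0] comes from is to inspect the batch before it reaches the model. The following is only a sketch under assumptions: find_non_tensors and FakeTensor are hypothetical names, and in a real session tensor_type would be torch.Tensor, with the check run on the batch tuple that the traceback shows being unpacked in self.model(*self.xb).

```python
def find_non_tensors(batch, tensor_type):
    """Return (index, type name, repr) for every batch element
    that is not an instance of tensor_type."""
    return [
        (i, type(item).__name__, repr(item))
        for i, item in enumerate(batch)
        if not isinstance(item, tensor_type)
    ]


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""


# Mimics the failing case from the debug dump: the model input is the list [0].
bad_batch = ([0],)
good_batch = (FakeTensor(),)

print(find_non_tensors(bad_batch, FakeTensor))   # [(0, 'list', '[0]')]
print(find_non_tensors(good_batch, FakeTensor))  # []
```

If the check already fails on the batch produced by the data loader, the problem lies upstream of the model, in data loading rather than in conv2d itself.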

Tagar commented 4 years ago

@jph00 can you please have a look at this?

Thanks!

jph00 commented 4 years ago

I can't reproduce that. I've tried running the code you provided on colab and on my own machine, and it works in both cases.

Can you try to find out what you've got installed on your box that causes this behavior, or whether there's some other bit of code you ran first?

Tagar commented 4 years ago

@jph00 thanks for trying to reproduce this.

It looks like the issue may be caused by particular package versions or by other dependencies.

I was using Databricks Machine Learning Runtime 7.3 as a baseline. Here's the conda spec: https://docs.databricks.com/release-notes/runtime/7.3ml.html#python-on-gpu-clusters

On top of that, I installed the following for the fastai components to work:

%conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90" 
%sh pip install azure-cognitiveservices-search-imagesearch

A number of folks at Databricks and Databricks customers are trying to use fastai and are running into this issue, so it would be nice to understand its root cause.

P.S. Here's the complete package list of the conda environment in Databricks where this issue consistently happens:

```
# Name Version Build Channel
_libgcc_mutex 0.1 main
_tflow_select 2.3.0 mkl
absl-py 0.9.0 py37_0
adal 1.2.4 pypi_0 pypi
argon2-cffi 20.1.0 py37h7b6447c_1
asn1crypto 1.3.0 py37_1
astor 0.8.0 py37_0
astunparse 1.6.3 py_0
attrs 20.1.0 py_0
azure-cognitiveservices-search-imagesearch 2.0.0 pypi_0 pypi
azure-common 1.1.25 pypi_0 pypi
azure-core 1.8.0 pypi_0 pypi
azure-storage-blob 12.4.0 pypi_0 pypi
backcall 0.1.0 py37_0
backports 1.0 py_2
bcrypt 3.2.0 py37h7b6447c_0
blas 1.0 mkl
bleach 3.1.5 py_0
blinker 1.4 py37_0
boto3 1.12.0 py_0
botocore 1.15.0 py_0
c-ares 1.15.0 h7b6447c_1001
ca-certificates 2020.7.22 0
cachetools 4.1.1 py_0
cairo 1.14.12 h8948797_3
catalogue 1.0.0 py37_1
certifi 2020.6.20 py37_0
cffi 1.14.0 py37h2e261b9_0
chardet 3.0.4 py37_1003
click 7.0 py37_0
cloudpickle 1.3.0 py_0
configparser 3.7.4 py37_0
cryptography 2.8 py37h1ba5d50_0
cudatoolkit 10.1.243 h6bb024c_0
cycler 0.10.0 py37_0
cymem 2.0.3 py37he6710b0_0
cython 0.29.15 py37he6710b0_0
cython-blis 0.4.1 py37h7b6447c_1
databricks-cli 0.11.0 pypi_0 pypi
dbus 1.13.16 hb2f20db_0
decorator 4.4.1 py_0
defusedxml 0.6.0 py_0
dill 0.3.1.1 py37_1
diskcache 5.0.2 pypi_0 pypi
docker 4.3.1 pypi_0 pypi
docutils 0.15.2 py37_0
entrypoints 0.3 py37_0
expat 2.2.9 he6710b0_2
fastai 2.0.11 py_0 fastai
fastbook 0.0.11 pyh39e3cac_0 fastai
fastcore 1.0.11 py_0 fastai
fastprogress 1.0.0 pyh39e3cac_0 fastai
fastscript 1.0.0 0 fastai
flask 1.1.1 py_1
fontconfig 2.13.0 h9420a91_0
freetype 2.9.1 h8a8886c_1
fribidi 1.0.10 h7b6447c_0
future 0.18.2 py37_1
gast 0.3.3 py_0
gitdb 4.0.5 py_0
gitpython 3.1.0 py_0
glib 2.63.1 h5a9c865_0
google-auth 1.11.2 py_0
google-auth-oauthlib 0.4.1 py_2
google-pasta 0.2.0 py_0
gorilla 0.3.0 pypi_0 pypi
graphite2 1.3.14 h23475e2_0
graphviz 2.40.1 h21bd128_2
grpcio 1.27.2 py37hf8bcb03_0
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
gunicorn 20.0.4 py37_0
h5py 2.10.0 py37h7918eee_0
harfbuzz 1.8.8 hffaf4a1_0
hdf5 1.10.4 hb1b8bf9_0
horovod 0.19.5 pypi_0 pypi
icu 58.2 he6710b0_3
idna 2.8 py37_0
importlib-metadata 1.7.0 py37_0
importlib_metadata 1.7.0 0
intel-openmp 2020.0 166
ipykernel 5.1.4 py37h39e3cac_0
ipython 7.12.0 py37h5ca1d4c_0
ipython_genutils 0.2.0 py37_0
ipywidgets 7.5.1 py_0
isodate 0.6.0 py_1
itsdangerous 1.1.0 py37_0
jedi 0.14.1 py37_0
jinja2 2.11.1 py_0
jmespath 0.10.0 py_0
joblib 0.14.1 py_0
joblibspark 0.2.0 pypi_0 pypi
jpeg 9b h024ee3a_2
jsonschema 3.0.2 py37_0
jupyter_client 5.3.4 py37_0
jupyter_core 4.6.1 py37_0
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.1.0 py37he6710b0_0
koalas 1.2.0 pypi_0 pypi
krb5 1.16.4 h173b8e3_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libpq 11.2 h20c2e04_0
libprotobuf 3.11.4 hd408876_0
libsodium 1.0.16 h1bed415_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_0
libuuid 1.0.3 h1bed415_2
libxcb 1.14 h7b6447c_0
libxml2 2.9.9 hea5a465_1
lightgbm 2.3.0 py37he6710b0_0
lz4-c 1.8.1.2 h14c3975_0
mako 1.1.2 py_0
markdown 3.1.1 py37_0
markupsafe 1.1.1 py37h14c3975_1
matplotlib 3.1.3 py37_0
matplotlib-base 3.1.3 py37hef1b27d_0
mistune 0.8.4 py37h14c3975_1001
mkl 2020.0 166
mkl-service 2.3.0 py37he904b0f_0
mkl_fft 1.0.15 py37ha843d7b_0
mkl_random 1.1.0 py37hd6b4f25_0
mleap 0.16.1 pypi_0 pypi
mlflow 1.11.0 pypi_0 pypi
msrest 0.6.18 pypi_0 pypi
msrestazure 0.6.4 pypi_0 pypi
murmurhash 1.0.2 py37he6710b0_0
nb_conda 2.2.1 py37_0
nb_conda_kernels 2.2.4 py37_0
nbconvert 5.6.1 py37_1
nbdev 1.0.18 py_0 fastai
nbformat 5.0.7 py_0
ncurses 6.2 he6710b0_1
networkx 2.4 py_1
ninja 1.10.0 py37hfd86e86_0
nltk 3.4.5 py37_0
notebook 6.1.1 py37_0
numpy 1.18.1 py37h4f9e942_0
numpy-base 1.18.1 py37hde5b4d6_1
oauthlib 3.1.0 py_0
olefile 0.46 py37_0
openssl 1.1.1g h7b6447c_0
opt-einsum 3.3.0 pypi_0 pypi
opt_einsum 3.1.0 py_0
packaging 20.1 py_0
pandas 1.0.1 py37h0573a6f_0
pandoc 2.10.1 0
pandocfilters 1.4.2 py37_1
pango 1.42.4 h049681c_0
paramiko 2.7.1 py_0
parso 0.5.2 py_0
patsy 0.5.1 py37_0
pcre 8.44 he6710b0_0
petastorm 0.9.5 pypi_0 pypi
pexpect 4.8.0 py37_1
pickleshare 0.7.5 py37_1001
pillow 7.0.0 py37hb39fc2d_0
pip 20.0.2 py37_3
pixman 0.40.0 h7b6447c_0
plac 0.9.6 py37_1
plotly 4.9.0 py_0
preshed 3.0.2 py37he6710b0_1
prometheus_client 0.8.0 py_0
prompt_toolkit 3.0.3 py_0
protobuf 3.11.4 py37he6710b0_0
psutil 5.6.7 py37h7b6447c_0
psycopg2 2.8.4 py37h1ba5d50_0
ptyprocess 0.6.0 py37_0
pyarrow 1.0.1 pypi_0 pypi
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.7 py_0
pycparser 2.19 py37_0
pygments 2.5.2 py_0
pyjwt 1.7.1 py37_0
pynacl 1.3.0 py37h7b6447c_0
pyodbc 4.0.30 py37he6710b0_0
pyopenssl 19.1.0 py_1
pyparsing 2.4.6 py_0
pyqt 5.9.2 py37h05f1152_2
pyrsistent 0.16.0 py37h7b6447c_0
pysocks 1.7.1 py37_1
python 3.7.6 h0371630_2
python-dateutil 2.8.1 py_0
python-editor 1.0.4 py_0
python-graphviz 0.14 py_0
pytorch 1.6.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
pytz 2019.3 py_0
pyyaml 5.3.1 pypi_0 pypi
pyzmq 18.1.1 py37he6710b0_0
qt 5.9.7 h5867ecd_1
querystring-parser 1.2.4 pypi_0 pypi
readline 7.0 h7b6447c_5
requests 2.22.0 py37_1
requests-oauthlib 1.3.0 py_0
retrying 1.3.3 py37_2
rsa 4.0 py_0
s3transfer 0.3.3 py37_1
scikit-learn 0.22.1 py37hd81dba3_0
scipy 1.4.1 py37h0b6359f_0
seaborn 0.10.0 pypi_0 pypi
send2trash 1.5.0 py37_0
sentencepiece 0.1.85 pypi_0 pypi
setuptools 45.2.0 py37_0
simplejson 3.17.0 py37h7b6447c_0
sip 4.19.8 py37hf484d3e_0
six 1.14.0 py37_0
smmap 3.0.4 py_0
spacy 2.3.1 py37hfd86e86_0
spark-tensorflow-distributor 0.1.0 pypi_0 pypi
sqlite 3.31.1 h62c20be_1
sqlparse 0.3.0 py_0
srsly 1.0.2 py37he6710b0_0
statsmodels 0.11.0 py37h7b6447c_0
tabulate 0.8.3 py37_0
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.0 pypi_0 pypi
tensorflow-base 2.2.0 mkl_py37hd506778_0
tensorflow-estimator 2.3.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
terminado 0.8.3 py37_0
testpath 0.4.4 py_0
thinc 7.4.1 py37hfd86e86_0
tk 8.6.8 hbc83047_0
torchvision 0.7.0 py37_cu101 pytorch
tornado 6.0.3 py37h7b6447c_3
tqdm 4.42.1 py_0
traitlets 4.3.3 py37_0
unixodbc 2.3.7 h14c3975_0
urllib3 1.25.8 py37_0
wasabi 0.8.0 py_0
wcwidth 0.1.8 py_0
webencodings 0.5.1 py37_1
websocket-client 0.56.0 py37_0
werkzeug 1.0.0 py_0
wheel 0.34.2 py37_0
widgetsnbextension 3.5.1 py37_0
wrapt 1.11.2 py37h7b6447c_0
xgboost 1.1.1 pypi_0 pypi
xz 5.2.4 h14c3975_4
yaml 0.2.5 h7b6447c_0
zeromq 4.3.1 he6710b0_3
zipp 3.1.0 py_0
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
```

jph00 commented 4 years ago

It would be great to fix, I agree.

Could you try creating a new conda env, and see if you still have the problem? If not, could you try installing a few of the extra libs or different versions you have in the broken env, to track down where the issue is coming from?
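Comparing the package lists of two environments can be partly automated. Below is a minimal sketch, assuming each listing is saved as `conda list` text. parse_conda_list and diff_envs are hypothetical helper names, not part of conda or fastai. The sample rows are taken from the two listings earlier in this thread, which differ in their fastai and fastcore versions.

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list` style text."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, "# Name ..." headers, and "..." elision markers.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        packages[fields[0]] = fields[1]
    return packages


def diff_envs(env_a, env_b):
    """Return {name: (version_a, version_b)} for every package whose
    version differs or that is missing from one environment (None)."""
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }


first_listing = """
# Name Version Build Channel
fastai 2.0.10 py_0 fastai
fastcore 1.0.9 py_0 fastai
pytorch 1.6.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
"""

second_listing = """
# Name Version Build Channel
fastai 2.0.11 py_0 fastai
fastcore 1.0.11 py_0 fastai
pytorch 1.6.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
"""

differences = diff_envs(parse_conda_list(first_listing),
                        parse_conda_list(second_listing))
for name, (a, b) in differences.items():
    print(f"{name}: {a} -> {b}")
# fastai: 2.0.10 -> 2.0.11
# fastcore: 1.0.9 -> 1.0.11
```

Running the same comparison between a fresh environment where the code works and the failing one gives a short list of candidate packages to install one at a time.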

Tagar commented 4 years ago

The conda environment is not broken per se. I tried many times and it consistently fails with this exception. Many of those versions come standard with that particular version of the Databricks runtime, Machine Learning Runtime (MLR) 7.3 for GPU. I tried different MLR versions and none of them work. Are there any known compatibility issues in fastai? Are any of the above package versions much newer or much older than what you would expect to see, or than what you normally test fastai against?

The response I got from PyTorch developers @mariosasko and @albanD in pytorch/pytorch#44628 is that fastai doesn't use the PyTorch library correctly in this case: the input is [0] (a single-element list with just 0 in it), while it has to be a tensor.

jph00 commented 4 years ago

This is the first time that this issue has been reported. There are no known compatibility issues. The only way I think we can debug it is by following the steps I requested in my previous reply.

Tagar commented 4 years ago

The issue is on the Databricks side and is related to multiprocessing. The workaround is to set num_workers=0 in ImageDataLoaders.from_name_func. We will look into how to solve this. Thank you, everyone, for your help.
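A minimal sketch of that workaround applied to the reproduction code above, assuming fastai passes the extra num_workers keyword through to the underlying data loader. build_dls is a hypothetical helper name, and fastai is imported inside it so the snippet can be loaded on a machine without fastai.

```python
def is_cat(filename):
    """Label function from the reproduction: cat images start with an uppercase letter."""
    return filename[0].isupper()


def build_dls(image_dir, num_workers=0):
    """Build the pets DataLoaders with batches loaded in the main process.

    num_workers=0 disables the worker subprocesses, which is the
    workaround reported in this thread for Databricks.
    """
    # Imported lazily so this module can be loaded without fastai installed.
    from fastai.vision.all import ImageDataLoaders, Resize, get_image_files

    return ImageDataLoaders.from_name_func(
        image_dir, get_image_files(image_dir), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224),
        num_workers=num_workers)


# Usage, following the reproduction steps (requires fastai and downloads the dataset):
# from fastai.vision.all import *
# dls = build_dls(untar_data(URLs.PETS)/'images')
# learn = cnn_learner(dls, resnet34, metrics=error_rate)
# learn.fine_tune(1)
```

Loading batches in the main process is usually slower than using workers, so this is a stopgap rather than a fix for the underlying multiprocessing problem.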

jph00 commented 4 years ago

Do let us know if you figure out the solution, in case we see similar reports in the future.

Tagar commented 4 years ago

@jph00 absolutely

cc @mengxr

SoulEvill commented 4 years ago

Thanks for sharing! I was also using Databricks for the fastai course and had the same issue. It ran fine after I set num_workers=0.

Tagar commented 4 years ago

@SoulEvill thanks for letting us know. We hope to fix this in the next release, MLR 7.4, so that num_workers=0 won't be needed then.

jph00 commented 4 years ago

Have you tracked down what the source issue was?

Tagar commented 4 years ago

@jph00 from what I understand, there are multiple issues. Here is a fix for one of them: https://github.com/pytorch/pytorch/pull/45870. @mengxr can comment on this better.
