conv_forward: TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not list #2794

Closed Tagar closed 4 years ago

Tagar commented 4 years ago

Spin-off from

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

$ conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90"

import fastbook
from fastbook import *

from import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)



Traceback (most recent call last):
  File "<command-8165228>", line 4, in <module>
  File "../fastcore/", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/callback/", line 161, in fine_tune
    self.fit_one_cycle(freeze_epochs, slice(base_lr), pct_start=0.99, **kwargs)
  File "../fastcore/", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/callback/", line 113, in fit_one_cycle, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "../fastcore/", line 473, in _f
    return inst if to_return else f(*args, **kwargs)
  File "../fastai/", line 207, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "../fastai/", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/", line 197, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "../fastai/", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/", line 191, in _do_epoch
  File "../fastai/", line 183, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "../fastai/", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/", line 161, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "../fastai/", line 179, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "../fastai/", line 155, in _with_events
    try:       self(f'before_{event_type}')       ;f()
  File "../fastai/", line 164, in _do_one_batch
    self.pred = self.model(*self.xb)
  File "../torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/", line 117, in forward
    input = module(input)
  File "../torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/", line 117, in forward
    input = module(input)
  File "../torch/nn/modules/", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "../torch/nn/modules/", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "../torch/nn/modules/", line 416, in _conv_forward
    self.padding, self.dilation, self.groups)
TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not list

Additional debugging showed the following local variables inside _conv_forward when it failed:

{'self': Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), 
            padding=(3, 3), bias=False), 
'input': [0], 
'weight': Parameter containing:
tensor([[[[ 5.4109e-03, -6.9092e-03,  7.8839e-03,  ...,  4.9072e-02,  3.0660e-02,  2.5398e-02],
          [ 4.1081e-02,  3.1296e-02,  3.2265e-02,  ...,  3.3145e-02,  2.9754e-02,  4.1735e-02],
          [ 4.9519e-03, -3.1705e-02, -6.1310e-02,  ..., -9.7493e-02, -1.1601e-01, -1.2191e-01],

Expected behavior

No errors expected.


List of some of the conda/pip packages -

fastai                    2.0.10                     py_0    fastai
fastbook                  0.0.11             pyh39e3cac_0    fastai
fastcore                  1.0.9                      py_0    fastai
fastprogress              1.0.0              pyh39e3cac_0    fastai
fastscript                1.0.0                         0    fastai
nbdev                     1.0.18                     py_0    fastai
pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
torchvision               0.7.0                py37_cu101    pytorch
tensorboard               2.3.0                    pypi_0    pypi
tensorboard-plugin-wit    1.7.0                    pypi_0    pypi
tensorflow                2.3.0                    pypi_0    pypi
tensorflow-base           2.2.0           mkl_py37hd506778_0  
tensorflow-estimator      2.3.0                    pypi_0    pypi
Tagar commented 4 years ago

Copying response from

conv2d is called internally by fastai library.

Notice that when _conv_forward fails, weight is already a Tensor; only input is wrong. It is a list containing the single value 0, literally [0], as seen in the debug dump above.

As I understand it, input must somehow have arrived as [0] (a single-element list containing zero) directly from fastai. I tried to trace the logic in both libraries, but couldn't completely follow how input passes through all of these functions.
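
One way to narrow this down, sketched here under assumptions, is to look at what the data loader yields before the model is ever called. The helper below is hypothetical (it is not part of fastai or PyTorch): it lists the top-level entries of a batch that are not instances of the expected tensor type. The demonstration uses plain Python types as stand-ins so it runs without either library.

```python
def non_tensor_entries(batch, tensor_type):
    """Return (position, type name) for each top-level batch entry
    that is not an instance of tensor_type.

    Hypothetical debugging helper; not part of fastai or PyTorch.
    """
    return [(i, type(item).__name__)
            for i, item in enumerate(batch)
            if not isinstance(item, tensor_type)]


# Stand-in demonstration: float plays the role of the tensor type,
# and the list [0] plays the role of the bad input seen in the dump.
print(non_tensor_entries((1.5, [0]), float))  # reports the list at position 1
```

In the failing environment this could be called as non_tensor_entries(dls.one_batch(), torch.Tensor), assuming torch is imported and dls is the object built in the reproduction steps. An empty result would mean every top-level entry of the batch is a tensor, which would point away from data loading.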

Tagar commented 4 years ago

@jph00 can you please have a look at this?


jph00 commented 4 years ago

I can't reproduce that. I've tried running the code you provided on colab and on my own machine, and it works in both cases.

Can you try to find out what you have installed on your box that causes this behavior, or whether there's some other bit of code you ran first?

Tagar commented 4 years ago

@jph00 thanks for trying to reproduce this.

It looks like the issue may be caused by specific package versions or by other dependencies.

I was using Databricks Machine Learning Runtime 7.3 as a baseline - here's the conda spec

On top of that, I installed the following for the fastai components to work -

%conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90" 
%sh pip install azure-cognitiveservices-search-imagesearch

We have a number of folks at Databricks, as well as Databricks customers, who are trying to use fastai and are running into this issue, so it would be good to understand its root cause.

jph00 commented 4 years ago

It would be great to fix, I agree.

Could you try creating a new conda env, and see if you still have the problem? If not, could you try installing a few of the extra libs or different versions you have in the broken env, to track down where the issue is coming from?
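
A rough sketch of how that comparison could be done: save the output of conda list from a fresh, working env and from the failing env, then diff the two. The function names parse_conda_list and diff_envs are hypothetical helpers, not part of conda. The sample lines are copied from the package list in this issue; the "working" env here is simply a subset of them, for illustration only.

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list`-style text
    (columns: name, version, build, channel)."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comment lines, and lines without a version column.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        packages[fields[0]] = fields[1]
    return packages


def diff_envs(working, broken):
    """Return {name: (working_version, broken_version)} for every package
    whose version differs or that is missing from one of the two envs."""
    names = sorted(set(working) | set(broken))
    return {n: (working.get(n), broken.get(n))
            for n in names
            if working.get(n) != broken.get(n)}


broken = parse_conda_list("""
fastai                    2.0.10                     py_0    fastai
pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
tensorflow-base           2.2.0           mkl_py37hd506778_0
""")
working = parse_conda_list("""
fastai                    2.0.10                     py_0    fastai
pytorch                   1.6.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
""")
print(diff_envs(working, broken))
```

Each package reported by the diff is a candidate to install into the fresh env, one at a time, until the error reappears.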

Tagar commented 4 years ago

The conda environment is not broken per se. I tried many times and it consistently fails with this exception. Many of those versions come standard with that particular version of the Databricks runtime, Machine Learning Runtime (MLR) 7.3 for GPU. I tried different MLR versions and none of them work. Are there any known compatibility issues in fastai? Are any of the above package versions much newer or much older than what you would expect to see, or than what you normally test fastai against?

The response I got from PyTorch developers @mariosasko and @albanD in pytorch/pytorch#44628 is that fastai is not calling the PyTorch library correctly in this case: the input is [0] (a single-element list with just 0 in it), whereas it has to be a tensor.

jph00 commented 4 years ago

This is the first time this issue has been reported. There are no known compatibility issues. The only way I think we can debug it is by following the steps I requested in my previous reply.

Tagar commented 4 years ago

The issue is on the Databricks side and is related to multiprocessing. The workaround is to set num_workers=0 in ImageDataLoaders.from_name_func. We will look into how to solve this. Thank you, everyone, for your help.
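
For anyone hitting the same error, a minimal sketch of the workaround applied to the reproduction steps above. It assumes fastai 2.x, where these names are importable from fastai.vision.all; the wrapper name build_pets_dls is hypothetical and exists only to keep the example self-contained.

```python
def build_pets_dls(num_workers=0):
    """Build the DataLoaders from the reproduction steps.

    With the default num_workers=0, batches are loaded in the main
    process instead of in worker subprocesses.
    """
    # Imported lazily so that defining this function does not require fastai.
    from fastai.vision.all import (ImageDataLoaders, Resize, URLs,
                                   get_image_files, untar_data)

    path = untar_data(URLs.PETS) / 'images'

    def is_cat(x):
        # Same labelling function as in the reproduction steps.
        return x[0].isupper()

    # num_workers is forwarded to the underlying data loaders.
    return ImageDataLoaders.from_name_func(
        path, get_image_files(path), valid_pct=0.2, seed=42,
        label_func=is_cat, item_tfms=Resize(224),
        num_workers=num_workers)
```

Calling dls = build_pets_dls() and passing dls to cnn_learner as before should avoid the multiprocessing code path, at the cost of slower data loading.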

jph00 commented 4 years ago

Do let us know if you figure out the solution, in case we see similar reports in the future.

Tagar commented 4 years ago

@jph00 absolutely

cc @mengxr

SoulEvill commented 4 years ago

Thanks for sharing! I was using Databricks for the fastai course too and had the same issue. I was able to run it fine after setting num_workers=0.

Tagar commented 4 years ago

@SoulEvill thanks for letting us know. We hope to fix this in the next release, MLR 7.4, so that num_workers=0 is no longer needed.

jph00 commented 4 years ago

Have you tracked down what the source issue was?

Tagar commented 4 years ago

@jph00 from what I understand, there are multiple issues. That's a fix for one of them - @mengxr can comment on this better