Closed Tagar closed 4 years ago
Copying response from https://github.com/pytorch/pytorch/issues/44628
conv2d is called internally by fastai library.
Notice that when _conv_forward fails, it has weights as a Tensor already, and only input is a list of one single value 0, literally [0], as seen in the debug dump above.
I understand that input must have come as [0] (single-element list of zero) from fastai directly somehow. I was trying to follow the logic in both of these libraries, but couldn't completely follow how input was going through all of these functions.
@jph00 can you please have a look at this?
Thanks!
I can't reproduce that. I've tried running the code you provided on colab and on my own machine, and it works in both cases.
Can you see if you can find out what you've got installed on your box that causes this behavior, or whether there's some other bit of code you ran first?
@jph00 thanks for trying to reproduce this.
It looks like the issue may be caused by specific versions of some packages or by other dependencies.
I was using Databricks Machine Learning Runtime 7.3 as a baseline. Here's the conda spec: https://docs.databricks.com/release-notes/runtime/7.3ml.html#python-on-gpu-clusters
On top of that, I installed the following for the fastai components to work:
%conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90"
%sh pip install azure-cognitiveservices-search-imagesearch
We have a number of folks at Databricks, as well as Databricks customers, who are trying to use fastai and running into this issue, so it would be nice to understand the root cause.
It would be great to fix, I agree.
Could you try creating a new conda env, and see if you still have the problem? If not, could you try installing a few of the extra libs or different versions you have in the broken env, to track down where the issue is coming from?
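One way to compare a working environment with the failing one is to dump the installed package versions in each and diff them. The following is only a minimal sketch using the standard library's importlib.metadata (Python 3.8 or newer); the helper names installed_versions and diff_versions are hypothetical and not part of fastai or PyTorch.

```python
from importlib import metadata


def installed_versions():
    """Map each installed distribution name (lowercased) to its version."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing or broken metadata
            versions[name.lower()] = dist.version
    return versions


def diff_versions(working, broken):
    """Return {name: (working_version, broken_version)} where the two differ.

    A version of None means the package is absent from that environment.
    """
    diffs = {}
    for name in sorted(set(working) | set(broken)):
        a, b = working.get(name), broken.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs
```

Running installed_versions() in each environment, saving the results (for example as JSON), and passing both dictionaries to diff_versions gives a short list of candidate packages to install or pin one at a time.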
The conda environment is not broken per se. I tried many times and it consistently fails with this exception. Many of those versions come standard on that particular version of the Databricks runtime, Machine Learning Runtime (MLR) 7.3 for GPU. I tried different MLR versions and none of them work. Are there any known compatibility issues in fastai? Are any of the above package versions much newer or much older than what you would expect to see, or than what you normally test fastai against?
Response I got from PyTorch developers @mariosasko @albanD in pytorch/pytorch#44628
fastai doesn't use the pytorch library correctly in that case: in fit, input is [0] (a single-element list with just 0 in it), while it has to be a tensor.
This is the first time that this issue has been reported. There are no known compatibility issues. The only way I think we can debug it is by following the steps I requested in my previous reply.
The issue is on the Databricks side and is related to multiprocessing. The workaround is to set num_workers=0 in DataLoaders.from_name_func. We will have a look at how to solve this. Thank you, everyone, for your help.
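For anyone hitting the same error, the workaround could look roughly like the sketch below. It assumes the fastbook-style call to ImageDataLoaders.from_name_func, which forwards extra keyword arguments such as num_workers to the underlying data loaders. The helper names loader_kwargs and build_dls are hypothetical, and detecting Databricks through the DATABRICKS_RUNTIME_VERSION environment variable is an assumption; passing num_workers=0 directly works just as well.

```python
import os


def loader_kwargs(environ=os.environ):
    """Return extra data loader keyword arguments for the current environment.

    Assumption: Databricks runtimes set DATABRICKS_RUNTIME_VERSION.
    When it is present, disable worker processes to avoid the
    multiprocessing problem described above.
    """
    on_databricks = "DATABRICKS_RUNTIME_VERSION" in environ
    return {"num_workers": 0} if on_databricks else {}


def build_dls(path, label_func):
    """Build image DataLoaders, loading in the main process on Databricks."""
    # Imported lazily so loader_kwargs can be used without fastai installed.
    from fastai.vision.all import ImageDataLoaders, Resize, get_image_files

    return ImageDataLoaders.from_name_func(
        path,
        get_image_files(path),
        label_func=label_func,
        valid_pct=0.2,
        seed=42,
        item_tfms=Resize(224),
        **loader_kwargs(),  # adds num_workers=0 only on Databricks
    )
```

With num_workers=0 batches are loaded in the main process, so data loading is slower, but no worker processes are involved.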
Do let us know if you figure out the solution, in case we see similar reports in the future.
@jph00 absolutely
cc @mengxr
Thanks for sharing! I was using Databricks for the fastai course too and had the same issue. I was able to run it fine after setting num_workers=0.
@SoulEvill thanks for letting us know. We hope we can fix this in the next release, MLR 7.4, so you wouldn't need num_workers=0 then.
Have you tracked down what the source issue was?
@jph00 from what I understand, there are multiple issues. Here is a fix for one of them: https://github.com/pytorch/pytorch/pull/45870. @mengxr can comment on this better.
Spin-off from https://github.com/pytorch/pytorch/issues/44628
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
$ conda install -c fastai -c pytorch fastai fastbook powerai::"sentencepiece<0.1.90"
produces:
Additional debugging showed the following local variables inside of _conv_forward when it failed:
Expected behavior
No errors expected.
Environment
List of some of the conda/pip packages -
How installed (conda, pip, source): conda, see above