d3banjan closed this issue 5 years ago
Hi, unfortunately we no longer officially support the generate.py script, as it doesn't make sense with our most recent set of classifiers.
We also removed the --cpu option, as we now try to set the device automatically.
Two other questions: 1) is your apex install from our official apex repo or from this repo? 2) would you mind posting the inconsistent-size error you were getting? I think some of our model checkpoints might have been corrupted.
I had installed apex using pip with a git+https URL --
pip install --install-option="--cpp_ext" git+https://github.com/NVIDIA/apex.git
/home/debanjan/miniconda3/envs/dsenv/lib/python3.6/site-packages/pip/_internal/commands/install.py:211: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
cmdoptions.check_install_build_global(options)
Collecting git+https://github.com/NVIDIA/apex.git
Cloning https://github.com/NVIDIA/apex.git to /tmp/pip-req-build-yvqk30va
Requirement already satisfied (use --upgrade to upgrade): apex==0.1 from git+https://github.com/NVIDIA/apex.git in /home/debanjan/miniconda3/envs/dsenv/lib/python3.6/site-packages
Building wheels for collected packages: apex
Running setup.py bdist_wheel for apex ... done
Stored in directory: /tmp/pip-ephem-wheel-cache-0s50p6ow/wheels/20/ef/9d/1967e1ee0ae20e7dc8e41ab7208017893b0a026243189508a3
Successfully built apex
Your response alerted me to the fact that apex was already installed using setup.py, so I recreated a fresh conda environment as follows --
conda create -c anaconda -n torch_apex_cpu_env python=3.7
conda activate torch_apex_cpu_env
conda install -c pytorch pytorch-cpu
python setup.py install
Warning: Torch did not find available GPUs on this system.
If your intention is to cross-compile, this is not an error.
torch.__version__ = 1.0.0
Building module.
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing requirements to apex.egg-info/requires.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
...
pip install numpy --upgrade # to fix "RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa"
python3 generate.py --model mLSTM --load_model ../../data/raw/mlstm.pt --neuron 2388 --visualize
Creating mlstm
Traceback (most recent call last):
File "generate.py", line 90, in <module>
sd = torch.load(f)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
return _load(f, map_location, pickle_module)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
result = unpickler.load()
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
data_type(size), location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
result = fn(storage, location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 94, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 78, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
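The traceback's own suggestion can be verified in isolation. Below is a minimal, self-contained sketch (an in-memory buffer stands in for the real mlstm.pt checkpoint, which is not available here):

```python
import io
import torch

# Save a tiny state dict to an in-memory buffer; this stands in for a
# real checkpoint file such as mlstm.pt.
buf = io.BytesIO()
torch.save({'weight': torch.zeros(2, 3)}, buf)
buf.seek(0)

# map_location='cpu' remaps any stored CUDA tensors onto the CPU, so the
# load succeeds even when torch.cuda.is_available() is False.
sd = torch.load(buf, map_location='cpu')
print(tuple(sd['weight'].shape))  # (2, 3)
```

A checkpoint saved on a GPU records CUDA storage locations, which is why the plain `torch.load(f)` call fails on a CPU-only machine.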
The problem is reproduced with the in-repo apex installation as well.
I edited generate.py as follows --
$ git diff generate.py
diff --git a/generate.py b/generate.py
index 94c1d2d..eca80a2 100644
--- a/generate.py
+++ b/generate.py
@@ -87,7 +87,7 @@ if args.cuda:
if args.fp16:
model.half()
with open(args.load_model, 'rb') as f:
- sd = torch.load(f)
+ sd = torch.load(f, map_location='cpu')
try:
model.load_state_dict(sd)
except:
Then the size mismatches appear, as I reported earlier --
python3 generate.py --model mLSTM --load_model ../../data/raw/mlstm.pt --neuron 2388 --visualize
Creating mlstm
Traceback (most recent call last):
File "generate.py", line 92, in <module>
model.load_state_dict(sd)
File "/media/debanjan/WORK-SD/projects/sentiment-discovery/model/model.py", line 56, in load_state_dict
self.decoder.load_state_dict(state_dict['decoder'], strict=strict)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([257, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for bias: copying a param with shape torch.Size([257]) from checkpoint, the shape in current model is torch.Size([256]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate.py", line 95, in <module>
model.load_state_dict(sd)
File "/media/debanjan/WORK-SD/projects/sentiment-discovery/model/model.py", line 56, in load_state_dict
self.decoder.load_state_dict(state_dict['decoder'], strict=strict)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([257, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for bias: copying a param with shape torch.Size([257]) from checkpoint, the shape in current model is torch.Size([256]).
i.e. the problem is the 256-vs-257 mismatch between the saved checkpoint and the dimension the current model expects.
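One way to pin down such a mismatch before calling load_state_dict is to open the checkpoint on the CPU and print the stored shapes. A self-contained sketch follows; an in-memory buffer stands in for ../../data/raw/mlstm.pt, and the decoder shapes mirror the 257-row checkpoint (column count shrunk for brevity):

```python
import io
import torch

# Fake checkpoint with the same nested structure as the real one:
# a 'decoder' entry whose weight has one extra row (257 instead of 256).
buf = io.BytesIO()
torch.save({'decoder': {'weight': torch.zeros(257, 8),
                        'bias': torch.zeros(257)}}, buf)
buf.seek(0)

# Inspect the stored shapes to see exactly which parameters disagree
# with a freshly constructed model.
sd = torch.load(buf, map_location='cpu')
for name, tensor in sd['decoder'].items():
    print(name, tuple(tensor.shape))
```

Printing weight (257, 8) and bias (257,) here makes the extra row visible immediately, without triggering the double traceback above.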
Thanks for alerting us to the apex install problem; I'll try to get a fix out for that.
As for the mismatch problem: you'll notice that in our pretraining script we set data_size equal to tokenizer.num_tokens. This is because our tokenizer reserves some extra tokens for padding.
In future updates we'll be releasing an embedding data structure that manages embedding sizes from the number of tokens for you automatically so you don't have to worry about this.
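The accounting described above can be sketched in plain Python. The names below are illustrative, not the repo's actual API; the point is that a byte-level tokenizer covers the 256 byte values plus reserved padding tokens, and the decoder must be sized to that total:

```python
# Hypothetical constants; the real tokenizer may reserve a different
# number of special tokens.
VOCAB_BYTES = 256        # raw byte values 0..255
RESERVED_TOKENS = 1      # e.g. one padding token

# Pretraining sets data_size = tokenizer.num_tokens, so the decoder is
# built with 257 output rows, not 256.
num_tokens = VOCAB_BYTES + RESERVED_TOKENS
data_size = num_tokens

# This is exactly the decoder weight shape stored in the checkpoint,
# while a model built with data_size = 256 expects (256, 4096) --
# hence the size mismatch in the traceback above.
decoder_weight_shape = (data_size, 4096)
print(decoder_weight_shape)  # (257, 4096)
```

Constructing the model with data_size = tokenizer.num_tokens, as the pretraining script does, makes the checkpoint load cleanly.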
The apex install conflicts should be fixed now. I don't intend to add any support for generate.py (especially with transformers), but I do hope whatever troubles you had have been resolved.
I tried the following command. I also tried using model.cpu() when torch.cuda.is_available() is False, and using torch.load with map_location='cpu', which led to inconsistencies in tensor/ndarray sizes.
PS: I didn't find a --cpu option in the docs. Others have discussed running a model on the CPU, but I didn't find anything else.
PPS: I am using pytorch-cpu version 0.4.1=py36_cpu_1 from the conda pytorch channel.