d3banjan closed this issue 5 years ago
Hi, unfortunately we no longer officially support the generate.py script, as it doesn't make sense with our most recent set of classifiers.
We also removed the --cpu option, as we now try to set the device automatically.
Two other questions: 1) is your apex install from our official apex repo or from this repo? 2) would you mind posting the inconsistent-size error you were getting? I think some of our model checkpoints might have been corrupted.
I had installed apex using pip with a git+https URL --
pip install --install-option="--cpp_ext" git+https://github.com/NVIDIA/apex.git
/home/debanjan/miniconda3/envs/dsenv/lib/python3.6/site-packages/pip/_internal/commands/install.py:211: UserWarning: Disabling all use of wheels due to the use of --build-options / --global-options / --install-options.
cmdoptions.check_install_build_global(options)
Collecting git+https://github.com/NVIDIA/apex.git
Cloning https://github.com/NVIDIA/apex.git to /tmp/pip-req-build-yvqk30va
Requirement already satisfied (use --upgrade to upgrade): apex==0.1 from git+https://github.com/NVIDIA/apex.git in /home/debanjan/miniconda3/envs/dsenv/lib/python3.6/site-packages
Building wheels for collected packages: apex
Running setup.py bdist_wheel for apex ... done
Stored in directory: /tmp/pip-ephem-wheel-cache-0s50p6ow/wheels/20/ef/9d/1967e1ee0ae20e7dc8e41ab7208017893b0a026243189508a3
Successfully built apex
Your response alerted me to the fact that apex was already installed using setup.py, so I recreated a fresh conda environment as follows --
conda create -c anaconda -n torch_apex_cpu_env python=3.7
conda activate torch_apex_cpu_env
conda install -c pytorch pytorch-cpu
python setup.py install
Warning: Torch did not find available GPUs on this system.
If your intention is to cross-compile, this is not an error.
torch.__version__ = 1.0.0
Building module.
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing requirements to apex.egg-info/requires.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
...
pip install numpy --upgrade # to fix "RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa"
python3 generate.py --model mLSTM --load_model ../../data/raw/mlstm.pt --neuron 2388 --visualize
Creating mlstm
Traceback (most recent call last):
File "generate.py", line 90, in <module>
sd = torch.load(f)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 367, in load
return _load(f, map_location, pickle_module)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 538, in _load
result = unpickler.load()
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 504, in persistent_load
data_type(size), location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 113, in default_restore_location
result = fn(storage, location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 94, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/serialization.py", line 78, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
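The traceback's own suggestion can be verified in isolation. Below is a minimal, self-contained sketch (an in-memory buffer stands in for the real mlstm.pt checkpoint, which is not available here):

```python
import io
import torch

# Save a tiny state dict to an in-memory buffer; this stands in for a
# real checkpoint file such as mlstm.pt.
buf = io.BytesIO()
torch.save({'weight': torch.zeros(2, 3)}, buf)
buf.seek(0)

# map_location='cpu' remaps any stored CUDA tensors onto the CPU, so the
# load succeeds even when torch.cuda.is_available() is False.
sd = torch.load(buf, map_location='cpu')
print(tuple(sd['weight'].shape))  # (2, 3)
```

A checkpoint saved on a GPU records CUDA storage locations, which is why the plain `torch.load(f)` call fails on a CPU-only machine.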
The problem is reproduced with the in-repo apex installation as well.
I edited generate.py as follows --
$ git diff generate.py
diff --git a/generate.py b/generate.py
index 94c1d2d..eca80a2 100644
--- a/generate.py
+++ b/generate.py
@@ -87,7 +87,7 @@ if args.cuda:
if args.fp16:
model.half()
with open(args.load_model, 'rb') as f:
- sd = torch.load(f)
+ sd = torch.load(f, map_location='cpu')
try:
model.load_state_dict(sd)
except:
Then the size mismatches appear, as I reported earlier --
python3 generate.py --model mLSTM --load_model ../../data/raw/mlstm.pt --neuron 2388 --visualize
Creating mlstm
Traceback (most recent call last):
File "generate.py", line 92, in <module>
model.load_state_dict(sd)
File "/media/debanjan/WORK-SD/projects/sentiment-discovery/model/model.py", line 56, in load_state_dict
self.decoder.load_state_dict(state_dict['decoder'], strict=strict)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([257, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for bias: copying a param with shape torch.Size([257]) from checkpoint, the shape in current model is torch.Size([256]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "generate.py", line 95, in <module>
model.load_state_dict(sd)
File "/media/debanjan/WORK-SD/projects/sentiment-discovery/model/model.py", line 56, in load_state_dict
self.decoder.load_state_dict(state_dict['decoder'], strict=strict)
File "/home/debanjan/miniconda3/envs/nvidia_cpu_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Linear:
size mismatch for weight: copying a param with shape torch.Size([257, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for bias: copying a param with shape torch.Size([257]) from checkpoint, the shape in current model is torch.Size([256]).
i.e. the problem is the 256-vs-257 mismatch between the saved checkpoint and the dimension the current model expects.
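One way to pin down such a mismatch before calling load_state_dict is to open the checkpoint on the CPU and print the stored shapes. A self-contained sketch follows; an in-memory buffer stands in for ../../data/raw/mlstm.pt, and the decoder shapes mirror the 257-row checkpoint (column count shrunk for brevity):

```python
import io
import torch

# Fake checkpoint with the same nested structure as the real one:
# a 'decoder' entry whose weight has one extra row (257 instead of 256).
buf = io.BytesIO()
torch.save({'decoder': {'weight': torch.zeros(257, 8),
                        'bias': torch.zeros(257)}}, buf)
buf.seek(0)

# Inspect the stored shapes to see exactly which parameters disagree
# with a freshly constructed model.
sd = torch.load(buf, map_location='cpu')
for name, tensor in sd['decoder'].items():
    print(name, tuple(tensor.shape))
```

Printing weight (257, 8) and bias (257,) here makes the extra row visible immediately, without triggering the double traceback above.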
Thanks for alerting us to the apex install problem; I'll try to get a fix out for that.
As for the mismatch problem: you'll notice that in our pretraining script we set data_size equal to tokenizer.num_tokens. This is because our tokenizer reserves some extra tokens for padding.
In future updates we'll be releasing an embedding data structure that manages embedding sizes from the number of tokens for you automatically so you don't have to worry about this.
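The accounting described above can be sketched in plain Python. The names below are illustrative, not the repo's actual API; the point is that a byte-level tokenizer covers the 256 byte values plus reserved padding tokens, and the decoder must be sized to that total:

```python
# Hypothetical constants; the real tokenizer may reserve a different
# number of special tokens.
VOCAB_BYTES = 256        # raw byte values 0..255
RESERVED_TOKENS = 1      # e.g. one padding token

# Pretraining sets data_size = tokenizer.num_tokens, so the decoder is
# built with 257 output rows, not 256.
num_tokens = VOCAB_BYTES + RESERVED_TOKENS
data_size = num_tokens

# This is exactly the decoder weight shape stored in the checkpoint,
# while a model built with data_size = 256 expects (256, 4096) --
# hence the size mismatch in the traceback above.
decoder_weight_shape = (data_size, 4096)
print(decoder_weight_shape)  # (257, 4096)
```

Constructing the model with data_size = tokenizer.num_tokens, as the pretraining script does, makes the checkpoint load cleanly.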
The apex install conflicts should be fixed now. I don't intend to add any support for generate.py (especially with transformers), but I do hope whatever troubles you had have been resolved.
I tried the following command. I also tried using model.cpu() when torch.cuda.is_available() is False, and using torch.load with map_location='cpu', which led to inconsistencies in tensor/ndarray sizes.
PS: I didn't find a --cpu option in the docs. Others have discussed running a model on the CPU, but I didn't find anything else.
PPS: I am using pytorch-cpu version 0.4.1=py36_cpu_1 from the conda pytorch channel.