UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Quantization does not work together with start_multi_process_pool #1342

Open rogeriochaves opened 2 years ago

rogeriochaves commented 2 years ago

Hello, I'm trying to run a quantized model with a multi-process pool; this is my model:

from sentence_transformers import SentenceTransformer
from torch.nn import Embedding, Linear
from torch.quantization import quantize_dynamic

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
model = quantize_dynamic(model, {Linear, Embedding})
pool = model.start_multi_process_pool(['cpu', 'cpu'])

However, I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/rchaves/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/rchaves/.pyenv/versions/3.9.6/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/Users/rchaves/.pyenv/versions/3.9.6/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 90, in rebuild_tensor
    t = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
  File "/Users/rchaves/.pyenv/versions/3.9.6/lib/python3.9/site-packages/torch/_utils.py", line 132, in _rebuild_tensor
    t = torch.tensor([], dtype=storage.dtype, device=storage.device)
NotImplementedError: Could not run 'aten::empty.memory_format' with arguments from the 'QuantizedCPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty.memory_format' is only available for these backends: [CPU, Meta, MkldnnCPU, SparseCPU, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].

CPU: registered at aten/src/ATen/RegisterCPU.cpp:16286 [kernel]
Meta: registered at aten/src/ATen/RegisterMeta.cpp:9460 [kernel]
MkldnnCPU: registered at aten/src/ATen/RegisterMkldnnCPU.cpp:563 [kernel]
SparseCPU: registered at aten/src/ATen/RegisterSparseCPU.cpp:959 [kernel]
BackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:609 [kernel]
Named: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]
ADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:60 [backend fallback]
AutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
UNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradMLC: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradNestedTensor: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
AutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_4.cpp:9226 [autograd kernel]
Tracer: registered at ../torch/csrc/autograd/generated/TraceType_4.cpp:9909 [kernel]
Autocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:255 [backend fallback]
Batched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1019 [backend fallback]
VmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]

What can I do? Thanks in advance!
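
From the traceback, the failure happens while the spawned worker unpickles the model: start_multi_process_pool sends the already-quantized model to each child process, and rebuilding the packed quantized tensors calls 'aten::empty.memory_format', which has no QuantizedCPU kernel. A minimal sketch that should reproduce the same error without sentence-transformers (assuming the same torch build; the worker function and qmodel are just illustrative names):

# Minimal sketch: the spawn context pickles the quantized model into the
# child process, and rebuilding its packed tensors needs
# 'aten::empty.memory_format', which is not implemented for QuantizedCPU.
import torch
import torch.multiprocessing as mp
from torch.nn import Linear
from torch.quantization import quantize_dynamic

def worker(model):
    print(model)

if __name__ == "__main__":
    qmodel = quantize_dynamic(torch.nn.Sequential(Linear(8, 8)), {Linear})
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker, args=(qmodel,))
    p.start()  # the child raises the NotImplementedError while unpickling
    p.join()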

rogeriochaves commented 2 years ago

For now I managed to work around it by redefining start_multi_process_pool and quantizing after the process spawn:

  def start_multi_process_pool(self, target_devices):
      """
      Starts a multi-process pool to process the encoding with several independent processes.
      This method is recommended if you want to encode on multiple GPUs. It is advised
      to start only one process per GPU. This method works together with encode_multi_process
      :param target_devices: PyTorch target devices, e.g. cuda:0, cuda:1... If None, all available CUDA devices will be used
      :return: Returns a dict with the target processes, an input queue and an output queue.
      """
      if target_devices is None:
          if torch.cuda.is_available():
              target_devices = ['cuda:{}'.format(i) for i in range(torch.cuda.device_count())]
          else:
              target_devices = ['cpu']*4

      ctx = mp.get_context('spawn')
      input_queue = ctx.Queue()
      output_queue = ctx.Queue()
      processes = []

      for cuda_id in target_devices:
          p = ctx.Process(target=_encode_multi_process_worker, args=(cuda_id, self, input_queue, output_queue), daemon=True)
          p.start()
          processes.append(p)

      return {'input': input_queue, 'output': output_queue, 'processes': processes}

  def _encode_multi_process_worker(target_device: str, model, input_queue, results_queue):
      """
      Internal working process to encode sentences in multi-process setup
      """
      # Quantize inside the worker, after the process has been spawned, so the
      # quantized model is never pickled across the process boundary.
      model = quantize_dynamic(model)
      while True:
          try:
              id, batch_size, sentences = input_queue.get()
              embeddings = model.encode(sentences, device=target_device, show_progress_bar=False, convert_to_numpy=True, batch_size=batch_size)
              results_queue.put([id, embeddings])
          except queue.Empty:
              break
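
The important detail in this workaround is that quantize_dynamic runs inside the worker, after the process has been spawned: the parent only pickles the original, un-quantized model, so the failing quantized-tensor rebuild never happens, and each worker quantizes its own copy before encoding. Assuming both functions above are defined at module level, one way to wire the override in is a sketch like this (encode_multi_process and stop_multi_process_pool are the library's existing helpers; the sentence list is just an example):

from sentence_transformers import SentenceTransformer

# Patch the override onto the class so instances pick it up.
SentenceTransformer.start_multi_process_pool = start_multi_process_pool

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
pool = model.start_multi_process_pool(['cpu', 'cpu'])
embeddings = model.encode_multi_process(["first sentence", "second sentence"], pool)
model.stop_multi_process_pool(pool)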