UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Issues with multi-process pool inference on multi-HPU Gaudi cards #2780

Closed (rbrugaro closed this issue 3 months ago)

rbrugaro commented 3 months ago

Code to reproduce error:

"""
This example starts multiple processes, which encode
sentences in parallel. This gives a near linear speed-up
when encoding large text collections.
"""

import logging

from sentence_transformers import LoggingHandler, SentenceTransformer

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()]
)

# Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == "__main__":
    # Create a large list of 400k sentences
    sentences = ["This is sentence {}".format(i) for i in range(400000)]

    # Define the model
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Start the multi-process pool on available HPU devices
    pool = model.start_multi_process_pool(["hpu", "hpu"]) # when two cards available

    # Compute the embeddings using the multi-process pool
    emb = model.encode_multi_process(sentences, pool)
    print("Embeddings computed. Shape:", emb.shape)

    # Optional: Stop the processes in the pool
    model.stop_multi_process_pool(pool)

Output when running the multi-process pool on one card

root@idc708053:/home/share/sentence-transformers# python examples/applications/computing-embeddings/computing_embeddings_multi_hpu.py 
2024-06-25 19:59:39 - Use pytorch device_name: hpu
2024-06-25 19:59:40 - Load pretrained SentenceTransformer: all-MiniLM-L6-v2
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375232 KB
------------------------------------------------------------------------------
2024-06-25 19:59:44 - Start multi-process pool on devices: hpu
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:462: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:319: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:319: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056375232 KB
------------------------------------------------------------------------------
Embeddings computed. Shape: (400000, 384)
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The result is correct, but a leaked semaphore warning is emitted. When running on multiple cards, card utilization looks wrong and the number of leaked semaphores is even higher.
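Not part of the original report, but a sketch of things worth trying to narrow down the multi-card behavior (these are assumptions, not a confirmed fix): address the cards with explicit indices such as "hpu:0" and "hpu:1" instead of repeating "hpu", pass an explicit chunk_size so the sentences are split evenly across the workers, and stop the pool in a finally block so worker resources are released even if encoding raises.

from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    sentences = ["This is sentence {}".format(i) for i in range(400000)]
    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Assumption to test: address each Gaudi card explicitly so the two worker
    # processes cannot end up on the same device.
    pool = model.start_multi_process_pool(target_devices=["hpu:0", "hpu:1"])
    try:
        # chunk_size controls how many sentences are sent to a worker per task;
        # batch_size is the per-forward-pass batch on each card.
        emb = model.encode_multi_process(sentences, pool, batch_size=64, chunk_size=5000)
        print("Embeddings computed. Shape:", emb.shape)
    finally:
        # Always stop the workers, even on failure, so their queues and
        # semaphores are cleaned up before interpreter shutdown.
        model.stop_multi_process_pool(pool)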

pmpalang commented 3 months ago
  1. The leaked semaphore issue is observed with all three device types: CPU, CUDA, and HPU. (Screenshots of the CPU, CUDA, and HPU runs were attached to the original comment.)

  2. I placed a breakpoint after model.stop_multi_process_pool(pool) and observed that the "leaked semaphore" warning only appears after the breakpoint, while the Python process is shutting down. It therefore seems to be beyond the scope of sentence_transformers; see the sketch after this list for a way to check that in isolation.

  3. We observe a reduction in runtime as we increase the number of HPUs, which implies that multi-HPU inference is indeed working. (Screenshots of the 1-HPU and 2-HPU runs were attached to the original comment.)
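A minimal sketch (not from the thread) that may help confirm point 2: it uses only the kind of primitives start_multi_process_pool builds on, a spawned worker process plus multiprocessing queues, without sentence_transformers. If the resource_tracker warning still appears at interpreter shutdown, the leak is unrelated to this library; whether it reproduces can depend on platform and Python version.

import torch.multiprocessing as mp

def worker(in_queue, out_queue):
    # Echo items back until the None sentinel arrives, mimicking a pool worker loop.
    for item in iter(in_queue.get, None):
        out_queue.put(item)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    in_queue, out_queue = ctx.Queue(), ctx.Queue()
    proc = ctx.Process(target=worker, args=(in_queue, out_queue), daemon=True)
    proc.start()

    in_queue.put("ping")
    print(out_queue.get())  # -> "ping"

    in_queue.put(None)  # ask the worker to exit
    proc.join()
    # Any "leaked semaphore" warning would be printed after this script exits,
    # during interpreter shutdown, i.e. outside sentence_transformers code.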

Collaborative debugging effort with @ZhengHongming888 and @rbrugaro.

Thanks, Poovaiah

rbrugaro commented 3 months ago

Thanks!