NVIDIA-Merlin / Merlin

NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

[BUG] TritonServer start fails to load after Initializing QueryFaiss #760

Open vs385 opened 1 year ago

vs385 commented 1 year ago

Tried running this notebook example:

When I reach the point of starting the server with `tritonserver --model-repository=/ensemble_export_path/ --backend-config=tensorflow,version=2` from a terminal (opened alongside the running notebook in JupyterHub), the terminal gets stuck and stops producing output after the lines below:

2022-12-07 17:58:42.661940: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.662552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 19610 MB memory: -> device: 1, name: NVIDIA A10G, pci bus id: 0000:00:1c.0, compute capability: 8.6
2022-12-07 17:58:42.662612: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.663241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 19610 MB memory: -> device: 2, name: NVIDIA A10G, pci bus id: 0000:00:1d.0, compute capability: 8.6
2022-12-07 17:58:42.663299: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:991] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-07 17:58:42.663917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 19610 MB memory: -> device: 3, name: NVIDIA A10G, pci bus id: 0000:00:1e.0, compute capability: 8.6
2022-12-07 17:58:42.672242: I tensorflow/cc/saved_model/loader.cc:230] Restoring SavedModel bundle.
2022-12-07 17:58:42.728537: I tensorflow/cc/saved_model/loader.cc:214] Running initialization op on SavedModel bundle at path: /Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/1_predicttensorflow/1/model.savedmodel
2022-12-07 17:58:42.752115: I tensorflow/cc/saved_model/loader.cc:321] SavedModel load for tags { serve }; Status: success: OK. Took 106507 microseconds.
I1207 22:58:42.752287 1485 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: 0_queryfeast (GPU device 1)
I1207 22:58:45.055801 1485 python_be.cc:1767] TRITONBACKEND_ModelInstanceInitialize: 2_queryfaiss (GPU device 0)

I'm running with the following setup:

- Merlin image: nvcr.io/nvidia/merlin/merlin-tensorflow:22.10
- Hardware: an EC2 g5 instance
- Python version: 3.8.10
- TensorFlow version (GPU): tensorflow 2.9.1+nv22.8
- Faiss: faiss 1.7.2, faiss-gpu 1.7.2

bschifferer commented 1 year ago

@vs385 do you use the notebooks from inside the container or from GitHub?

vs385 commented 1 year ago

Hi @bschifferer, I'm using the notebook from inside the container

vs385 commented 1 year ago

Hi @bschifferer, just following up here: any help with this issue? Thanks again for looking into this.

rnyak commented 1 year ago

@vs385 can you please test our latest image, merlin-tensorflow:22.12? Thanks.

vs385 commented 1 year ago

I pulled the latest image, @rnyak @bschifferer, and it seems there's an issue with the dask_cudf version, which triggers a failing pandas import downstream.

The pandas version is 1.5.3, and dask_cudf is the one installed in the base image for merlin-tensorflow:22.12.

For now I manually updated /usr/local/lib/python3.8/dist-packages/cudf/core/dtypes.py and changed the line `from pandas.core.arrays._arrow_utils import ArrowIntervalType` to `from pandas.core.arrays.arrow.extension_types import ArrowIntervalType`.
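Spelled out as the actual edit (only this import line changes; the rest of dtypes.py stays as shipped):

```python
# /usr/local/lib/python3.8/dist-packages/cudf/core/dtypes.py
# old import: this private pandas module no longer exists in pandas 1.5.x
# from pandas.core.arrays._arrow_utils import ArrowIntervalType

# manual workaround described above: import the type from its new location
from pandas.core.arrays.arrow.extension_types import ArrowIntervalType
```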

But I'm still getting lots of errors running NVTabular preprocessing, such as:



```
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
AttributeError("'DataFrame' object has no attribute '_meta_nonempty'")
```

------------------------------------------
```
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 15
     11 # import seedir as sd
     12 
     13 # External Dependencies
     14 import cupy as cp
---> 15 from dask_cudf import read_csv
     16 from dask_cuda import LocalCUDACluster
     17 from dask.distributed import Client

File /usr/local/lib/python3.8/dist-packages/dask_cudf/__init__.py:5
      1 # Copyright (c) 2018-2022, NVIDIA CORPORATION.
      3 from dask.dataframe import from_delayed
----> 5 import cudf
      6 from cudf._version import get_versions
      8 from . import backends

File /usr/local/lib/python3.8/dist-packages/cudf/__init__.py:12
      8 from numba import config as numba_config, cuda
     10 import rmm
---> 12 from cudf.api.types import dtype
     13 from cudf import api, core, datasets, testing
     14 from cudf._version import get_versions

File /usr/local/lib/python3.8/dist-packages/cudf/api/__init__.py:3
      1 # Copyright (c) 2021, NVIDIA CORPORATION.
----> 3 from cudf.api import extensions, types
      5 __all__ = ["extensions", "types"]

File /usr/local/lib/python3.8/dist-packages/cudf/api/types.py:18
     15 from pandas.api import types as pd_types
     17 import cudf
---> 18 from cudf.core.dtypes import (  # noqa: F401
     19     _BaseDtype,
     20     dtype,
     21     is_categorical_dtype,
     22     is_decimal32_dtype,
     23     is_decimal64_dtype,
     24     is_decimal128_dtype,
     25     is_decimal_dtype,
     26     is_interval_dtype,
     27     is_list_dtype,
     28     is_struct_dtype,
     29 )
     32 def is_numeric_dtype(obj):
     33     """Check whether the provided array or dtype is of a numeric dtype.
     34 
     35     Parameters
   (...)
     43         Whether or not the array or dtype is of a numeric dtype.
     44     """

File /usr/local/lib/python3.8/dist-packages/cudf/core/dtypes.py:13
     11 from pandas.api import types as pd_types
     12 from pandas.api.extensions import ExtensionDtype
---> 13 from pandas.core.arrays._arrow_utils import ArrowIntervalType
     14 from pandas.core.dtypes.dtypes import (
     15     CategoricalDtype as pd_CategoricalDtype,
     16     CategoricalDtypeType as pd_CategoricalDtypeType,
     17 )
     19 import cudf

ModuleNotFoundError: No module named 'pandas.core.arrays._arrow_utils'
```
rnyak commented 1 year ago

@vs385 you would not need to manually update any files if you are using the Merlin docker images. Can you please tell us where you are running this, and what the hardware on your EC2 g5 instance is? Can you also share the cuda-toolkit version (`nvcc --version`) and the driver version (`nvidia-smi` output)? Thanks.

vs385 commented 1 year ago

Hi @rnyak: Thanks so much for assisting with this issue

I'm running this container on an EC2 g5 instance (g5.12xlarge).

nvidia-smi:

NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8

CUDA toolkit version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

lscpu


```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7R32
Stepping:            0
CPU MHz:             2799.884
BogoMIPS:            5599.76
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr wbnoinvd arat npt nrip_save rdpid
```
rnyak commented 1 year ago

@vs385 thanks. May I ask if you could run this example nb and this one instead (don't forget to run the ETL and training notebooks first), and see if you can load the models on Triton? Please do not patch any files; just use the merlin-tensorflow:22.12 image as it is (you might want to create a clean instance) and share the error messages you are getting.

vs385 commented 1 year ago

Hi @rnyak, to run the former, I have to run this one first to train the DLRM model. I'm currently running the same as above with just the merlin-tensorflow:22.12 image:

InvalidArgumentError                      Traceback (most recent call last)
Cell In[1], line 9
      6 from merlin.models.utils.example_utils import workflow_fit_transform
      7 from merlin.schema.tags import Tags
----> 9 import merlin.models.tf as mm
     10 from merlin.io.dataset import Dataset
     11 import tensorflow as tf

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/__init__.py:104
    102 from merlin.models.tf.models.base import BaseModel, Model, RetrievalModel, RetrievalModelV2
    103 from merlin.models.tf.models.ranking import DCNModel, DeepFMModel, DLRMModel, WideAndDeepModel
--> 104 from merlin.models.tf.models.retrieval import (
    105     MatrixFactorizationModel,
    106     MatrixFactorizationModelV2,
    107     TwoTowerModel,
    108     TwoTowerModelV2,
    109     YoutubeDNNRetrievalModel,
    110     YoutubeDNNRetrievalModelV2,
    111 )
    112 from merlin.models.tf.outputs.base import ModelOutput
    113 from merlin.models.tf.outputs.classification import BinaryOutput, CategoricalOutput

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/models/retrieval.py:22
     20 from merlin.models.tf.prediction_tasks.base import ParallelPredictionBlock, PredictionTask
     21 from merlin.models.tf.prediction_tasks.next_item import NextItemPredictionTask
---> 22 from merlin.models.tf.prediction_tasks.retrieval import ItemRetrievalTask
     23 from merlin.models.utils.schema_utils import categorical_cardinalities
     24 from merlin.schema import Schema, Tags

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py:33
     28 from merlin.models.utils import schema_utils
     29 from merlin.schema import Schema, Tags
     32 @tf.keras.utils.register_keras_serializable(package="merlin_models")
---> 33 class ItemRetrievalTask(MultiClassClassificationTask):
     34     """Prediction-task for item-retrieval.
     35 
     36     Parameters
   (...)
     61             The item retrieval prediction task
     62     """
     64     DEFAULT_LOSS = "categorical_crossentropy"

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/prediction_tasks/retrieval.py:65, in ItemRetrievalTask()
     34 """Prediction-task for item-retrieval.
     35 
     36 Parameters
   (...)
     61         The item retrieval prediction task
     62 """
     64 DEFAULT_LOSS = "categorical_crossentropy"
---> 65 DEFAULT_METRICS = TopKMetricsAggregator.default_metrics(top_ks=[10])
     67 def __init__(
     68     self,
     69     schema: Schema,
   (...)
     78     **kwargs,
     79 ):
     80     self.samplers = samplers

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py:483, in TopKMetricsAggregator.default_metrics(cls, top_ks, **kwargs)
    481 metrics: List[TopkMetric] = []
    482 for k in top_ks:
--> 483     metrics.extend([RecallAt(k), MRRAt(k), NDCGAt(k), AvgPrecisionAt(k), PrecisionAt(k)])
    484 # Using Top-k metrics aggregator provides better performance than having top-k
    485 # metrics computed separately, as prediction scores are sorted only once for all metrics
    486 aggregator = cls(*metrics)

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py:360, in RecallAt.__init__(self, k, pre_sorted, name, **kwargs)
    359 def __init__(self, k=10, pre_sorted=False, name="recall_at", **kwargs):
--> 360     super().__init__(recall_at, k=k, pre_sorted=pre_sorted, name=name, **kwargs)

File /usr/local/lib/python3.8/dist-packages/merlin/models/tf/metrics/topk.py:233, in TopkMetric.__init__(self, fn, k, pre_sorted, name, log_base, seed, **kwargs)
    231 if name is not None:
    232     name = f"{name}_{k}"
--> 233 super().__init__(name=name, **kwargs)
    234 self._fn = fn
    235 self.k = k

File /usr/local/lib/python3.8/dist-packages/keras/dtensor/utils.py:144, in inject_mesh.<locals>._wrap_function(instance, *args, **kwargs)
    142 if mesh is not None:
    143     instance._mesh = mesh
--> 144 init_method(instance, *args, **kwargs)

File /usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py:622, in Mean.__init__(self, name, dtype)
    620 @dtensor_utils.inject_mesh
    621 def __init__(self, name="mean", dtype=None):
--> 622     super().__init__(
    623         reduction=metrics_utils.Reduction.WEIGHTED_MEAN,
    624         name=name,
    625         dtype=dtype,
    626     )

File /usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py:439, in Reduce.__init__(self, reduction, name, dtype)
    437 super().__init__(name=name, dtype=dtype)
    438 self.reduction = reduction
--> 439 self.total = self.add_weight("total", initializer="zeros")
    440 if reduction in [
    441     metrics_utils.Reduction.SUM_OVER_BATCH_SIZE,
    442     metrics_utils.Reduction.WEIGHTED_MEAN,
    443 ]:
    444     self.count = self.add_weight("count", initializer="zeros")

File /usr/local/lib/python3.8/dist-packages/keras/metrics/base_metric.py:375, in Metric.add_weight(self, name, shape, aggregation, synchronization, initializer, dtype)
    372     additional_kwargs = {}
    374 with tf_utils.maybe_init_scope(layer=self):
--> 375     return super().add_weight(
    376         name=name,
    377         shape=shape,
    378         dtype=self._dtype if dtype is None else dtype,
    379         trainable=False,
    380         initializer=initializer,
    381         collections=[],
    382         synchronization=synchronization,
    383         aggregation=aggregation,
    384         **additional_kwargs,
    385     )

File /usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py:705, in Layer.add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint, use_resource, synchronization, aggregation, **kwargs)
    702 if layout:
    703     getter = functools.partial(getter, layout=layout)
--> 705 variable = self._add_variable_with_custom_getter(
    706     name=name,
    707     shape=shape,
    708     # TODO(allenl): a `make_variable` equivalent should be added as a
    709     # `Trackable` method.
    710     getter=getter,
    711     # Manage errors in Layer rather than Trackable.
    712     overwrite=True,
    713     initializer=initializer,
    714     dtype=dtype,
    715     constraint=constraint,
    716     trainable=trainable,
    717     use_resource=use_resource,
    718     collections=collections_arg,
    719     synchronization=synchronization,
    720     aggregation=aggregation,
    721     caching_device=caching_device,
    722 )
    723 if regularizer is not None:
    724     # TODO(fchollet): in the future, this should be handled at the
    725     # level of variable creation, and weight regularization losses
    726     # should be variable attributes.
    727     name_in_scope = variable.name[: variable.name.find(":")]

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/trackable/base.py:489, in Trackable._add_variable_with_custom_getter(self, name, shape, dtype, initializer, getter, overwrite, **kwargs_for_getter)
    479   if (checkpoint_initializer is not None and
    480       not (isinstance(initializer, CheckpointInitialValueCallable) and
    481            (initializer.restore_uid > checkpoint_initializer.restore_uid))):
   (...)
    486     # then we'll catch that when we call _track_trackable. So this is
    487     # "best effort" to set the initializer with the highest restore UID.
    488     initializer = checkpoint_initializer
--> 489 new_variable = getter(
    490     name=name,
    491     shape=shape,
    492     dtype=dtype,
    493     initializer=initializer,
    494     **kwargs_for_getter)
    496 # If we set an initializer and the variable processed it, tracking will not
    497 # assign again. It will add this variable to our dependencies, and if there
    498 # is a non-trivial restoration queued, it will handle that. This also
    499 # handles slot variables.
    500 if not overwrite or isinstance(new_variable, Trackable):

File /usr/local/lib/python3.8/dist-packages/keras/engine/base_layer_utils.py:134, in make_variable(name, shape, dtype, initializer, trainable, caching_device, validate_shape, constraint, use_resource, collections, synchronization, aggregation, partitioner, layout)
    127     use_resource = True
    129 if layout is None:
    130     # In theory, in `use_resource` is True and `collections` is empty
    131     # (that is to say, in TF2), we can use tf.Variable.
    132     # However, this breaks legacy (Estimator) checkpoints because
    133     # it changes variable names. Remove this when V1 is fully deprecated.
--> 134     return tf1.Variable(
    135         initial_value=init_val,
    136         name=name,
    137         trainable=trainable,
    138         caching_device=caching_device,
    139         dtype=variable_dtype,
    140         validate_shape=validate_shape,
    141         constraint=constraint,
    142         use_resource=use_resource,
    143         collections=collections,
    144         synchronization=synchronization,
    145         aggregation=aggregation,
    146         shape=variable_shape if variable_shape else None,
    147     )
    148 else:
    149     return dtensor.DVariable(
    150         initial_value=init_val,
    151         name=name,
   (...)
    160         shape=variable_shape if variable_shape else None,
    161     )

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File /usr/local/lib/python3.8/dist-packages/keras/initializers/initializers_v2.py:171, in Zeros.__call__(self, shape, dtype, **kwargs)
    167 if layout:
    168     return utils.call_with_layout(
    169         tf.zeros, layout, shape=shape, dtype=dtype
    170     )
--> 171 return tf.zeros(shape, dtype)

InvalidArgumentError: Device ordinals must be set for all virtual devices or none. But the device_ordinal is specified for 1 while previous devices didn't have any set.
vs385 commented 1 year ago

When running the latter (examples/getting-started-movielens/04-Triton-Inference-with-TF.ipynb), I get a similar error:

When running the notebook for the TF model (03-Training-with-TF.ipynb), I get an error in the cell that loads a batch (cell [10]):

InvalidArgumentError                      Traceback (most recent call last)
Cell In[10], line 1
----> 1 batch = train_dataset_tf.peek()
      2 batch[0]

File /usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:286, in LoaderBase.peek(self)
    284 def peek(self):
    285     """Get the next batch without advancing the iterator."""
--> 286     return self._peek_next_batch()

File /usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:308, in LoaderBase._peek_next_batch(self)
    306 # get the first chunks
    307 if self._batch_itr is None:
--> 308     self._fetch_chunk()
    310 # try to iterate through existing batches
    311 try:

File /usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:298, in LoaderBase._fetch_chunk(self)
    296 if isinstance(chunks, Exception):
    297     self.stop()
--> 298     raise chunks
    299 self._batch_itr = iter(chunks)

File /usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:764, in ChunkQueue.load_chunks(self, dev)
    762 itr = iter(self.itr)
    763 if self.dataloader.device != "cpu":
--> 764     with self.dataloader._get_device_ctx(dev):
    765         self.chunk_logic(itr)
    766 else:

File /usr/lib/python3.8/contextlib.py:113, in _GeneratorContextManager.__enter__(self)
    111 del self.args, self.kwds, self.func
    112 try:
--> 113     return next(self.gen)
    114 except StopIteration:
    115     raise RuntimeError("generator didn't yield") from None

File /usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:181, in Loader._get_device_ctx(self, dev)
    170 @contextlib.contextmanager
    171 def _get_device_ctx(self, dev):
    172     # with tf.device("/device:GPU:{}".format(dev)) as tf_device:
   (...)
    178     # RuntimeErrors when exiting if two dataloaders
    179     # are running at once (e.g. train and validation)
    180     if dev != "cpu":
--> 181         yield tf.device("/GPU:" + str(dev))
    182     else:
    183         # https://www.tensorflow.org/guide/gpu#manual_device_placement
    184         yield tf.device("/device:CPU:0")

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py:5555, in device_v2(device_name)
   5553 if callable(device_name):
   5554   raise RuntimeError("tf.device does not support functions.")
-> 5555 return device(device_name)

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py:5504, in device(device_name_or_function)
   5500   if callable(device_name_or_function):
   5501     raise RuntimeError(
   5502         "tf.device does not support functions when eager execution "
   5503         "is enabled.")
-> 5504   return context.device(device_name_or_function)
   5505 elif executing_eagerly_outside_functions():
   5506   @tf_contextlib.contextmanager
   5507   def combined(device_name_or_function):

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py:2364, in device(name)
   2344 def device(name):
   2345   """Context-manager to force placement of operations and Tensors on a device.
   2346 
   2347   Example:
   (...)
   2362     Context manager for setting the device.
   2363   """
-> 2364   ensure_initialized()
   2365   return context().device(name)

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py:2159, in ensure_initialized()
   2157 def ensure_initialized():
   2158   """Initialize the context."""
-> 2159   context().ensure_initialized()

File /usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/context.py:622, in Context.ensure_initialized(self)
    618   pywrap_tfe.TFE_ContextOptionsSetRunEagerOpAsFunction(
    619       opts, self._run_eager_op_as_function)
    620   pywrap_tfe.TFE_ContextOptionsSetJitCompileRewrite(
    621       opts, self._jit_compile_rewrite)
--> 622   context_handle = pywrap_tfe.TFE_NewContext(opts)
    623 finally:
    624   pywrap_tfe.TFE_DeleteContextOptions(opts)

InvalidArgumentError: Device ordinals must be set for all virtual devices or none. But the device_ordinal is specified for 1 while previous devices didn't have any set.
rnyak commented 1 year ago

Can you please add the following lines at the top of your notebook, in the first cell, then restart the notebook and run the cells again? Let's see if this solves the issue.

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
vs385 commented 1 year ago

Hi @rnyak, yes! This solved the issue when running the base merlin-tensorflow:22.12 image. The thing is, we're building off of this image and adding some requirements that we pip install.

in requirements.txt

awscli==1.27.1
bokeh==2.1.1
seedir==0.4.0
feast==0.19.4
faiss-gpu==1.7.2
requests==2.28.1
optuna==3.0.4
plotly==5.11.0
jsonschema==4.17.3

Then in our Dockerfile we run the following:

COPY requirements.txt requirements.txt
RUN pip install --upgrade pip
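# note: xargs runs a separate pip install per requirement, so pip's resolver
# never sees the requirements together and cannot flag conflicts between them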
RUN cat requirements.txt | xargs -n 1 -L 1 pip install

When I build this image and spin up a container exactly as I would with the merlin-tensorflow:22.12 base image, and then try running an example that has `import cudf` or `import dask_cudf`, I still keep getting the same error outlined here.

Would you know if any of the libraries pip-installed from the requirements above could conflict for some reason with the cuda-toolkit built into the base 22.12 image? This is very weird.

rnyak commented 1 year ago

@vs385 I believe you are now good to use the merlin-tensorflow:22.12 docker image, but you are getting issues when you install the libs in requirements.txt above. Maybe you can do the installation one by one and see which one is breaking cudf?
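One way to do that bisection, sketched against the requirements.txt posted above (a rough sketch for a fresh container, not an official recipe):

```python
# install each requirement on its own, then check that cudf/dask_cudf still import
import subprocess
import sys

requirements = [
    "awscli==1.27.1", "bokeh==2.1.1", "seedir==0.4.0", "feast==0.19.4",
    "faiss-gpu==1.7.2", "requests==2.28.1", "optuna==3.0.4",
    "plotly==5.11.0", "jsonschema==4.17.3",
]

for pkg in requirements:
    subprocess.run([sys.executable, "-m", "pip", "install", pkg], check=True)
    ok = subprocess.run([sys.executable, "-c", "import cudf, dask_cudf"]).returncode == 0
    print(f"{pkg}: cudf import {'OK' if ok else 'BROKEN'}")
```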

Then I'd recommend escalating this issue in the RAPIDS cuDF repo: https://github.com/rapidsai/cudf

vs385 commented 1 year ago

Hi @rnyak, yeah, so when trying to run the following notebooks using the base 22.12 image (without installing any of the other libs), I noticed the conflict is caused by installing feast.

--> Installing feast<0.20 creates a conflict with dask (it uninstalls dask==2022.7.1 and installs dask==2022.1.1), so I have to reinstall dask==2022.7.1.
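A quick check of which pins actually survive, assuming it is run inside the container right after the installs:

```python
# confirm whether installing feast pulled dask/distributed away from 2022.7.1
from importlib.metadata import version  # available in Python 3.8+

for pkg in ("dask", "distributed", "feast"):
    print(pkg, version(pkg))  # expecting dask/distributed at 2022.7.1 and feast < 0.20
```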

But to come back to the original problem this thread was opened for, I'm still getting an error when trying to run Triton server:

I0201 18:54:02.733370 2064 python_be.cc:1856] TRITONBACKEND_ModelInstanceInitialize: 0_queryfeast (GPU device 1)
2023-02-01 18:54:05.258837: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-01 18:54:06.989881: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.990660: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.991378: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.992097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.993065: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.993753: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.994454: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.995114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.995809: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.996470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.997145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:06.997807: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I0201 18:54:07.203147 2064 python_be.cc:1856] TRITONBACKEND_ModelInstanceInitialize: 2_queryfaiss (GPU device 0)
2023-02-01 18:54:09.729914: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-01 18:54:11.472684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.473436: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.474147: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.474860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.475856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.476551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.477236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.477892: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.478553: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.479223: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.479911: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-01 18:54:11.480589: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

when running `tritonserver --model-repository=/Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/ --backend-config=tensorflow,version=2`, basically at the QueryFaiss step.

rnyak commented 1 year ago

@vs385 are you running the notebooks as they are, or with your own custom datasets?

please run this and test again:

pip install dask==2022.7.1 distributed==2022.7.1

vs385 commented 1 year ago

Hi @rnyak, I tried adding those while running the notebook example, and I still get the same issue where it gets stuck at the `TRITONBACKEND_ModelInstanceInitialize: 2_queryfaiss (GPU device 0)` step:

I0202 16:25:04.567045 3446 python_be.cc:1856] TRITONBACKEND_ModelInstanceInitialize: 2_queryfaiss (GPU device 0)
2023-02-02 16:25:07.089032: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-02 16:25:08.815282: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.816043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.816780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.817518: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.818501: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.819179: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.819862: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.820560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.821256: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.821951: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.822636: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-02 16:25:08.823297: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
rnyak commented 1 year ago

@vs385 just to confirm: you ran `pip install dask==2022.7.1 distributed==2022.7.1` first and restarted the kernel right afterwards?

vs385 commented 1 year ago

@rnyak, so I ran the notebook as follows, using the base merlin-tensorflow:22.12 image:

> This notebook is developed and tested using the latest merlin-tensorflow container from the NVIDIA NGC catalog. To find the tag for the most recently-released container, refer to the Merlin TensorFlow page.

-> I added a cell as follows (as you advised me above)

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

Then I ran the cell

# for running this example on GPU, install the following libraries
%pip install "feast<0.20" faiss-gpu

# for running this example on CPU, uncomment the following lines
# %pip install tensorflow-cpu "feast<0.20" faiss-cpu
# %pip uninstall cudf

with the line `%pip install "feast<0.20" faiss-gpu` uncommented.

Then I added a new cell and ran it

%pip install dask==2022.7.1 distributed==2022.7.1

I then restarted the kernel and ran the notebook from the beginning, but this time I did not run the two pip-install cells.

Then I ran the ensemble notebook (#2), created poc_ensemble/, launched a terminal, and ran:

tritonserver --model-repository=/Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/ --backend-config=tensorflow,version=2

Now I'm getting the below error:

I0204 19:55:12.556858 2592 pb_stub.cc:245]  Failed to initialize Python stub for auto-complete: CUDARuntimeError: cudaErrorInitializationError: initialization error

At:
  /usr/local/lib/python3.8/dist-packages/rmm/_cuda/gpu.py(101): getDeviceCount
  /usr/local/lib/python3.8/dist-packages/cudf/utils/gpu_utils.py(57): validate_setup
  /usr/local/lib/python3.8/dist-packages/cudf/__init__.py(5): <module>
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(848): exec_module
  <frozen importlib._bootstrap>(686): _load_unlocked
  <frozen importlib._bootstrap>(975): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
  /usr/local/lib/python3.8/dist-packages/merlin/core/dispatch.py(52): <module>
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(848): exec_module
  <frozen importlib._bootstrap>(686): _load_unlocked
  <frozen importlib._bootstrap>(975): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
  /usr/local/lib/python3.8/dist-packages/merlin/systems/dag/dictarray.py(21): <module>
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(848): exec_module
  <frozen importlib._bootstrap>(686): _load_unlocked
  <frozen importlib._bootstrap>(975): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
  /usr/local/lib/python3.8/dist-packages/merlin/systems/dag/__init__.py(19): <module>
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(848): exec_module
  <frozen importlib._bootstrap>(686): _load_unlocked
  <frozen importlib._bootstrap>(975): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap>(961): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
  /Merlin/examples/Building-and-deploying-multi-stage-RecSys/poc_ensemble/2_queryfaiss/1/model.py(34): <module>
  <frozen importlib._bootstrap>(219): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(848): exec_module
  <frozen importlib._bootstrap>(686): _load_unlocked
  <frozen importlib._bootstrap>(975): _find_and_load_unlocked
  <frozen importlib._bootstrap>(991): _find_and_load
rnyak commented 1 year ago

@vs385 sorry for the inconvenience. I have a couple of questions:

Is it possible for you to launch an EC2 instance with only a single GPU and test again?

Also, did you test the two nb examples below again after solving your `Device ordinals must be set for all virtual devices or none...` error? Let's see if you are able to do inference without the faiss and feast models. Could you please run these two notebooks in order and see whether inference works for you?

Thanks.

vs385 commented 1 year ago

@rnyak

> When you run the second notebook, are you shutting off the first one? Please be sure you have free GPU memory when you run the second.
>
> What's the GPU type and memory? An A10G? (Note that since these example nbs run on a single GPU, you don't need multiple GPUs.)
>
> Do you see any model loaded on Triton successfully? Do you see a READY status on the terminal for any model?
>
> Is it possible for you to launch an EC2 instance with only a single GPU and test again?
>
> Besides, did you test the two nb examples below again after solving your `Device ordinals must be set for all virtual devices or none...` error?

I was able to run these two nb examples successfully and launch the server :) So there might be an issue with running the ensemble with the QueryFaiss and feast functionalities?

Thanks so much for helping with this

rnyak commented 1 year ago

@karlhigley @jperez999 any idea on the issue reported above, where QueryFaiss cannot be loaded into Triton Server? Thanks.