NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Nemo Citrinet model fails to deploy to Riva 1.10.0-beta #3933

Closed · vinhngx closed this issue 2 years ago

vinhngx commented 2 years ago

Describe the bug

This NeMo model: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_citrinet_1024

When built and deployed with Riva 1.10.0-beta (offline config), it fails with:

2022-04-05 05:08:47,705 [INFO] Writing Riva model repository to '/data/models'...
2022-04-05 05:08:47,705 [INFO] The riva model repo target directory is /data/models
2022-04-05 05:09:02,391 [INFO] Using tensorrt
2022-04-05 05:09:02,396 [INFO] Extract_binaries for nn -> /data/models/riva-trt-my_speech_service-offline-2-am-streaming-offline/1
2022-04-05 05:09:02,396 [INFO] extracting {'onnx': ('nemo.collections.asr.models.ctc_bpe_models.EncDecCTCModelBPE', 'model_graph.onnx')} -> /data/models/riva-trt-my_speech_service-offline-2-am-streaming-offline/1
2022-04-05 05:09:16,155 [INFO] Printing copied artifacts:
2022-04-05 05:09:16,155 [INFO] {'onnx': '/data/models/riva-trt-my_speech_service-offline-2-am-streaming-offline/1/model_graph.onnx'}
2022-04-05 05:09:16,155 [INFO] Building TRT engine from ONNX file
[W] colored module is not installed, will not use colors when logging. To enable colors, please install the colored module: python3 -m pip install colored
[W] 'Shape tensor cast elision' routine failed with: None
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING /home/jenkins/agent/workspace/OSS/OSS_L0_MergeRequest/oss/build/third_party.protobuf/src/third_party.protobuf/src/google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 572097855
[04/05/2022-05:09:19] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/05/2022-05:10:02] [TRT] [W] Output type must be INT32 for shape outputs
[... the warning above repeats 23 times in the original log ...]
[04/05/2022-05:10:06] [TRT] [E] 4: [shapeCompiler.cpp::evaluateShapeChecks::832] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: condition '==' violated. Where_35: dimensions not compatible for select)
2022-04-05 05:10:06,610 [ERROR] Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/cli/deploy.py", line 100, in deploy_from_rmir
    generator.serialize_to_disk(
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 398, in serialize_to_disk
    module.serialize_to_disk(repo_dir, rmir, config_only, verbose, overwrite)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 282, in serialize_to_disk
    self.update_binary(version_dir, rmir, verbose)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/asr.py", line 133, in update_binary
    RivaTRTConfigGenerator.update_binary_from_copied(self, version_dir, rmir, copied, verbose)
  File "/usr/local/lib/python3.8/dist-packages/servicemaker/triton/triton.py", line 688, in update_binary_from_copied
    with self.build_trt_engine_from_onnx(model_weights) as engine, open(
AttributeError: __enter__
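
The final AttributeError looks like a symptom rather than the root cause: judging from the traceback, build_trt_engine_from_onnx presumably returns None once the TensorRT build fails with the Error Code 4 above, and Python 3.8 reports using None in a with-statement as AttributeError: __enter__. A minimal sketch of that failure mode (build_engine is a hypothetical stand-in):

    # Stand-in for a builder that returns None when the engine build fails.
    def build_engine():
        return None

    # Using the None result as a context manager raises the same error
    # seen in the traceback (on Python 3.8):
    with build_engine() as engine:  # AttributeError: __enter__
        pass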

Steps/Code to reproduce bug

  1. Download the above model from NGC and convert it to Riva format using the docker container nvcr.io/nvidia/nemo:22.01:

    nemo2riva --out my_speech_service-2.riva stt_en_citrinet_1024.nemo

  2. Build for offline deployment using the docker container nvcr.io/nvidia/riva/riva-speech:1.10.0-beta-servicemaker:

    riva-build speech_recognition -f \
    /servicemaker-dev/stt_en_citrinet_1024_v1.0.0rc1/my_speech_service-offline-2.rmir \
    /servicemaker-dev/stt_en_citrinet_1024_v1.0.0rc1/my_speech_service-2.riva \
    --offline \
    --name=my_speech_service-offline-2 \
    --ms_per_timestep=80 \
    --featurizer.use_utterance_norm_params=False \
    --featurizer.precalc_norm_time_steps=0 \
    --featurizer.precalc_norm_params=False \
    --chunk_size=900 \
    --left_padding_size=0. \
    --right_padding_size=0. \
    --decoder_type=flashlight \
    --flashlight_decoder.asr_model_delay=-1 \
    --decoding_language_model_binary=/myworkspace/speechtotext_en_us_lm_vdeployable_v1.1/riva_asr_train_datasets_3gram.binary \
    --decoding_vocab=/myworkspace/speechtotext_en_us_lm_vdeployable_v1.1/flashlight_decoder_vocab.txt \
    --flashlight_decoder.lm_weight=0.2 \
    --flashlight_decoder.word_insertion_score=0.2 \
    --flashlight_decoder.beam_threshold=20. \
    --language_code=en-US 
  3. Deploy with riva-deploy using the docker container nvcr.io/nvidia/riva/riva-speech:1.10.0-beta-servicemaker:

    riva-deploy -f /servicemaker-dev/stt_en_citrinet_1024_v1.0.0rc1/my_speech_service-offline-2.rmir /data/models

Expected behavior

The model builds a TensorRT engine and deploys to the Riva model repository without error.

Environment details

NVIDIA docker images as listed in the reproduction steps above. GPU: V100 32GB.

vinhngx commented 2 years ago

A workaround was suggested by Patrice C.: add --max-dim=100000 to the nemo2riva export (presumably raising the maximum dynamic-shape dimension used when generating the TensorRT optimization profile, which avoids the kOPT shape-constraint violation above):

    nemo2riva --out my_speech_service-2.riva stt_en_citrinet_1024.nemo --max-dim=100000

titu1994 commented 2 years ago

@borisfom can that 100000 flag be set programmatically in NeMo for certain models?

neso613 commented 2 years ago

@titu1994 I am working with the Mandarin Citrinet pretrained model. I am able to convert it to an RMIR and generate the Riva models.

My question is about its inference script: I need to change the language_code.

en-US is for English. What would it be for Mandarin... zh? Please help.

vinhngx commented 2 years ago

I think, functionally, the language code can be pretty much anything you like; just be consistent between when you build the Riva service and when you make the inference call. But to adhere to convention, use one of zh-CN, zh-TW, zh-CHT, zh-CHS...

Riva uses the language code to match a capitalization & punctuation model with the same code. To call a custom Riva acoustic model, though, you'll have to specify the model name in the request.

You can also explicitly select which ASR model to use by setting the model field of the RecognitionConfig protobuf object to the value of <pipeline_name> which was used with the riva-build command. This enables you to deploy multiple ASR pipelines concurrently and select which one to use at runtime.
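
As an illustration, here is a minimal sketch of that model selection, assuming the nvidia-riva-client Python package (this thread does not itself name a client library) and the pipeline name from the riva-build step above:

    import riva.client  # assumption: pip install nvidia-riva-client

    # language_code must match the --language_code used with riva-build;
    # model pins the request to a specific <pipeline_name>.
    config = riva.client.RecognitionConfig(
        language_code="zh-CN",                # same code used at build time
        model="my_speech_service-offline-2",  # --name passed to riva-build
    )
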
neso613 commented 2 years ago

Thanks @vinhngx for the information.

neso613 commented 2 years ago

@titu1994 @vinhngx I have another question about an error I'm seeing.

I have created the models/ directory using the riva-deploy command, following https://docs.nvidia.com/deeplearning/riva/user-guide/docs/model-overview.html?highlight=enemo#using-riva-deploy-and-riva-speech-container-advanced

Following point 2 in the above link, I ran:

    sudo docker run --runtime=nvidia -it --rm -e NVIDIA_VISIBLE_DEVICES=0 -v /path_of_local_models_folder/:/data -p 50051 --name riva-speech nvcr.io/nvidia/riva/riva-speech:2.1.0-server start-riva --riva-uri=0.0.0.0:50051 --nlp_service=false --asr_service=true --tts_service=false

But it gets stuck at these logs:

I0513 08:08:20.884277 97 server.cc:592]
+------------------------------------------------+---------+--------+
| Model                                          | Version | Status |
+------------------------------------------------+---------+--------+
| riva-asr                                       | 1       | READY  |
| riva-asr-ctc-decoder-cpu-streaming             | 1       | READY  |
| riva-asr-feature-extractor-streaming           | 1       | READY  |
| riva-asr-voice-activity-detector-ctc-streaming | 1       | READY  |
| riva-trt-riva-asr-am-streaming                 | 1       | READY  |
+------------------------------------------------+---------+--------+

I0513 08:08:20.936965 97 metrics.cc:623] Collecting metrics for GPU 0: A100-SXM4-40GB
I0513 08:08:20.938043 97 tritonserver.cc:1932]
+----------------------------------+--------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                    |
+----------------------------------+--------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                   |
| server_version                   | 2.19.0                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_con  |
|                                  | figuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace                  |
| model_repository_path[0]         | /data/models                                                                                             |
| model_control_mode               | MODE_NONE                                                                                                |
| strict_model_config              | 1                                                                                                        |
| rate_limit                       | OFF                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                |
| cuda_memory_pool_byte_size{0}    | 1000000000                                                                                               |
| response_cache_byte_size         | 0                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                      |
| strict_readiness                 | 1                                                                                                        |
| exit_timeout                     | 30                                                                                                       |
+----------------------------------+--------------------------------------------------------------------------------------------------------+

I0513 08:08:20.940017 97 grpc_server.cc:4375] Started GRPCInferenceService at 0.0.0.0:8001
I0513 08:08:20.940303 97 http_server.cc:3075] Started HTTPService at 0.0.0.0:8000
I0513 08:08:20.981602 97 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Triton server is ready...
I0513 08:08:21.016841 175 riva_server.cc:118] Using Insecure Server Credentials
I0513 08:08:21.020602 175 model_registry.cc:112] Successfully registered: riva-asr for ASR
W0513 08:08:21.032874 175 grpc_riva_asr.cc:188] riva-asr has no configured wfst normalizer model
I0513 08:08:21.033236 175 riva_server.cc:158] Riva Conversational AI Server listening on 0.0.0.0:50051
W0513 08:08:21.033278 175 stats_reporter.cc:41] No API key provided. Stats reporting disabled.

Am I missing something?

itzsimpl commented 2 years ago

This seems to be OK; I don't see any error in the log. The penultimate line states that the Riva Conversational AI Server is listening on port 50051. As a next step, you can use transcribe_file.py from the Riva Quickstart examples folder, or write your own gRPC client to communicate with Riva.
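
If you do write your own client, here is a minimal offline-recognition sketch, again assuming the nvidia-riva-client Python package and a hypothetical audio.wav; adjust the server address if docker mapped container port 50051 to a different host port:

    import riva.client  # assumption: pip install nvidia-riva-client

    # Connect to the server started by the docker run command above.
    auth = riva.client.Auth(uri="localhost:50051")
    asr = riva.client.ASRService(auth)

    config = riva.client.RecognitionConfig(
        encoding=riva.client.AudioEncoding.LINEAR_PCM,
        sample_rate_hertz=16000,  # match the audio file
        language_code="en-US",
        max_alternatives=1,
    )

    with open("audio.wav", "rb") as f:  # hypothetical test file
        audio = f.read()

    response = asr.offline_recognize(audio, config)
    for result in response.results:
        print(result.alternatives[0].transcript)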

neso613 commented 2 years ago

Okay, thanks @itzsimpl.

But one more question: in config.sh I noticed that en-US is used twice. Is there a punctuation model available for Mandarin? @titu1994

itzsimpl commented 2 years ago

No, see https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html. Note that the latest version of Riva is 2.1.0.

neso613 commented 2 years ago

> No, see https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-overview.html. Note that the latest version of Riva is 2.1.0.

Got it, thanks.