NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Sparrowhawk slower than Python implementation #174

Open riqiang-dp opened 1 month ago

riqiang-dp commented 1 month ago

Describe the bug

As you guys suggested, I tried exporting the grammars and running the normalizer with Sparrowhawk, but it actually takes even longer than Python.

n_utts vs. time taken for Sparrowhawk:
100: 0m27.430s
50: 0m15.666s
10: 0m4.690s

For Python:
100: 11s
50: 5.6s
10: 0.85s

The time taken for Sparrowhawk seems a bit non-linear.
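One way to read these numbers is to look at the marginal cost per utterance between consecutive measurements, since taking differences cancels any fixed startup cost. The sketch below uses only the timings reported above; the helper name `marginal_cost` is made up for illustration and is not part of NeMo or Sparrowhawk.

```python
# Wall-clock timings reported above, as {n_utts: seconds}.
SPARROWHAWK = {10: 4.690, 50: 15.666, 100: 27.430}
PYTHON = {10: 0.85, 50: 5.6, 100: 11.0}


def marginal_cost(timings):
    """Seconds per extra utterance between consecutive measurements.

    Taking differences cancels any fixed startup cost (hypothetical helper).
    """
    points = sorted(timings.items())
    return [
        (t1 - t0) / (n1 - n0)
        for (n0, t0), (n1, t1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for name, timings in (("Sparrowhawk", SPARROWHAWK), ("Python", PYTHON)):
        costs = ", ".join(f"{c:.3f}" for c in marginal_cost(timings))
        print(f"{name}: {costs} s per additional utterance")
```

On the reported numbers this gives roughly 0.27 and 0.24 s per additional utterance for Sparrowhawk against about 0.12 and 0.11 s for Python, so the gap does not look like startup time alone.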

Steps/Code to reproduce bug

Exported my custom grammar and ran the Sparrowhawk docker. There was another issue reporting this slowdown: https://github.com/NVIDIA/NeMo-text-processing/issues/82

Expected behavior

C++ is supposed to be faster.

ekmb commented 1 month ago

@anand-nv could you please comment on this?

anand-nv commented 1 month ago

Can you provide the steps you are following to evaluate? Providing the Python scripts and Sparrowhawk code snippets used for benchmarking and performing ITN/TN would be useful.

riqiang-dp commented 1 month ago

For Python, I'm simply initializing the text normalizer and running it in a for loop:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case='cased',
    lang='en',
    whitelist='path/to/whitelist.tsv',
    overwrite_cache=False,
    cache_dir='./assets/'
)

Then, for each line of text in a file:

line = normalizer.normalize(line, punct_pre_process=True, punct_post_process=True, verbose=True)
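For a like-for-like comparison with the C++ timing, it can help to time construction and the normalization loop separately on the Python side as well. The following is a minimal sketch, not the script used in this issue: `benchmark` and `UppercaseNormalizer` are hypothetical names, and the stand-in class only exists so the example runs without NeMo installed.

```python
import time


def benchmark(make_normalizer, lines):
    """Time construction and the normalization loop separately.

    `make_normalizer` is any zero-argument callable returning an object with
    a `normalize(text)` method; both names are assumptions of this sketch.
    """
    t0 = time.perf_counter()
    normalizer = make_normalizer()
    t1 = time.perf_counter()
    outputs = [normalizer.normalize(line) for line in lines]
    t2 = time.perf_counter()
    return {"init_s": t1 - t0, "loop_s": t2 - t1, "outputs": outputs}


class UppercaseNormalizer:
    """Stand-in so the sketch runs without NeMo installed."""

    def normalize(self, text):
        return text.upper()


if __name__ == "__main__":
    result = benchmark(UppercaseNormalizer, ["it costs $5", "call me at 9 am"])
    print(f"init: {result['init_s']:.6f} s, loop: {result['loop_s']:.6f} s")
```

With the real normalizer, `make_normalizer` would wrap the `Normalizer(...)` construction shown above, and a small adapter would be needed to forward the extra keyword arguments passed to `normalize`.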

Sparrowhawk

bash export_grammars.sh --GRAMMARS=tn_grammars --LANGUAGE=en --OVERWRITE_CACHE=true --WHITELIST path/to/whitelist.tsv --INPUT_CASE=cased --MODE=interactive

Then, in the Docker container, I replaced test.txt with my own text and ran:

time normalizer_main --config=sparrowhawk_configuration.ascii_proto --multi_line_text < test.txt > results.txt

I also modified normalizer_main.cc to print out the actual time taken in the loop

  const auto normalize_start = std::chrono::steady_clock::now();
  for (const auto& sentence : sentences) {
    string output;
    normalizer->Normalize(sentence, &output);
    std::cout << output << std::endl;
  }
  const auto normalize_end = std::chrono::steady_clock::now();
  const auto normalize_time = std::chrono::duration_cast<std::chrono::milliseconds>(
    normalize_end - normalize_start).count();
  std::cerr << "Time taken to normalize: " << normalize_time << " milliseconds" << std::endl;

anand-nv commented 1 month ago

Do you have the "actual time estimates" for the C++ implementation normalizer_main.cc ?

riqiang-dp commented 1 month ago

> Do you have the "actual time estimates" for the C++ implementation normalizer_main.cc ?

I don't have the numbers or the Docker container open anymore, but like I said, the time measured in the loop is always around 600 ms less than the bash time. So it was about 26.8 s for 100 utterances, 15 s for 50, and 4 s for 10, which is why I assume the init time was around 600 ms.
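The subtraction behind that estimate can be written out as a short sketch. It uses only figures quoted in this thread; the in-loop times are approximate values recalled from memory, and `overhead_and_rate` is a hypothetical helper name.

```python
# Figures quoted in this thread, as {n_utts: seconds}.
BASH_TIME = {10: 4.690, 50: 15.666, 100: 27.430}  # time normalizer_main ...
LOOP_TIME = {10: 4.0, 50: 15.0, 100: 26.8}        # approximate in-loop time


def overhead_and_rate(total, loop):
    """Return {n: (total - loop, loop / n)} for each utterance count."""
    return {n: (total[n] - loop[n], loop[n] / n) for n in sorted(total)}


if __name__ == "__main__":
    for n, (overhead, rate) in overhead_and_rate(BASH_TIME, LOOP_TIME).items():
        print(f"{n:>3} utts: overhead {overhead:.2f} s, loop {rate:.3f} s/utt")
```

Because the in-loop figures are rounded, the resulting 0.6 to 0.7 s overhead is only indicative. The per-utterance loop time still falls from about 0.40 s at 10 utterances to about 0.27 s at 100.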

anand-nv commented 1 month ago

Are you using the Dockerfile provided here for building Sparrowhawk? If so, can you try adding 'CXXFLAGS' and 'CFLAGS' to ./configure and rebuilding the Docker image:

./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w'

riqiang-dp commented 1 month ago

I see, let me try. Thanks.

riqiang-dp commented 1 month ago

I got this error trying to compile:

79.22 libtool: link: g++ -g -O2 -w -std=c++11 -o .libs/normalizer_main normalizer_main.o  ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
79.29 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
79.29 collect2: error: ld returned 1 exit status

github-actions[bot] commented 4 days ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.