Open riqiang-dp opened 1 month ago
@anand-nv could you please comment on this?
Can you provide the steps you are following to evaluate? Providing the Python scripts and Sparrowhawk code snippets used for benchmarking and performing ITN/TN would be useful.
For Python, I'm simply initializing the text normalizer and running it in a for loop:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case='cased',
    lang='en',
    whitelist='path/to/whitelist.tsv',
    overwrite_cache=False,
    cache_dir='./assets/'
)
with open('path/to/input.txt') as f:
    for line in f:
        line = normalizer.normalize(line, punct_pre_process=True, punct_post_process=True, verbose=True)
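To separate the one-time grammar-loading cost from the per-line cost, the loop body can be timed on its own. This is just a sketch of the harness I mean, not code from the thread; `str.strip` below is a placeholder for `normalizer.normalize`:

```python
import time

def benchmark(normalize, lines):
    """Time only the per-line calls, so any initialization cost paid
    before this function is entered stays out of the measurement."""
    start = time.perf_counter()
    outputs = [normalize(line) for line in lines]
    elapsed = time.perf_counter() - start
    return outputs, elapsed

# str.strip stands in for normalizer.normalize in this sketch.
outputs, elapsed = benchmark(str.strip, ["  one hundred  ", "  $5  "])
print(outputs)  # ['one hundred', '$5']
```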
For Sparrowhawk:
bash export_grammars.sh --GRAMMARS=tn_grammars --LANGUAGE=en --OVERWRITE_CACHE=true --WHITELIST path/to/whitelist.tsv --INPUT_CASE=cased --MODE=interactive
and in the Docker container, I replaced test.txt with my own text and ran:
time normalizer_main --config=sparrowhawk_configuration.ascii_proto --multi_line_text < test.txt > results.txt
I also modified normalizer_main.cc to print out the actual time taken in the loop:
const auto normalize_start = std::chrono::steady_clock::now();
for (const auto& sentence : sentences) {
  string output;
  normalizer->Normalize(sentence, &output);
  std::cout << output << std::endl;
}
const auto normalize_end = std::chrono::steady_clock::now();
const auto normalize_time = std::chrono::duration_cast<std::chrono::milliseconds>(
    normalize_end - normalize_start).count();
std::cerr << "Time taken to normalize: " << normalize_time << " milliseconds" << std::endl;
Do you have the "actual time estimates" for the C++ implementation normalizer_main.cc?
I don't have the numbers / the docker container open anymore, but like I said, it's always around 600 ms less than the bash time. So it was about 100: 26.8s, 50: 15s, 10: 4s, which is why I assume the init time was around 600 ms.
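For what it's worth, a least-squares fit over those three points (my own back-of-envelope check, not a number from the thread) gives roughly 0.25 s per utterance with a fixed offset closer to 1.8 s than 600 ms, which would suggest either a larger per-run overhead or slightly non-linear scaling:

```python
# Fit t = slope * n + intercept to the reported (n_utts, seconds) pairs;
# the intercept estimates fixed per-run overhead such as grammar loading.
data = [(10, 4.0), (50, 15.0), (100, 26.8)]
n_mean = sum(n for n, _ in data) / len(data)
t_mean = sum(t for _, t in data) / len(data)
slope = (sum((n - n_mean) * (t - t_mean) for n, t in data)
         / sum((n - n_mean) ** 2 for n, _ in data))
intercept = t_mean - slope * n_mean
print(f"{slope:.3f} s/utt, {intercept:.2f} s overhead")  # 0.253 s/utt, 1.79 s overhead
```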
Are you using the Dockerfile provided here for building Sparrowhawk? If so, can you try adding CXXFLAGS and CFLAGS to ./configure and rebuilding the Docker image:

./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w'
I see, let me try, thanks.
I got this error trying to compile:
79.22 libtool: link: g++ -g -O2 -w -std=c++11 -o .libs/normalizer_main normalizer_main.o ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
79.29 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
79.29 collect2: error: ld returned 1 exit status
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Describe the bug
As you suggested, I tried exporting the grammars and running the normalizer with Sparrowhawk, but it actually takes even longer than Python. n_utts vs. time taken for Sparrowhawk: 100: 0m27.430s, 50: 0m15.666s, 10: 0m4.690s.
For Python: 100: 11s, 50: 5.6s, 10: 0.85s.
The time taken by Sparrowhawk also seems a bit non-linear in the number of utterances.
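One quick way to see the non-linearity (my own check on the numbers above, not from the issue): per-utterance wall time falls as the batch grows, which is what a large fixed startup cost would produce:

```python
# Reported Sparrowhawk wall times, converted to seconds per utterance.
timings = {10: 4.690, 50: 15.666, 100: 27.430}
per_utt = {n: round(t / n, 3) for n, t in timings.items()}
print(per_utt)  # {10: 0.469, 50: 0.313, 100: 0.274}
```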
Steps/Code to reproduce bug
Exported my custom grammar and ran the Sparrowhawk docker. There was another issue reporting this slowdown: https://github.com/NVIDIA/NeMo-text-processing/issues/82
Expected behavior
C++ is supposed to be faster.