NVIDIA / NeMo-text-processing

NeMo text processing for ASR and TTS
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/text_normalization/wfst/wfst_text_normalization.html
Apache License 2.0

Zh tn #50

Closed BuyuanCui closed 11 months ago

BuyuanCui commented 1 year ago

What does this PR do?

This PR changes the ZH_TN grammars that came from an outside contributor. The existing grammars and alignments are updated to stay consistent with the ZH_ITN grammar.

1) The cardinal grammar is split into two grammars, cardinal and decimal, with decimal now an independent class.
2) Cardinal grammar coverage is extended up to the hundred billions.
3) Added an ordinal grammar that builds on the cardinal grammar by handling the morpheme "第", which marks order.
4) Updated the date grammar so that it no longer processes inputs containing only two of the three components year, month, and day. For example, 2002/02 and 02/11 are not accepted. The reason is that these input formats are not ideal according to the national guideline (http://www.zgzlyx.com/uploadfile/news_images/zlyx/2022-05-26/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%9B%BD%E5%AE%B6%E6%A0%87%E5%87%86%E2%80%94%E2%80%94%E5%87%BA%E7%89%88%E7%89%A9%E4%B8%8A%E6%95%B0%E5%AD%97%E7%94%A8%E6%B3%95%EF%BC%88GB%EF%BC%8FT%2015835-2011%EF%BC%89.pdf).
5) Updated the time grammar to cover expressions in the 'hour:minute:second' format. The grammar can also process inputs like "5点6分". A further update handles time expressions that denote a duration, for example "五个小时", "5秒钟", and "五个钟头".
6) Updated the fraction grammar to cover percentages, for example "50%" or "百分之五十".
7) Updated the money grammar to process expressions involving units like "块", "毛", and "分", as well as large amounts written in decimal format, for example "1.5万美元". It also covers expressions where the currency is written in Mandarin rather than as a symbol, for example "¥15" vs. "15人民币".
8) Measure, math, and preprocessor are not included. After discussion with the team, the plan is to align the classes with the ITN grammar, so only cardinal, ordinal, fraction, decimal, time, date, and money are included.
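To make the relationship between the cardinal, ordinal, and percentage rules concrete, here is a minimal plain-Python sketch. It is not the WFST grammar from this PR: the function names (`cardinal`, `ordinal`, `percent`) are hypothetical, and the choices about inserting "零" and shortening a leading "一十" to "十" are assumptions about common Mandarin readings, not confirmed behaviour of the grammars.

```python
# Illustrative only: plain Python, not the pynini grammars in this PR.
DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]
BIG_UNITS = ["", "万", "亿"]  # four-digit groups: ones, ten-thousands, hundred-millions


def _group(n: int) -> str:
    """Verbalize 1..9999, inserting 零 for skipped inner positions."""
    out, zero_pending = "", False
    for pos in range(3, -1, -1):
        digit = (n // 10 ** pos) % 10
        if digit == 0:
            zero_pending = bool(out)  # only matters after a non-zero digit
        else:
            if zero_pending:
                out += "零"
                zero_pending = False
            out += DIGITS[digit] + SMALL_UNITS[pos]
    return out


def cardinal(n: int) -> str:
    """Verbalize 0 <= n < 10**12, i.e. up to the hundred billions (千亿)."""
    if not 0 <= n < 10 ** 12:
        raise ValueError("out of the range covered by this sketch")
    if n == 0:
        return "零"
    groups = []
    while n:
        groups.append(n % 10000)
        n //= 10000
    out, skipped = "", False
    for i in range(len(groups) - 1, -1, -1):
        g = groups[i]
        if g == 0:
            skipped = True
            continue
        if out and (g < 1000 or skipped):
            out += "零"  # bridge a gap between groups
        out += _group(g) + BIG_UNITS[i]
        skipped = False
    # Assumption: a leading 一十 is read as 十 (e.g. 15 -> 十五).
    return out[1:] if out.startswith("一十") else out


def ordinal(text: str) -> str:
    """Reuse the cardinal reading after the ordinal morpheme 第."""
    if not text.startswith("第") or not text[1:].isdigit():
        raise ValueError("expected 第 followed by digits")
    return "第" + cardinal(int(text[1:]))


def percent(text: str) -> str:
    """Verbalize an integer percentage such as 50% as 百分之五十."""
    if not text.endswith("%") or not text[:-1].isdigit():
        raise ValueError("expected digits followed by %")
    return "百分之" + cardinal(int(text[:-1]))
```

The point of the sketch is the layering: `ordinal` and `percent` add only a fixed prefix and delegate the number reading to `cardinal`, which mirrors how item 3 describes the ordinal grammar as built on the cardinal grammar.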

One-line overview: update the cardinal, ordinal, decimal, fraction, time, date, and money grammars, and remove the math, measure, and preprocessor components.
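The date rule in item 4 of the description (accept only inputs with year, month, and day all present) could be sketched as follows. This is a plain-Python illustration under stated assumptions, not the grammar itself: `verbalize_date` and `small_number` are hypothetical names, the slash-separated input format is taken from the examples in the description, and reading the year digit by digit with "零" is an assumption.

```python
import re

DIGITS = "零一二三四五六七八九"
# Assumption: a full date is written as YYYY/M/D, as in the PR's examples.
FULL_DATE = re.compile(r"(\d{4})/(\d{1,2})/(\d{1,2})")


def small_number(n: int) -> str:
    """Verbalize 1..31, enough for months and days."""
    tens, ones = divmod(n, 10)
    if tens == 0:
        prefix = ""
    elif tens == 1:
        prefix = "十"
    else:
        prefix = DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


def verbalize_date(text: str):
    """Return the spoken form of a full date, or None if it is rejected."""
    match = FULL_DATE.fullmatch(text)
    if match is None:
        return None  # partial dates such as 2002/02 or 02/11 are not accepted
    year, month, day = match.group(1), int(match.group(2)), int(match.group(3))
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    year_spoken = "".join(DIGITS[int(ch)] for ch in year)  # digit by digit
    return f"{year_spoken}年{small_number(month)}月{small_number(day)}日"
```

With this sketch, `verbalize_date("2002/02/11")` yields a full year-month-day reading, while `verbalize_date("2002/02")` and `verbalize_date("02/11")` return `None`, matching the two rejected examples in the description.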

yzhang123 commented 1 year ago

@fayejf could you pls review this PR?

fayejf commented 1 year ago

@yzhang123 Sure!

github-actions[bot] commented 1 year ago

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This PR was closed because it has been inactive for 7 days since being marked as stale.

BuyuanCui commented 12 months ago

Because of rebase and conflict issues, I resubmitted this PR at https://github.com/NVIDIA/NeMo-text-processing/pull/89.

ekmb commented 12 months ago

@BuyuanCui could this be closed?

github-actions[bot] commented 11 months ago

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] commented 11 months ago

This PR was closed because it has been inactive for 7 days since being marked as stale.