Closed BuyuanCui closed 11 months ago
@fayejf could you pls review this PR?
@yzhang123 Sure!
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Re-ran this PR at https://github.com/NVIDIA/NeMo-text-processing/pull/89 because of rebase and conflict issues.
@BuyuanCui could this be closed?
What does this PR do?
This PR updates the ZH_TN grammars from an outside contributor. The existing grammars are updated and aligned to stay consistent with the ZH_ITN grammar.
1) The cardinal grammar is split into two grammars, cardinal and decimal, with decimal as an independent class.
2) Cardinal grammar coverage is increased up to hundred billion.
3) Added an ordinal grammar that builds on the cardinal grammar by processing "第", the morpheme that indicates order.
4) Updated the date grammar so that it no longer processes inputs containing only two of the three components year, month, and day. For example, 2002/02 and 02/11 are not accepted, because these input formats are not ideal according to the national guideline (http://www.zgzlyx.com/uploadfile/news_images/zlyx/2022-05-26/%E4%B8%AD%E5%8D%8E%E4%BA%BA%E6%B0%91%E5%85%B1%E5%92%8C%E5%9B%BD%E5%9B%BD%E5%AE%B6%E6%A0%87%E5%87%86%E2%80%94%E2%80%94%E5%87%BA%E7%89%88%E7%89%A9%E4%B8%8A%E6%95%B0%E5%AD%97%E7%94%A8%E6%B3%95%EF%BC%88GB%EF%BC%8FT%2015835-2011%EF%BC%89.pdf).
5) Updated the time grammar to cover expressions in the format 'hour:minute:second'. The grammar can also process inputs like "5点6分". Another update handles expressions that denote a duration, for example "五个小时", "5秒钟", and "五个钟头".
6) Updated the fraction grammar to cover percentages, for example "50%" or "百分之五十".
7) Updated the money grammar to process expressions involving units like "块", "毛", and "分", as well as large amounts expressed in decimal format, for example "1.5万美元". It also covers expressions where the currency is written in Mandarin rather than as a symbol, for example "¥15" vs. "15人民币".
8) Measure, math, and preprocessor are not included. After discussion with the team, the plan is to align the classes with the ITN grammar, so only cardinal, ordinal, fraction, decimal, time, date, and money are included.
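To make items 2, 3, and 6 concrete, here is a minimal plain-Python sketch of the input/output behavior they describe. It is not the PR's code: the actual grammars are written as WFSTs, and the function names `cardinal`, `ordinal`, and `percent` are hypothetical. It also assumes two conventions that the PR text does not specify: 10 is read as "十" rather than "一十", and 2 is always read as "二".

```python
DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside a 4-digit group
BIG_UNITS = ["", "万", "亿"]           # units between 4-digit groups


def _group(n: int) -> str:
    """Verbalize 1..9999; interior zeros collapse to a single 零."""
    out, zero_pending = "", False
    for pos in range(3, -1, -1):
        d = (n // 10 ** pos) % 10
        if d == 0:
            zero_pending = bool(out)  # only matters after a non-zero digit
            continue
        if zero_pending:
            out += "零"
            zero_pending = False
        out += DIGITS[d] + SMALL_UNITS[pos]
    return out


def cardinal(n: int) -> str:
    """Verbalize 0 <= n < 10**12, i.e. up to the hundred-billion range."""
    if not 0 <= n < 10 ** 12:
        raise ValueError("out of the range covered by this sketch")
    if n == 0:
        return "零"
    groups = []
    while n:
        groups.append(n % 10000)
        n //= 10000
    out, need_zero = "", False
    for i in range(len(groups) - 1, -1, -1):
        g = groups[i]
        if g == 0:
            need_zero = bool(out)
            continue
        # A skipped group or a group below 1000 needs a linking 零.
        if out and (need_zero or g < 1000):
            out += "零"
        need_zero = False
        out += _group(g) + BIG_UNITS[i]
    # Assumed convention: drop the leading 一 in 一十 (10 -> 十, 100000 -> 十万).
    return out[1:] if out.startswith("一十") else out


def ordinal(text: str) -> str:
    """'第' + digits -> '第' + verbalized cardinal."""
    if not (text.startswith("第") and text[1:].isdigit()):
        raise ValueError("expected '第' followed by digits")
    return "第" + cardinal(int(text[1:]))


def percent(text: str) -> str:
    """'50%' -> '百分之五十' (integer percentages only in this sketch)."""
    if not (text.endswith("%") and text[:-1].isdigit()):
        raise ValueError("expected digits followed by '%'")
    return "百分之" + cardinal(int(text[:-1]))
```

For example, `ordinal("第5")` yields "第五" and `percent("50%")` yields "百分之五十". Reusing `cardinal` inside both mirrors the point of item 3 that the ordinal grammar is built on top of the cardinal grammar.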
Add a one line overview of what this PR aims to accomplish.
Update the cardinal, ordinal, decimal, fraction, time, date, and money grammars, and remove the math, measure, and preprocessor grammars.
Before your PR is "Ready for review"
Pre checks:
Use git commit -s to sign.
1) pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
2) Sparrowhawk tests: bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
Test cases for both pytest and Sparrowhawk.
__init__.py for every folder and subfolder, including the data folder which has .TSV files?
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
Copyright 2015 and onwards Google, Inc.
(try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.