sif4sci may return None. Similarly, GensimWordTokenizer may also return None.
Error Message
(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=100 before running your script.)
To Reproduce
(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
Steps to reproduce
(Paste the commands you ran that produced the error.)
import json

from EduNLP.SIF import sif4sci, is_sif, to_sif
from EduNLP.Pretrain import GensimWordTokenizer  # missing in the original snippet


def load_items2():
    # OpenLUNA.json is read as one JSON object per line
    items = []
    with open("OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            items.append(json.loads(line))
    return items


items = load_items2()
# ----------------------------------------- #
tokenization_params1 = {
    "formula_params": {
        "method": "linear",
        "symbolize_figure_formula": True
    }
}
tokenizer = GensimWordTokenizer(symbol="fgm")
# ----------------------------------------- #
wrong_num = 0
for item in items:
    res = sif4sci(item["stem"], symbol="gm", tokenization_params=tokenization_params1, errors="ignore")
    # res = tokenizer(item["stem"])
    if res is None:
        wrong_num += 1
print(f"There are {wrong_num} / {len(items)} wrong cases!")
# There are 156 / 792 wrong cases!
What have you tried to solve it?
I figured out that this is caused by the way we handle raised errors, which is "ignore" in GensimWordTokenizer.
However, when I looked at the specific errors, I found that one main type is related to the SIF Parser. So I wonder whether we need to handle this problem.
For example, the Parser cannot identify "n=" and "p=".
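To see which error types hide behind the None results, one option is to let the errors propagate and count them by type. The following is only a minimal sketch: `collect_failures` and `fake_tokenize` are hypothetical names introduced for illustration, and `fake_tokenize` is a stand-in that merely imitates a parser failing on "n=" and "p="; it does not reproduce EduNLP's actual behaviour.

```python
from collections import Counter


def collect_failures(tokenize, texts):
    """Apply tokenize to each text and count failures by reason."""
    results = []
    failures = Counter()
    for text in texts:
        try:
            res = tokenize(text)
        except Exception as exc:
            # Record the exception type instead of silently ignoring it
            failures[type(exc).__name__] += 1
            results.append(None)
            continue
        if res is None:
            # The tokenizer swallowed an error and returned None itself
            failures["returned None"] += 1
        results.append(res)
    return results, failures


def fake_tokenize(text):
    # Hypothetical stand-in for the real tokenizer (assumption, not EduNLP code)
    if "n=" in text or "p=" in text:
        raise ValueError("cannot parse: " + text)
    return text.split()


stems = ["solve x + 1", "n=5 trials", "p=0.5"]
results, failures = collect_failures(fake_tokenize, stems)
print(failures)
```

With the real library, the callable could wrap the sif4sci call from the script above, for instance with errors="raise" instead of errors="ignore". That this value makes the parser errors propagate is an assumption here and should be checked against the EduNLP version in use.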
Environment
Operating System: Windows
Python Version: Python 3.6
Additional context