bigdata-ustc / EduNLP

A library for advanced Natural Language Processing towards multi-modal educational items.
Apache License 2.0

[Bug] some hidden error when using sif4sci or GensimWordTokenizer #114

Open KenelmQLH opened 2 years ago

KenelmQLH commented 2 years ago

🐛 Description

sif4sci may return None. Similarly, GensimWordTokenizer may also return None.

Error Message

To Reproduce

Steps to reproduce

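import json
from EduNLP.SIF import sif4sci, is_sif, to_sif
from EduNLP.Pretrain import GensimWordTokenizer  # needed for the tokenizer created below

def load_items2():
    # Read one JSON object per line from the dataset file.
    items = []
    with open("OpenLUNA.json", encoding="utf-8") as f:
        for line in f:
            items.append(json.loads(line))
    return items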

items = load_items2()

# ----------------------------------------- #
tokenization_params1 = {
  "formula_params": {
    "method": "linear",
    "symbolize_figure_formula": True
  }
}

tokenizer = GensimWordTokenizer(symbol="fgm")

# ----------------------------------------- #
wrong_num = 0
for item in items:  
  res = sif4sci(item["stem"], symbol="gm", tokenization_params=tokenization_params1, errors="ignore")
  # res = tokenizer(item["stem"])

  if res is None:
    wrong_num += 1

print(f"There are {wrong_num} / {len(items)} wrong cases!")
# There are 156 / 792 wrong cases!

What have you tried to solve it?

Actually, I figured out that this is caused by the way we handle raised errors: GensimWordTokenizer uses errors="ignore", so a failing item is silently returned as None.

However, when I looked at the specific errors, I found that one main type comes from the SIF Parser, so I wonder whether we need to handle this problem.
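To see which error types hide behind the None results, one option is to run the tokenizer in a raising mode and tally the exceptions. The following is a minimal sketch, not EduNLP code: `tally_failures` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for a call that raises on bad input.

```python
from collections import Counter


def tally_failures(texts, tokenize):
    """Run `tokenize` on each text; return (results, Counter of exception names)."""
    results, failures = [], Counter()
    for text in texts:
        try:
            results.append(tokenize(text))
        except Exception as exc:  # broad on purpose: we want to count every failure type
            results.append(None)
            failures[type(exc).__name__] += 1
    return results, failures


def toy_tokenize(text):
    # Hypothetical stand-in for a strict tokenizer; it rejects any text containing "=".
    if "=" in text:
        raise ValueError("cannot parse %r" % text)
    return text.split()


results, failures = tally_failures(["a b", "n=", "p=", "c"], toy_tokenize)
print(failures)
```

In the reproduction script, `tokenize` could be a small wrapper around sif4sci, assuming sif4sci offers an error mode that raises instead of ignoring; the issue itself only shows errors="ignore".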

For example, the Parser cannot identify "n=" and "p=".

(1)

# The Chinese stem reads: "Execute the program flowchart on the right; the output is n=", followed by five figure placeholders.
s1 = "执行右面的程序框图,则输出的n=$\\FigureID{3bf20b93-8af1-11eb-b205-b46bfc50aa29}$$\\FigureID{59b88b3f-8af1-11eb-9450-b46bfc50aa29}$$\\FigureID{63116570-8b75-11eb-b694-b46bfc50aa29}$$\\FigureID{6a006177-8b76-11eb-9ac0-b46bfc50aa29}$$\\FigureID{088f15e9-8b7c-11eb-959f-b46bfc50aa29}$"
is_sif(s1)
RecursionError                            Traceback (most recent call last)
<ipython-input-3-a8de420882df> in <module>
     11 
     12 # ----------------------------------------- #
---> 13 is_sif(s1)
     14 
     15 # ----------------------------------------- #

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\sif.py in is_sif(item, check_formula, return_parser)
     50     """
     51     item_parser = Parser(item, check_formula)
---> 52     item_parser.description_list()
     53     if item_parser.fomula_illegal_flag:
     54         raise ValueError(item_parser.fomula_illegal_message)

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description_list(self)
    344         """
    345         # print('call description_list')
--> 346         self.description()
    347         if self.error_flag:
    348             # print("Error")

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in description(self)
    304         #         if self.error_flag:
    305         #             return
--> 306         self.txt_list()
    307         if self.error_flag:
    308             return

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

... last 1 frames repeated, from the frame below ...

e:\workustc\edunlp\workmaster\edunlp\EduNLP\SIF\parser\parser.py in txt_list(self)
    298             return
    299         if self.lookahead != self.empty:
--> 300             self.txt_list()
    301 
    302     def description(self):

RecursionError: maximum recursion depth exceeded in comparison
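
The traceback shows txt_list calling itself for as long as lookahead is not empty. One possible reading, which the traceback does not confirm, is that the lookahead never advances past a token the parser cannot recognize, so the recursion never ends. The sketch below is a toy illustration of a progress guard for that situation; `consume_text` and `parse_text_list` are hypothetical names and do not mirror the actual EduNLP parser.

```python
def consume_text(tokens, pos):
    """Toy rule: consume one token if it is purely alphabetic, else consume nothing."""
    if pos < len(tokens) and tokens[pos].isalpha():
        return pos + 1
    return pos  # unrecognized token: no progress is made


def parse_text_list(tokens):
    """Iterative text-list rule that fails fast when no progress is made."""
    pos = 0
    while pos < len(tokens):
        new_pos = consume_text(tokens, pos)
        if new_pos == pos:
            # Without this guard, a self-recursive rule would call itself forever here.
            raise ValueError("unrecognized token %r at index %d" % (tokens[pos], pos))
        pos = new_pos
    return pos
```

With a guard like this, an input such as ["n", "="] produces a ValueError that names the offending token, rather than a RecursionError.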

Environment

Operating System: Windows

Python Version: Python 3.6

Additional context

tswsxk commented 2 years ago

Yes, I think we should handle it