JiaquanYe / TableMASTER-mmocr

2nd solution of ICDAR 2021 Competition on Scientific Literature Parsing, Task B.
Apache License 2.0

Cell content contains tags such as "<i> </i><strike><overline>"; why are they not filtered out? The code only filters <b></b>. Do the others not need handling? #19

Open cqray1990 opened 3 years ago

cqray1990 commented 3 years ago

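The cell content contains tags such as "<i> </i><strike><overline>". Why are they not filtered out? Looking at the code, only <b></b> is filtered. Do the other tags not need handling?

    def remove_Bb(self, content):
        """
        This function will remove the '<b>' and '</b>' of the content.
        :param content: [list]. text content of each cell.
        :return: text content without '<b>' and '</b>'.
        """
        if '<b>' in content:
            content.remove('<b>')
        if '</b>' in content:
            content.remove('</b>')
        return content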

delveintodetail commented 3 years ago

In the competition, we tried every rule we could think of to improve performance. Some useful rules were missed because we spent less than two months on the competition.

JiaquanYe commented 3 years ago

As we know, the content in thead is annotated with "<b></b>" whether or not the text in the image is bold, so we need to filter these tags out for the text-line images of the thead. If "<b></b>" appears in the remaining tbody content, the text in the image really is bold. The patterns you mention above are unlikely to cause this kind of ambiguity, so we do not filter them. Of course, you can modify the post-processing rules to get a higher score.