CederGroupHub / LimeSoup

LimeSoup is a package to parse HTML or XML papers from different publishers.
MIT License
19 stars, 7 forks

Removing reference numbers from the text #6

Closed OlgaGKononova closed 5 years ago

OlgaGKononova commented 6 years ago

The ECS-parsed text still contains reference numbers. Please remove them. The same applies to all other parsers.

hhaoyan commented 6 years ago

Yes, I've noticed that the parser sometimes generates words like "objects11", which causes problems. It would be better to remove them. #4

tiagobotari commented 6 years ago

Can you find these numbers for RSC? What about cases where the text refers directly to the number, e.g. "As demonstrated in [2]"? Is it wise to remove it?

nicolas-mng commented 6 years ago

When you say removing references, are you referring to strings such as "(2)" and "[10]"?

tiagobotari commented 6 years ago

Yes

hhaoyan commented 6 years ago

I don't think they are helpful for our tasks, so I think it's good to remove them all (any kind of reference).
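
For strings such as "(2)" and "[10]", a regular expression is one possible approach. The following is only a minimal sketch, not the code LimeSoup uses: the pattern and the function name `strip_numeric_citations` are assumptions made for illustration. It removes integers (optionally comma- or dash-separated) wrapped in round or square brackets, together with any whitespace in front of them. It cannot safely handle fused markers such as "objects11", and it would also remove a parenthesized number that is not a citation.

```python
import re

# Hypothetical pattern: one or more integers separated by commas or
# dashes, e.g. "2", "10", "3-5", "1, 2".
_NUMS = r"\d+(?:\s*[,-]\s*\d+)*"

# A citation marker is such a number list in [] or (), plus leading whitespace.
CITATION = re.compile(r"\s*(?:\[" + _NUMS + r"\]|\(" + _NUMS + r"\))")


def strip_numeric_citations(text: str) -> str:
    """Remove bracketed numeric citation markers and the space before them."""
    return CITATION.sub("", text)


print(strip_numeric_citations("As shown before [10], the film is stable (2)."))
# prints: As shown before, the film is stable.
```

Because this works on plain text, it has no way to tell a citation from other bracketed numbers; removing references at the markup level avoids that ambiguity.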

tiagobotari commented 6 years ago

OK, thanks! We are going to try to remove them.

shaunrong commented 6 years ago

Usually, references are enclosed in special HTML tags. In the case of ECS, these tags are "<a ... class='xref-bibr' ..>". In the case of "As demonstrated in [2]", however, [2] is not marked up with a \ tag. Since there are so few such cases, I would advocate removing only the references within "\" for now. Especially in a case like "as demonstrated in [2], XXX can be synthesized with ...", losing such links may not be desirable. If we want to handle this case, we could build another ingredient that replaces <a ... class='xref-bibr' ..> elements without \ tags with a "PAPER" keyword and records the link in the JSON output.
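
To make the two options concrete, here is a minimal sketch that either drops `<a class="xref-bibr">` anchors or replaces them with a placeholder such as "PAPER" while recording each link. It is an illustration under stated assumptions, not LimeSoup's implementation: it uses only Python's standard-library `html.parser` to stay self-contained, the names `XrefBibrFilter` and `filter_references` and the output keys `text` and `references` are hypothetical, and it returns plain text with all other markup discarded.

```python
import json
from html.parser import HTMLParser


class XrefBibrFilter(HTMLParser):
    """Collect text while dropping or replacing <a class="xref-bibr"> anchors."""

    def __init__(self, placeholder=""):
        super().__init__()
        self.placeholder = placeholder  # e.g. "PAPER"; "" removes the marker
        self.chunks = []                # pieces of the output text
        self.links = []                 # href of every reference anchor seen
        self._in_ref = False            # True while inside a reference anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "xref-bibr" in (attrs.get("class") or "").split():
            self._in_ref = True
            self.links.append(attrs.get("href"))
            self.chunks.append(self.placeholder)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_ref = False

    def handle_data(self, data):
        # Text inside a reference anchor (the number itself) is skipped.
        if not self._in_ref:
            self.chunks.append(data)


def filter_references(html, placeholder=""):
    """Return the filtered text and the recorded reference links (hypothetical format)."""
    parser = XrefBibrFilter(placeholder)
    parser.feed(html)
    parser.close()
    return {"text": "".join(parser.chunks), "references": parser.links}


# Made-up input fragments, for illustration only.
removed = filter_references(
    "<p>objects<a class='xref-bibr' href='#ref-11'>11</a> are stable</p>"
)
replaced = filter_references(
    "<p>As demonstrated in <a class='xref-bibr' href='#ref-2'>[2]</a>, "
    "XXX can be synthesized.</p>",
    placeholder="PAPER",
)
print(json.dumps(removed))
print(json.dumps(replaced))
```

With an empty placeholder the marker disappears, which suits fused cases like "objects11"; with "PAPER" the sentence keeps a stand-in for the cited work and the link survives in the output.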

Also, @tiagobotari @nicolas-mingione, it would be useful to document which commits fix the issue: https://help.github.com/articles/closing-issues-using-keywords/

tiagobotari commented 6 years ago

Thank you, Nicolas. This could be an issue. Can you test that and report back?

nicolas-mng commented 6 years ago

My bad, there actually is a tag... I'm submitting a new version today.

tiagobotari commented 6 years ago

Good, thanks.

OlgaGKononova commented 6 years ago

I also suggest not storing references as paragraphs at all.

hhaoyan commented 5 years ago

Solved in 087f7a8386821d07eca112072826468c68f934e6. Closing issue.