Closed by appledora 2 years ago
requested review from @martingerlach
added 4 commits
main
In GitLab by @martingerlach on Jul 15, 2022, 17:02
Commented on src/parse/elements.py line 126
Can we also capture the content of the reference? This could be just the raw HTML between the opening and closing tags, which would give the user the option to process the reference further without any prefiltering on our side.
added 1 commit
This would actually be captured automatically by the self.plaintext attribute in the base Elements class. However, I just pushed a change that does some preprocessing on this text (replacing multiple whitespace characters with a single space, following https://gitlab.wikimedia.org/repos/research/copyedit/-/blob/main/utils.py#L160).
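For readers following along, the whitespace preprocessing mentioned above could look roughly like the sketch below. The function name collapse_whitespace is hypothetical, and this is not the code from the linked utils.py; it only illustrates replacing runs of whitespace with a single space.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space.

    Hypothetical helper for illustration; not the project's actual function.
    """
    return re.sub(r"\s+", " ", text)


if __name__ == "__main__":
    raw = "Smith,  J.\n\t(2020).   Some title."
    print(collapse_whitespace(raw))
```

Leading and trailing whitespace is reduced to a single space rather than removed; adding .strip() would be a separate choice.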
In GitLab by @martingerlach on Jul 15, 2022, 18:35
Commented on src/parse/elements.py line 126
OK, that makes sense. I missed that the base class already captures that. In this case, the user could check self.html_string to investigate the content (e.g. if they wanted to extract links in the reference)?
Yes, they will have access to both the html_string and the plaintext (and all the other base class attributes too). So, if there's a link, they should be able to extract it.
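As a rough illustration of that use case, a user could pull links out of a reference's raw HTML along these lines. This is a minimal sketch using only the standard library; the LinkCollector class, the extract_links helper, and the sample HTML are all made up for the example, and the only thing taken from this thread is that the element exposes its raw HTML as html_string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_string: str) -> list:
    """Return all link targets found in a fragment of HTML (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html_string)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Invented sample standing in for a reference element's html_string.
    sample = (
        '<span class="mw-reference-text">See '
        '<a href="https://example.org/paper">the paper</a>.</span>'
    )
    print(extract_links(sample))
```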
In GitLab by @martingerlach on Jul 15, 2022, 18:45
Commented on src/parse/elements.py line 126
ok, perfect then.
In GitLab by @martingerlach on Jul 15, 2022, 18:45
resolved all threads
In GitLab by @martingerlach on Jul 15, 2022, 20:16
approved this merge request
In GitLab by @martingerlach on Jul 15, 2022, 20:16
mentioned in commit b7decbd6775a94fb26a1f8771f85c8e0965a3573
Merges 28-reference-extraction -> main
References are identified by looking for {"class": "mw-reference-text"} attributes inside <span> tags. We also store the id of references, which can help track the position where the reference was used.

Closes #28
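
The identification rule described above can be sketched as follows. This is an illustrative standard-library version, not the code merged in this request: the ReferenceFinder class, the find_references helper, and the sample HTML are assumptions, as is the id sitting on the same <span> as the class.

```python
from html.parser import HTMLParser


class ReferenceFinder(HTMLParser):
    """Collect the id and text of <span> tags whose class includes mw-reference-text."""

    def __init__(self):
        super().__init__()
        self.references = []  # one {"id": ..., "text": ...} dict per reference
        self._current = None  # reference being read, or None when outside one
        self._depth = 0       # open <span> tags inside the current reference

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            # Track nested spans so the reference ends at the matching close tag.
            if tag == "span":
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "mw-reference-text" in classes:
            self._current = {"id": attrs.get("id"), "text": ""}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.references.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def find_references(html: str) -> list:
    """Return the references found in an HTML string (hypothetical helper)."""
    finder = ReferenceFinder()
    finder.feed(html)
    finder.close()
    return finder.references


if __name__ == "__main__":
    # Invented sample markup for the example.
    sample = (
        '<ol><li><span id="mw-reference-text-cite_note-1" class="mw-reference-text">'
        'Doe, J. <span class="title">Title</span>. 2020.</span></li></ol><p>other</p>'
    )
    print(find_references(sample))
```

Keeping the id next to the text lets a caller match each reference back to where it was cited.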