appledora / mwparserfromhtml

An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, DO NOT open PRs here.
https://pypi.org/project/mwparserfromhtml/
MIT License

Resolve "add function to extract references to library" - [merged] #55

Closed appledora closed 2 years ago

appledora commented 2 years ago

Merges 28-reference-extraction -> main

References are identified by looking for <span> tags that carry the attribute {"class": "mw-reference-text"}. We also store the id of each reference, which can help track the position where the reference was used.
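As a rough illustration of the lookup described above, here is a minimal sketch using only the standard library's `html.parser`. It is not the code from this merge request: the names `ReferenceCollector` and `extract_references` are hypothetical, and the sample HTML in the usage note is made up for illustration.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collect the id and text of every <span class="mw-reference-text">."""

    def __init__(self):
        super().__init__()
        self.references = []  # finished references, one dict each
        self._current = None  # reference currently being read, if any
        self._depth = 0       # <span> nesting depth inside that reference

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            # Already inside a reference: only track nested <span> tags
            # so we know which </span> closes the reference itself.
            if tag == "span":
                self._depth += 1
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "span" and "mw-reference-text" in classes:
            self._current = {"id": attr.get("id"), "text": []}
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None or tag != "span":
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._current["text"])
            self.references.append({"id": self._current["id"], "text": text})
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def extract_references(html):
    """Return a list of {"id": ..., "text": ...} dicts (hypothetical helper)."""
    parser = ReferenceCollector()
    parser.feed(html)
    parser.close()
    return parser.references
```

For example, `extract_references('<span class="mw-reference-text" id="ref-1">Some source.</span>')` yields one entry whose `id` is `"ref-1"`. The stored id is what would let a caller match the reference back to the place it was cited.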

Closes #28

appledora commented 2 years ago

requested review from @martingerlach

appledora commented 2 years ago

added 4 commits

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 17:02

Commented on src/parse/elements.py line 126

Can we also catch the content of the reference? This could be just the raw HTML between the reference's opening and closing tags, so that the user can process the reference further without any prefiltering from our side.

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

This would actually be captured automatically by the self.plaintext attribute in the base Elements class. That said, I just pushed a commit that preprocesses this text by replacing multiple whitespaces with a single whitespace, following https://gitlab.wikimedia.org/repos/research/copyedit/-/blob/main/utils.py#L160.
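The whitespace preprocessing mentioned above could look roughly like the sketch below. This is an assumption about the general technique, not a copy of the linked `utils.py` or of the pushed commit; the function name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)
```

Note that this version leaves a single leading or trailing space in place if the input started or ended with whitespace; whether the actual code also strips the ends is not stated in this thread.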

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 18:35

Commented on src/parse/elements.py line 126

Ok, that makes sense. I missed that there is the base class which captures that. In this case the user could check self.html_string to investigate the content (e.g. if they wanted to extract links in the reference)?

appledora commented 2 years ago

Yes, they will have access to both the html_string and the plaintext (and all the other base class attributes too). So, if there's a link, they should be able to extract it.
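To make the link-extraction case concrete, here is a hedged sketch of what a user might do with a reference's `html_string`. Only the attribute name `html_string` comes from this thread; `LinkCollector` and `extract_links` are hypothetical helpers built on the standard library, and the function simply takes the HTML as a plain string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html_string):
    """Return all href values found in the given HTML, in document order."""
    parser = LinkCollector()
    parser.feed(html_string)
    parser.close()
    return parser.links
```

A caller would pass the reference element's `html_string` value to `extract_links` and get back the list of link targets it contains.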

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 18:45

Commented on src/parse/elements.py line 126

ok, perfect then.

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 18:45

resolved all threads

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 20:16

approved this merge request

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 15, 2022, 20:16

mentioned in commit b7decbd6775a94fb26a1f8771f85c8e0965a3573