appledora / mwparserfromhtml

An unofficial mirror of our repo of the `mwparserfromhtml` package. It is a python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.
https://pypi.org/project/mwparserfromhtml/
MIT License
4 stars 0 forks source link

feature: extract external links - [merged] #51

Closed appledora closed 2 years ago

appledora commented 2 years ago

Merges 13-extlink-extraction -> main

Closes #13

appledora commented 2 years ago

requested review from @geohci

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 2 commits

Compare with previous version

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 17

thanks for adding this documentation! this makes me realize that we probably want to adhere to mwparserfromhell names where possible to make usage as consistent as possible. for wikilink, they have a title and text attributes that we could mirror (and then obviously we'll have a bunch of attributes not in mwparserfromhell). https://github.com/earwig/mwparserfromhell/blob/7738752b016b4c8f4dd3b381ca125724ac2a3af7/src/mwparserfromhell/nodes/wikilink.py#L28

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 21

to be more clear "...if the wikilink leads to a disambiguation page"

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 24

to be more clear: "...if the wikilink was transcluded onto the page."

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 51

I think this should actually be something like self.bare (enwiki vocabulary I've seen) or self.autolinked (HTML spec vocabulary) as I think it captures links that don't have brackets.

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 26

My inclination is to not include attributes like this that don't contain new information -- i.e. are derived directly from a combination of other attributes. Open to thoughts about the pros/cons: in some cases, having a shortcut can be worth the overhead if it's widely used/understood but I'm not sure in this case if we know what specific combinations of attributes folks will care about.

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/elements.py line 78

what other things do we see in the about property? should this be a stricter if html_string["about"].startswith("mwt"): to avoid false positives and be slightly more efficient? or is the startswith assumption not always true?

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/utils.py line 171

low priority but a few quick thoughts:

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/article.py line 76

can you add a test case or two to test the new function? as we expand out the library, keeping tests and classes/functions aligned will be more important

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 24:11

Commented on src/parse/article.py line 1

what's this do?

appledora commented 2 years ago

yeah, I will create an issue for it and work on it separately? what do you think?

appledora commented 2 years ago

possibly, auto imported by my ide. removing this.

appledora commented 2 years ago

Usually, the about attribute appears like this about="#mwtn" in the HTML where n is a numerical value. So, making it stricter with .startswith() makes sense.

appledora commented 2 years ago

yes, should definitely change the type param and also rename the function. I noticed that, link/title patterns are different for different element (different prefixes, different space separators etc.). Primarily, I only identified for Category and Template, and wondering whether other different patterns for other different elements also exist and normalize the titles/links based on that separately?

appledora commented 2 years ago

so instead of self.standard, we use self.bare/autolinked, right?

appledora commented 2 years ago

changed this line in version 6 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

changed this line in version 7 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

changed this line in version 8 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

changed this line in version 9 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

an updated version of this method is actually present on the other MR xD

appledora commented 2 years ago

changed this line in version 10 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

Good idea, honestly. I should have kept this in my mind. I will go through this, before making new changes.

appledora commented 2 years ago

would it be odd if I just don't mention one attribute i.e: standard in the docstring?

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 04:28

Commented on src/parse/elements.py line 51

yep -- i don't have a strong preference but maybe autolinked is more descriptive?

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 04:29

Commented on src/parse/elements.py line 26

sorry, i commented on the docstring but i intended to ask more broadly (keep it as an attribute or not)

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 04:30

Commented on src/parse/article.py line 76

yep, that's fine by me. thanks!

appledora commented 2 years ago

Ohhh... I think I get it now. That Wikilinks (or similar elements) should be by default "standard", while their other states or attributes may change. Users would expect them to be normal by default, but their interests might lie in whether such links are transcluded or point to a disambiguation link. Am I going in the right direction with this?

appledora commented 2 years ago

changed this line in version 12 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 20:47

Commented on src/parse/elements.py line 26

yep! this isn't a hard rule that everything has a default that we don't make explicit -- e.g., technically if an external link is neither numbered or named, you know it's autolinked. But it's still useful to have autolinked as an attribute because the user shouldn't need to know that there are only three types of external link formats etc. but my feeling with this one is that we can leave it as implicit that a "standard" blue link is the combination/absence of a few explicit properties.

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 20:47

Commented on src/parse/elements.py line 17

thanks -- do you want to do this as part of this merge request or do it more broadly in a future MR? if the former, let me know and we can probably merge this.

appledora commented 2 years ago

Tbh, the change is really small so i can fix it right away. But the only issue being, for wikilinks, would there be any difference between a title and the text?

appledora commented 2 years ago

Although, in the mwparserfromhell, for the text attribute they do mention "if any". So we can take the same approach, i guess?

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

resolved all threads

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 21:44

Commented on src/parse/elements.py line 8

i'd take this out -- we can do the strip() part when we convert to plaintext but here we should preserve the raw data

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 21:44

Commented on src/parse/elements.py line 12

a few things for these functions:

appledora commented 2 years ago

changed this line in version 15 of the diff

appledora commented 2 years ago

changed this line in version 15 of the diff