Closed appledora closed 2 years ago
requested review from @geohci
added 2 commits
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 17
thanks for adding this documentation! this makes me realize that we probably want to adhere to mwparserfromhell names where possible to make usage as consistent as possible. for wikilink, they have a title and text attributes that we could mirror (and then obviously we'll have a bunch of attributes not in mwparserfromhell). https://github.com/earwig/mwparserfromhell/blob/7738752b016b4c8f4dd3b381ca125724ac2a3af7/src/mwparserfromhell/nodes/wikilink.py#L28
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 21
to be more clear "...if the wikilink leads to a disambiguation page"
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 24
to be more clear: "...if the wikilink was transcluded onto the page."
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 51
I think this should actually be something like self.bare (enwiki vocabulary I've seen) or self.autolinked (HTML spec vocabulary) as I think it captures links that don't have brackets.
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 26
My inclination is to not include attributes like this that don't contain new information -- i.e. are derived directly from a combination of other attributes. Open to thoughts about the pros/cons: in some cases, having a shortcut can be worth the overhead if it's widely used/understood but I'm not sure in this case if we know what specific combinations of attributes folks will care about.
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/elements.py line 78
what other things do we see in the about
property? should this be a stricter if html_string["about"].startswith("mwt"):
to avoid false positives and be slightly more efficient? or is the startswith assumption not always true?
In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/utils.py line 171
low priority but a few quick thoughts:
type
is a built-in so we probably don't want to use that either.type
parameter being used?In GitLab by @geohci on Jul 1, 2022, 24:11
Commented on src/parse/article.py line 76
can you add a test case or two to test the new function? as we expand out the library, keeping tests and classes/functions aligned will be more important
yeah, I will create an issue for it and work on it separately? what do you think?
possibly, auto imported by my ide. removing this.
Usually, the about
attribute appears like this about="#mwtn"
in the HTML where n is a numerical value. So, making it stricter with .startswith()
makes sense.
yes, should definitely change the type
param and also rename the function. I noticed that, link/title patterns are different for different element (different prefixes, different space separators etc.). Primarily, I only identified for Category and Template, and wondering whether other different patterns for other different elements also exist and normalize the titles/links based on that separately?
so instead of self.standard
, we use self.bare/autolinked
, right?
changed this line in version 6 of the diff
changed this line in version 7 of the diff
changed this line in version 8 of the diff
changed this line in version 9 of the diff
an updated version of this method is actually present on the other MR xD
changed this line in version 10 of the diff
Good idea, honestly. I should have kept this in my mind. I will go through this, before making new changes.
would it be odd if I just don't mention one attribute i.e: standard
in the docstring?
In GitLab by @geohci on Jul 1, 2022, 04:28
Commented on src/parse/elements.py line 51
yep -- i don't have a strong preference but maybe autolinked is more descriptive?
In GitLab by @geohci on Jul 1, 2022, 04:29
Commented on src/parse/elements.py line 26
sorry, i commented on the docstring but i intended to ask more broadly (keep it as an attribute or not)
In GitLab by @geohci on Jul 1, 2022, 04:30
Commented on src/parse/article.py line 76
yep, that's fine by me. thanks!
Ohhh... I think I get it now. That Wikilinks (or similar elements) should be by default "standard", while their other states or attributes may change. Users would expect them to be normal by default, but their interests might lie in whether such links are transcluded or point to a disambiguation link. Am I going in the right direction with this?
changed this line in version 12 of the diff
In GitLab by @geohci on Jul 1, 2022, 20:47
Commented on src/parse/elements.py line 26
yep! this isn't a hard rule that everything has a default that we don't make explicit -- e.g., technically if an external link is neither numbered or named, you know it's autolinked. But it's still useful to have autolinked as an attribute because the user shouldn't need to know that there are only three types of external link formats etc. but my feeling with this one is that we can leave it as implicit that a "standard" blue link is the combination/absence of a few explicit properties.
In GitLab by @geohci on Jul 1, 2022, 20:47
Commented on src/parse/elements.py line 17
thanks -- do you want to do this as part of this merge request or do it more broadly in a future MR? if the former, let me know and we can probably merge this.
Tbh, the change is really small so i can fix it right away. But the only issue being, for wikilinks, would there be any difference between a title
and the text
?
Although, in the mwparserfromhell
, for the text
attribute they do mention "if any". So we can take the same approach, i guess?
resolved all threads
In GitLab by @geohci on Jul 1, 2022, 21:44
Commented on src/parse/elements.py line 8
i'd take this out -- we can do the strip() part when we convert to plaintext but here we should preserve the raw data
In GitLab by @geohci on Jul 1, 2022, 21:44
Commented on src/parse/elements.py line 12
a few things for these functions:
changed this line in version 15 of the diff
changed this line in version 15 of the diff
Merges 13-extlink-extraction -> main
Closes #13