Closed appledora closed 2 years ago
requested review from @martingerlach
In GitLab by @martingerlach on Jul 1, 2022, 16:43
Commented on src/parse/article.py line 8
Should we also remove this import as we are not using this currently (I think we discussed this in another context too?)
In GitLab by @martingerlach on Jul 1, 2022, 16:47
Commented on src/parse/article.py line 85
in get_wikilinks you are using regex, should we do the same here too, i.e. re.compile("mw:PageProp/Category")? what would be the difference in the output?
In GitLab by @martingerlach on Jul 1, 2022, 17:03
Commented on src/parse/utils.py line 90
I think this will only work in English Wikipedia. While I havent checked the formatting in the HTML-dumps, I am sure the link-title will contain the name of the category-namespace in the corresponding wiki. For example, the category https://en.wikipedia.org/wiki/Category:Perception in English Wikipedia appears as https://de.wikipedia.org/wiki/Kategorie:Wahrnehmung in German Wikipedia.
changed this line in version 3 of the diff
Yes, auto-import by my ide. Thanks for pointing this out!
So the idea here is that, in Wikilink, using regex helped me to also capture the interwikilinks that has the format mw:Wikilink/interwiki
. In case of categories, i had to use a stricter pattern because there is also another rel
value present in the html that looks something like mw:PageProp/Categorysort...
- which we don't want for the object here. I hope I have managed to explain my thoughts here on this :3
This makes sense. What would you suggest here?
In GitLab by @martingerlach on Jul 1, 2022, 22:33
Commented on src/parse/article.py line 85
Thanks for the explanation. That makes sense if the pattern for categories is much more narrowly defined. Then looks good to me.
In GitLab by @geohci on Jul 1, 2022, 22:39
Commented on src/parse/utils.py line 90
i have an issue to add these namespace names to a static file somewhere that can be imported and used for namespace detection / link cleaning so probably will have to wait till that (sorry)
In GitLab by @geohci on Jul 1, 2022, 22:39
now that MR5 is merged, should we rebase this to clean up?
I actually have very little experiences with git-rebase (scared, to be honest). But after MR#5 I was thinking of pulling those changes to this branch, modify these codes accordingly and push to the same MR#4 again. Is that the same thing?
In GitLab by @martingerlach on Jul 1, 2022, 22:43
Commented on src/parse/utils.py line 90
For now we could:
In GitLab by @geohci on Jul 1, 2022, 22:49
thankfully can't break anything but one approach to test first is to create another local repository on your computer, checkout this merge request using the approach i mentioned in Slack, do the git-rebase and then manually verify. if all goes well, just delete the copy of the repo and do git-rebase for real. if not, debug. i'm not sure about your proposed approach (might work) but git rebase is good to learn.
changed this line in version 4 of the diff
added 7 commits
main
I took the second approach here.
I have tried rebasing, manually corrected the conflicts and did git push -f
. Could you verify if it's okay, @geohci ?
added 1 commit
resolved all threads
In GitLab by @martingerlach on Jul 4, 2022, 14:07
Commented on src/parse/article.py line 7
Category should be imported only once
In GitLab by @martingerlach on Jul 4, 2022, 14:29
Commented on src/parse/utils.py line 90
so this essentially aims to strip the namespace-title from the article-title. I would add a short comment to that line so we know immediately what this is doing.
however, I do wonder if this approach works in general for all titles (and not just categories). Some article titles seem to contain ":", where it is part of the actual title and is not just a marker of the namespace. In this case, the normalization would introduce an error. A quick check for English Wikipedia showed there are 35k articles in namespace 0 with ":" in its titles (another 100k titles from redirects). Thus, we might want to go with the former suggestion or restrict the current approach only to type=category.
In addition, I would suggest to avoid duplicating link.strip().replace("_", " ")
in several different lines. We might improve this later and then might miss to apply changes to both instances.
changed this line in version 7 of the diff
I have currently added the comment and updated the duplicate line. Regarding the issue of links containing ":" themselves, do all links start with a namespace marker before them? e.g: "./Template:www.linksomething.com" , "./Category:www.somethingelse.com"
In GitLab by @martingerlach on Jul 4, 2022, 23:47
Commented on src/parse/utils.py line 90
I think that the link-title always starts with "\<namespace>:" except if it is the main namespace (an internal link to another article). The problem comes from the latter: If the title contains ":" then normalizing the title via link = link.split(":", 1)[1]
will split off part of the actual link-title. For example, a link to the article Star_Trek:_Enterprise would yield "_Enterprise" (which is the wrong article).
It seems, for now we have know other way but to use the namespace mapping. However, there might still be ambiguations, right? What if "Template: A cool anime name" is an actual internal wikilink?
In GitLab by @geohci on Jul 5, 2022, 02:23
Commented on src/parse/utils.py line 90
I believe there are restrictions that prevent ambiguous article names like that: https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Other_problematic_characters
I put together the patch that will include all the namespace names on Wikipedia so we should be close to adding namespace detection to the Wikilink class and will create that merge request when this is merged. makes me wonder if this function should move under the Wikilink class (and Category/File classes would have their own too) rather than try to have a single function that covers all the different types?
In GitLab by @geohci on Jul 5, 2022, 02:23
Commented on src/parse/article.py line 7
I think the ExternalLink import was accidentally removed
In GitLab by @geohci on Jul 5, 2022, 02:23
Commented on src/parse/utils.py line 14
what is the purpose of this code? if it's for plaintext generation, should we instead replace it with the html library that mwparserfromhell uses? if that's the case, we can probably just remove and leave for a later merge request when we get to that stage. same for the htmlDecode function later in this file if it's not being used currently.
In GitLab by @geohci on Jul 5, 2022, 02:23
I think so...
Oh yeah, currently they are not being used in any function but I had used them previously as handy utils while extracting plaintexts. It is supposed to remove the html entities
from strings e.g: >, < etc.
changed this line in version 9 of the diff
changed this line in version 10 of the diff
@geohci I am little unclear about your suggestion here tbh. We can definitely write class specific title normalization method, and it would look (and read) much better too, I think. Do you suggest we use a new issue for that?
In GitLab by @geohci on Jul 5, 2022, 19:57
Commented on src/parse/utils.py line 90
We can definitely write class specific title normalization method, and it would look (and read) much better too, I think. Do you suggest we use a new issue for that?
Yep, that sounds perfect!
In GitLab by @geohci on Jul 5, 2022, 21:14
resolved all threads
In GitLab by @geohci on Jul 5, 2022, 21:15
mentioned in commit fa34018558a85fd56506cebf028ec94a813f9557
Merges 7-category-extraction -> main
Closes #7