feature: extract categories and normalize category links - [merged]

appledora commented 2 years ago

Merges 7-category-extraction -> main

Closes #7

appledora commented 2 years ago

requested review from @martingerlach

appledora commented 2 years ago

added 1 commit

52d2db45 - update: added documentation

Compare with previous version

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 1, 2022, 16:43

Commented on src/parse/article.py line 8

Should we also remove this import as we are not using this currently (I think we discussed this in another context too?)

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 1, 2022, 16:47

Commented on src/parse/article.py line 85

in get_wikilinks you are using regex, should we do the same here too, i.e. re.compile("mw:PageProp/Category")? what would be the difference in the output?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 1, 2022, 17:03

Commented on src/parse/utils.py line 90

I think this will only work in English Wikipedia. While I havent checked the formatting in the HTML-dumps, I am sure the link-title will contain the name of the category-namespace in the corresponding wiki. For example, the category https://en.wikipedia.org/wiki/Category:Perception in English Wikipedia appears as https://de.wikipedia.org/wiki/Kategorie:Wahrnehmung in German Wikipedia.

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

added 1 commit

f552c4d4 - Update article.py

Compare with previous version

appledora commented 2 years ago

Yes, auto-import by my ide. Thanks for pointing this out!

appledora commented 2 years ago

So the idea here is that, in Wikilink, using regex helped me to also capture the interwikilinks that has the format mw:Wikilink/interwiki. In case of categories, i had to use a stricter pattern because there is also another rel value present in the html that looks something like mw:PageProp/Categorysort... - which we don't want for the object here. I hope I have managed to explain my thoughts here on this :3

appledora commented 2 years ago

This makes sense. What would you suggest here?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 1, 2022, 22:33

Commented on src/parse/article.py line 85

Thanks for the explanation. That makes sense if the pattern for categories is much more narrowly defined. Then looks good to me.

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 22:39

Commented on src/parse/utils.py line 90

i have an issue to add these namespace names to a static file somewhere that can be imported and used for namespace detection / link cleaning so probably will have to wait till that (sorry)

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 22:39

now that MR5 is merged, should we rebase this to clean up?

appledora commented 2 years ago

I actually have very little experiences with git-rebase (scared, to be honest). But after MR#5 I was thinking of pulling those changes to this branch, modify these codes accordingly and push to the same MR#4 again. Is that the same thing?

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 1, 2022, 22:43

Commented on src/parse/utils.py line 90

For now we could:

keep the namespace and only strip the "./" so we keep the article-title. once we have Isaac's namespace-mapping, we can pass the full article title and split into namespace and title without namespace
strip everything before the first ":" as a naive way to strip namespace information. in this specific case, we know the namespace (category) from the annotation of the link in the HTML

appledora commented 2 years ago

In GitLab by @geohci on Jul 1, 2022, 22:49

thankfully can't break anything but one approach to test first is to create another local repository on your computer, checkout this merge request using the approach i mentioned in Slack, do the git-rebase and then manually verify. if all goes well, just delete the copy of the repo and do git-rebase for real. if not, debug. i'm not sure about your proposed approach (might work) but git rebase is good to learn.

appledora commented 2 years ago

changed this line in version 4 of the diff

appledora commented 2 years ago

added 7 commits

f552c4d4...d810eda8 - 2 commits from branch main
08ba9487 - feature: extract categories and normalize category links
53c132b5 - update: added documentation
dd321cc6 - Update article.py
406605fa - update: rebased main after MR#5
2a3c267d - update: rebased main after MR#5

Compare with previous version

appledora commented 2 years ago

I took the second approach here.

appledora commented 2 years ago

I have tried rebasing, manually corrected the conflicts and did git push -f. Could you verify if it's okay, @geohci ?

appledora commented 2 years ago

added 1 commit

52d9753f - update: removed attribute from Category class

Compare with previous version

appledora commented 2 years ago

added 1 commit

7c2d8c71 - update: minor typo fix in documentation

Compare with previous version

appledora commented 2 years ago

resolved all threads

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 4, 2022, 14:07

Commented on src/parse/article.py line 7

Category should be imported only once

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 4, 2022, 14:29

Commented on src/parse/utils.py line 90

so this essentially aims to strip the namespace-title from the article-title. I would add a short comment to that line so we know immediately what this is doing.

however, I do wonder if this approach works in general for all titles (and not just categories). Some article titles seem to contain ":", where it is part of the actual title and is not just a marker of the namespace. In this case, the normalization would introduce an error. A quick check for English Wikipedia showed there are 35k articles in namespace 0 with ":" in its titles (another 100k titles from redirects). Thus, we might want to go with the former suggestion or restrict the current approach only to type=category.

In addition, I would suggest to avoid duplicating link.strip().replace("_", " ") in several different lines. We might improve this later and then might miss to apply changes to both instances.

appledora commented 2 years ago

changed this line in version 7 of the diff

appledora commented 2 years ago

added 1 commit

bf727afa - Update article.py

Compare with previous version

appledora commented 2 years ago

added 1 commit

a6ad3dec - Update utils.py

Compare with previous version

appledora commented 2 years ago

I have currently added the comment and updated the duplicate line. Regarding the issue of links containing ":" themselves, do all links start with a namespace marker before them? e.g: "./Template:www.linksomething.com" , "./Category:www.somethingelse.com"

appledora commented 2 years ago

In GitLab by @martingerlach on Jul 4, 2022, 23:47

Commented on src/parse/utils.py line 90

I think that the link-title always starts with "\<namespace>:" except if it is the main namespace (an internal link to another article). The problem comes from the latter: If the title contains ":" then normalizing the title via link = link.split(":", 1)[1] will split off part of the actual link-title. For example, a link to the article Star_Trek:_Enterprise would yield "_Enterprise" (which is the wrong article).

appledora commented 2 years ago

It seems, for now we have know other way but to use the namespace mapping. However, there might still be ambiguations, right? What if "Template: A cool anime name" is an actual internal wikilink?

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 02:23

Commented on src/parse/utils.py line 90

I believe there are restrictions that prevent ambiguous article names like that: https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)#Other_problematic_characters

I put together the patch that will include all the namespace names on Wikipedia so we should be close to adding namespace detection to the Wikilink class and will create that merge request when this is merged. makes me wonder if this function should move under the Wikilink class (and Category/File classes would have their own too) rather than try to have a single function that covers all the different types?

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 02:23

Commented on src/parse/article.py line 7

I think the ExternalLink import was accidentally removed

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 02:23

Commented on src/parse/utils.py line 14

what is the purpose of this code? if it's for plaintext generation, should we instead replace it with the html library that mwparserfromhell uses? if that's the case, we can probably just remove and leave for a later merge request when we get to that stage. same for the htmlDecode function later in this file if it's not being used currently.

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 02:23

I think so...

appledora commented 2 years ago

Oh yeah, currently they are not being used in any function but I had used them previously as handy utils while extracting plaintexts. It is supposed to remove the html entities from strings e.g: &gt, &lt etc.

appledora commented 2 years ago

changed this line in version 9 of the diff

appledora commented 2 years ago

added 1 commit

95eb4dea - Update article.py

Compare with previous version

appledora commented 2 years ago

changed this line in version 10 of the diff

appledora commented 2 years ago

added 1 commit

4ea7d5b4 - Update: cleaned utils.py

Compare with previous version

appledora commented 2 years ago

@geohci I am little unclear about your suggestion here tbh. We can definitely write class specific title normalization method, and it would look (and read) much better too, I think. Do you suggest we use a new issue for that?

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 19:57

Commented on src/parse/utils.py line 90

We can definitely write class specific title normalization method, and it would look (and read) much better too, I think. Do you suggest we use a new issue for that?

Yep, that sounds perfect!

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 21:14

resolved all threads

appledora commented 2 years ago

In GitLab by @geohci on Jul 5, 2022, 21:15

mentioned in commit fa34018558a85fd56506cebf028ec94a813f9557

appledora / mwparserfromhtml

feature: extract categories and normalize category links - [merged] #50