appledora / mwparserfromhtml

An unofficial mirror of our repo of the `mwparserfromhtml` package. It is a python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.
https://pypi.org/project/mwparserfromhtml/
MIT License
4 stars 0 forks source link

Feature/wikilink : issue 5 - [merged] #48

Closed appledora closed 2 years ago

appledora commented 2 years ago

Merges feature/wikilinks -> main

Solves issue #5

appledora commented 2 years ago

requested review from @geohci

appledora commented 2 years ago

I don't want make this a part of the class instance.

appledora commented 2 years ago

but i also don't want to rewrite this method for multiple subclasses. What could be a work around?

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/dump/dump.py line 9

this makes sense to me. just a note (no change needed now) that when we merge the tests over, they'll need their file paths updated too

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/const.py line 3

I like this general approach of having a general template with the details you'd need to extract a new type of element. Do we have any sense of how well this will scale -- e.g., could this cover the example of Categories that were actually under mw:WikiLink? I don't think we need to abandon it if not, but something to consider as we extend coverage.

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/elements.py line 9

taking a step back to make sure we need this type mapping {1:standard, etc.} .

drawbacks:

traditionally the benefit of having an enumeration like this are:

thoughts? we can always keep as is and make this an issue for a future merge request too if it feels like too large of a change/discussion to include in this merge request

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/const.py line 7

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/elements.py line 21

Why don't we use the WIKILINK['name'] attribute here? I might be confused as to the purpose of the different name attributes

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/utils.py line 171

this is fine as is but we might want to return to it later depending on how it's used. for example, most wikis (but not all) also require the first letter to be capital (https://www.mediawiki.org/wiki/Manual:Page_title) and we might want to remove the section names (#...) from the title too.

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/article.py line 71

what's the difference between a BeautifulSoup object and Parsed Html?

appledora commented 2 years ago

In GitLab by @geohci on Jun 23, 2022, 23:34

Commented on src/parse/article.py line 73

should this be List[str] or List[Wikilink] (honest question - I'm not great at types / docstring norms)

appledora commented 2 years ago

If you mean the docstring part, then it should be List[Wikilink]. I forgot to update it because this was earlier returning static strings and I had generated the docstrings then. Updating this. Thanks!!

appledora commented 2 years ago

It's actually the same thing. Should I remove it?

appledora commented 2 years ago

I will add the update to the #4 PR

appledora commented 2 years ago

Categories that were actually under mw:WikiLink mmm... I think we still need a decision on how we should label these examples (i.e: as Categories or as WikiLinks). My personal intuition would be to not have them as Wikilinks and handle separately, while extracting Categories ?

appledora commented 2 years ago

I think, on top of my head, I was thinking that these type are the actual subclasses of Wikilinks, while redirect/disambiguation are rather the states of these wikilinks. I am not yet sure, which of these is a better practice, tbh. Suggestions?

appledora commented 2 years ago

The name attribute of the element class, is only there for the __str__() method and therefore, pretty unimportant. In line 21, if we choose to keep this attribute, WIKILINK['name'] should be used. What should be the better choice here? Removing the name attribute as it doesn't have a super-high importance here so far?

appledora commented 2 years ago

I think we should create an issue for this one, as something we can get back to later.

appledora commented 2 years ago

changed this line in version 5 of the diff

appledora commented 2 years ago

changed this line in version 5 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

Created issue #8

appledora commented 2 years ago

Removed the "Parsed HTML" portion from the docstring.

appledora commented 2 years ago

Fixed the typo here.

appledora commented 2 years ago

It's better to solve the issue at this stage, because similar patterns may arise in extracting the later Elements. I also, do not like how I have handled this here and I was primarily motivated by, reducing memory overhead if there's a large number of duplicate strings that are being stored instead as integers. . I would really appreciate your suggestions.

In the later parts of this file, I have also created a get_type() method. I am not sure whether this should be a part of the superclass (Elements) instead. What do you think?

appledora commented 2 years ago

In GitLab by @geohci on Jun 24, 2022, 21:21

Commented on src/parse/elements.py line 9

My general feeling is that we don't need to worry about memory overhead from the type storing. I looked at a very link-rich article as a worse-case scenario and the byte savings from storing integers over strings is only 10% of the size of just the HTML string (ignoring the size of beautifulsoup object etc.) so a minor contribution to memory usage (code). this sort of optimization makes much more sense when e.g., working with language models where words are repeated many times and you want to use arrays to do look-ups instead of hashmaps.

so I'd simplify down at least to just storing the types as strings and not maintaining the lookup (worse case we can always go back later to a lookup if that feels necessary). but I also wonder if we need this generic self.type list or whether to use the more specific self.redlink etc., which are not as general but are easy to access by the user and are explicit about what types are allowed for an element.

if you still want a get_type() function, it would then be element-specific and report the attributes that were True. what's the use case that motivates creating the get_type() function?

appledora commented 2 years ago

In GitLab by @geohci on Jun 24, 2022, 21:21

Commented on src/parse/const.py line 3

yeah, that makes sense. well let's ignore for now so as not to complicate this merge request too much. maybe create an issue with the example you shared with us so we can talk through it and figure out how widespread it is etc.?

appledora commented 2 years ago

In GitLab by @geohci on Jun 24, 2022, 21:21

Commented on src/parse/const.py line 7

there probably is a distinction between e.g., redlinks as opposed to transcluded links but in practice I think it might be confusing to store them differently. I think as long as the type doesn't change what other attributes you store about the Wikilink, it can just be an attribute that the user has easy access to. in general, you'll notice I'm trying to push for simplicity/consistency at least until we have very good reason to add complexity. Category/image/external links are different and need their own class because they probably need different attributes. If we decide interwiki is different enough, we might also make that its own class (or a subtype of external link).

appledora commented 2 years ago

In GitLab by @geohci on Jun 24, 2022, 21:21

Commented on src/parse/elements.py line 21

looks like you could also just get the classname directly (WikiLink in this case) without having to track it elsewhere and that might be easiest. i'm okay with trying to keep a nice name like that for print purposes: https://stackoverflow.com/questions/521502/how-to-get-the-concrete-class-name-as-a-string

appledora commented 2 years ago

Yes, this is a much better idea. I will be removing the lookup dictionary. Instead, converting the subtypes as boolean attributes. I hope I understood you correctly.

appledora commented 2 years ago

Based on the previous review, this won't be needed.

appledora commented 2 years ago

Created issue #11

appledora commented 2 years ago

So, as discussed in a previous review thread, I am converting these(redlink, disambiguation, redirect) to Boolean attributes of the wikilinks class.

self.redlink = True;
self.disambiguation = True;
self.redirect = False
appledora commented 2 years ago

In GitLab by @geohci on Jun 24, 2022, 23:34

Commented on src/parse/elements.py line 9

yeah, let's try this out and we can always adjust later if we have good reason but I think this will be the simplest approach for now.

appledora commented 2 years ago

Hang on, I used to know about this thing. Why did I forget? -_-

appledora commented 2 years ago

resolved all threads

appledora commented 2 years ago

changed this line in version 6 of the diff

appledora commented 2 years ago

changed this line in version 6 of the diff

appledora commented 2 years ago

changed this line in version 6 of the diff

appledora commented 2 years ago

added 5 commits

Compare with previous version

appledora commented 2 years ago

In GitLab by @martingerlach on Jun 27, 2022, 18:19

Commented on src/parse/elements.py line 25

This refers to an interwiki (as in to-another-wiki) right? If yes, should we name this "self.interwiki"? Otherwise this might cause confusion with internal as within-the-same-wiki link (thats how I read it) and causes confusion with the definition of standard.

appledora commented 2 years ago

changed this line in version 7 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

In GitLab by @martingerlach on Jun 27, 2022, 18:25

Commented on src/parse/utils.py line 172

Is the indentation correct?

    return link.strip().replace("_", " ")
appledora commented 2 years ago

changed this line in version 8 of the diff

appledora commented 2 years ago

added 1 commit

Compare with previous version

appledora commented 2 years ago

Updated the change.