Closed appledora closed 2 years ago
requested review from @geohci
I don't want make this a part of the class instance.
but i also don't want to rewrite this method for multiple subclasses. What could be a work around?
added 1 commit
added 1 commit
added 1 commit
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/dump/dump.py line 9
this makes sense to me. just a note (no change needed now) that when we merge the tests over, they'll need their file paths updated too
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/const.py line 3
I like this general approach of having a general template with the details you'd need to extract a new type of element. Do we have any sense of how well this will scale -- e.g., could this cover the example of Categories that were actually under mw:WikiLink? I don't think we need to abandon it if not, but something to consider as we extend coverage.
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/elements.py line 9
taking a step back to make sure we need this type mapping {1:standard, etc.} .
drawbacks:
traditionally the benefit of having an enumeration like this are:
thoughts? we can always keep as is and make this an issue for a future merge request too if it feels like too large of a change/discussion to include in this merge request
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/const.py line 7
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/elements.py line 21
Why don't we use the WIKILINK['name']
attribute here? I might be confused as to the purpose of the different name
attributes
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/utils.py line 171
this is fine as is but we might want to return to it later depending on how it's used. for example, most wikis (but not all) also require the first letter to be capital (https://www.mediawiki.org/wiki/Manual:Page_title) and we might want to remove the section names (#...) from the title too.
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/article.py line 71
what's the difference between a BeautifulSoup object and Parsed Html?
In GitLab by @geohci on Jun 23, 2022, 23:34
Commented on src/parse/article.py line 73
should this be List[str] or List[Wikilink] (honest question - I'm not great at types / docstring norms)
If you mean the docstring part, then it should be List[Wikilink]
. I forgot to update it because this was earlier returning static strings and I had generated the docstrings then. Updating this. Thanks!!
It's actually the same thing. Should I remove it?
I will add the update to the #4 PR
Categories that were actually under mw:WikiLink
mmm... I think we still need a decision on how we should label these examples (i.e: as Categories or as WikiLinks). My personal intuition would be to not have them as Wikilinks and handle separately, while extracting Categories ?
I think, on top of my head, I was thinking that these type
are the actual subclasses of Wikilinks, while redirect/disambiguation are rather the states of these wikilinks. I am not yet sure, which of these is a better practice, tbh. Suggestions?
The name
attribute of the element class, is only there for the __str__()
method and therefore, pretty unimportant. In line 21, if we choose to keep this attribute, WIKILINK['name']
should be used. What should be the better choice here? Removing the name attribute as it doesn't have a super-high importance here so far?
I think we should create an issue for this one, as something we can get back to later.
changed this line in version 5 of the diff
changed this line in version 5 of the diff
Created issue #8
Removed the "Parsed HTML" portion from the docstring.
Fixed the typo here.
It's better to solve the issue at this stage, because similar patterns may arise in extracting the later Elements. I also, do not like how I have handled this here and I was primarily motivated by, reducing memory overhead if there's a large number of duplicate strings that are being stored instead as integers.
. I would really appreciate your suggestions.
In the later parts of this file, I have also created a get_type()
method. I am not sure whether this should be a part of the superclass (Elements) instead. What do you think?
In GitLab by @geohci on Jun 24, 2022, 21:21
Commented on src/parse/elements.py line 9
My general feeling is that we don't need to worry about memory overhead from the type storing. I looked at a very link-rich article as a worse-case scenario and the byte savings from storing integers over strings is only 10% of the size of just the HTML string (ignoring the size of beautifulsoup object etc.) so a minor contribution to memory usage (code). this sort of optimization makes much more sense when e.g., working with language models where words are repeated many times and you want to use arrays to do look-ups instead of hashmaps.
so I'd simplify down at least to just storing the types as strings and not maintaining the lookup (worse case we can always go back later to a lookup if that feels necessary). but I also wonder if we need this generic self.type list or whether to use the more specific self.redlink etc., which are not as general but are easy to access by the user and are explicit about what types are allowed for an element.
if you still want a get_type() function, it would then be element-specific and report the attributes that were True. what's the use case that motivates creating the get_type() function?
In GitLab by @geohci on Jun 24, 2022, 21:21
Commented on src/parse/const.py line 3
yeah, that makes sense. well let's ignore for now so as not to complicate this merge request too much. maybe create an issue with the example you shared with us so we can talk through it and figure out how widespread it is etc.?
In GitLab by @geohci on Jun 24, 2022, 21:21
Commented on src/parse/const.py line 7
there probably is a distinction between e.g., redlinks as opposed to transcluded links but in practice I think it might be confusing to store them differently. I think as long as the type doesn't change what other attributes you store about the Wikilink, it can just be an attribute that the user has easy access to. in general, you'll notice I'm trying to push for simplicity/consistency at least until we have very good reason to add complexity. Category/image/external links are different and need their own class because they probably need different attributes. If we decide interwiki is different enough, we might also make that its own class (or a subtype of external link).
In GitLab by @geohci on Jun 24, 2022, 21:21
Commented on src/parse/elements.py line 21
looks like you could also just get the classname directly (WikiLink in this case) without having to track it elsewhere and that might be easiest. i'm okay with trying to keep a nice name like that for print purposes: https://stackoverflow.com/questions/521502/how-to-get-the-concrete-class-name-as-a-string
Yes, this is a much better idea. I will be removing the lookup dictionary. Instead, converting the subtypes as boolean attributes. I hope I understood you correctly.
Based on the previous review, this won't be needed.
Created issue #11
So, as discussed in a previous review thread, I am converting these(redlink, disambiguation, redirect) to Boolean attributes of the wikilinks class.
self.redlink = True;
self.disambiguation = True;
self.redirect = False
In GitLab by @geohci on Jun 24, 2022, 23:34
Commented on src/parse/elements.py line 9
yeah, let's try this out and we can always adjust later if we have good reason but I think this will be the simplest approach for now.
Hang on, I used to know about this thing. Why did I forget? -_-
resolved all threads
changed this line in version 6 of the diff
changed this line in version 6 of the diff
changed this line in version 6 of the diff
added 5 commits
In GitLab by @martingerlach on Jun 27, 2022, 18:19
Commented on src/parse/elements.py line 25
This refers to an interwiki (as in to-another-wiki) right? If yes, should we name this "self.interwiki"? Otherwise this might cause confusion with internal as within-the-same-wiki link (thats how I read it) and causes confusion with the definition of standard.
changed this line in version 7 of the diff
added 1 commit
In GitLab by @martingerlach on Jun 27, 2022, 18:25
Commented on src/parse/utils.py line 172
Is the indentation correct?
return link.strip().replace("_", " ")
changed this line in version 8 of the diff
Updated the change.
Merges feature/wikilinks -> main
Solves issue #5