Closed appledora closed 2 years ago
requested review from @martingerlach
added 4 commits
main
added 1 commit
added 1 commit
In GitLab by @geohci on Aug 23, 2022, 21:09
Commented on src/parse/article.py line 26
I do think we'll want to remove this attribute. Right now I think it's just misleading ('en' is the language, which we already have, not the page_namespace). What's the reasoning for keeping this in? Also, the page_namespace_id implementation looks good to me -- thanks!
In GitLab by @geohci on Aug 23, 2022, 21:10
Commented on src/parse/utils.py line 248
let's add a quick comment explaining why we skip these
Well, this attribute is actually used in the wiki link namespace extraction task. The NAMESPACE dictionary is nested. The first key is a namespace (I used to think it's only language acronyms, but it also contains simple); inside this are the actual wiki namespaces like article and talks. To get to the actual namespace we have to do something like NAMESPACE[primary_namespace][secondary_namespace]. The primary namespace for the wikilinks comes from the page_namespace. I actually should call it something other than page_namespace. Suggestions?
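For readers following the thread, the two-level lookup described above could be sketched roughly as below. This is only an illustration: the dictionary contents, the name-to-id shape of the inner mapping, and the helper name link_namespace_id are assumptions, not the project's actual NAMESPACE data or code.

```python
# Hypothetical, heavily trimmed stand-in for the nested NAMESPACE dictionary.
# Outer key: the wiki key (what the thread calls primary_namespace).
# Inner key: a namespace prefix; inner value: its numeric namespace id.
NAMESPACE = {
    "en": {"Talk": 1, "User": 2, "Category": 14},
    "simple": {"Talk": 1, "User": 2, "Category": 14},
}


def link_namespace_id(wiki_key: str, link_title: str) -> int:
    """Return the namespace id for a wikilink title, defaulting to 0 (Main)."""
    prefix, sep, _rest = link_title.partition(":")
    if not sep:
        # No colon in the title, so there is no namespace prefix.
        return 0
    # Unknown wiki keys or unknown prefixes fall back to the main namespace.
    return NAMESPACE.get(wiki_key, {}).get(prefix.strip(), 0)
```

With this sketch, link_namespace_id("en", "Talk:Foo") resolves through NAMESPACE["en"]["Talk"], while a plain title such as "Foo" falls back to the main namespace.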
donezo
I just checked and found that the page_namespace and the page_namespace_id do not correspond to each other, so we DEFINITELY have to change it. Good grief =_=
changed this line in version 8 of the diff
added 2 commits
In GitLab by @geohci on Aug 23, 2022, 21:27
Commented on src/parse/article.py line 26
ahhh I think I understand now. I had forgotten about our prior conversation about this (https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/12#note_9977). Sorry, let me try to clarify:
secondary_namespace is what I mean by namespace. This can have two forms -- the numeric identifier (0) or the prefix/name (Main). I prefer that we record the numeric identifier, which is what you have under page_namespace_id. No change needed there.
primary_namespace is the database name for the wiki. This is what you are currently extracting as self.page_namespace. You don't need to change the extraction, but to avoid confusion you'll want to call this self.wiki_db to match how it's usually referred to (I earlier suggested self.wiki, but self.wiki_db is even clearer).
self.wiki_db often looks the same as the language, but there are some crucial differences -- e.g., simple is the wiki_db but en is the language for Simple English Wikipedia. Again, no changes needed there.
added 1 commit
made the renaming changes!
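A minimal sketch of the naming agreed on in this thread, assuming a simple container class. ArticleMeta and its language field are hypothetical and only illustrate why wiki_db and the language are kept apart; the actual class in src/parse/article.py may look different.

```python
from dataclasses import dataclass


@dataclass
class ArticleMeta:
    """Hypothetical container; only the attribute names follow the thread."""

    wiki_db: str            # database name of the wiki, e.g. "simple"
    language: str           # content language, e.g. "en"
    page_namespace_id: int  # numeric namespace id, e.g. 0 for Main


# Simple English Wikipedia: the wiki_db and the language differ.
simple_page = ArticleMeta(wiki_db="simple", language="en", page_namespace_id=0)
```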
In GitLab by @geohci on Aug 23, 2022, 21:58
Commented on src/parse/article.py line 26
perfect, thanks!
In GitLab by @geohci on Aug 23, 2022, 21:58
resolved all threads
In GitLab by @geohci on Aug 24, 2022, 24:57
mentioned in commit 03cb911b3de5edd6c1a53784deb9db29940a5d49
Merges 41-metadata-extraction -> main
Extracts all the metadata from the dump json, except for the ["article_body", "url", "namespace", "name", "in_language"] keys.
Closes #41
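The extraction summarized in the commit message could look roughly like the sketch below. The function name extract_metadata, the constant SKIPPED_KEYS, and the sample record fields are assumptions for illustration; only the list of skipped keys comes from the commit message.

```python
# Keys named in the commit message as excluded from the metadata.
SKIPPED_KEYS = {"article_body", "url", "namespace", "name", "in_language"}


def extract_metadata(record: dict) -> dict:
    """Return every top-level key of a dump record except the skipped ones."""
    return {key: value for key, value in record.items() if key not in SKIPPED_KEYS}


# Illustrative record; field values are made up for the example.
sample_record = {
    "name": "Example",
    "identifier": 123,
    "url": "https://example.org/wiki/Example",
    "article_body": {"html": "<p>...</p>"},
    "date_modified": "2022-08-01T00:00:00Z",
}
metadata = extract_metadata(sample_record)
```

Here metadata keeps only identifier and date_modified, since the other three keys are in the skipped set.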