appledora / mwparserfromhtml

An unofficial mirror of our repo of the `mwparserfromhtml` package. It is a Python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.
https://pypi.org/project/mwparserfromhtml/
MIT License

feature: metadata extraction - [merged] #62

appledora closed 2 years ago

appledora commented 2 years ago

Merges 41-metadata-extraction -> main

Extracts all metadata from the dump JSON, except for the keys ["article_body", "url", "namespace", "name", "in_language"].
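A minimal sketch of the filtering described above, for illustration only. The names `extract_metadata` and `EXCLUDED_KEYS` and the sample record are assumptions, not the code merged in this MR; only the list of skipped keys comes from the description.

```python
import json

# Keys the MR description lists as excluded from metadata extraction.
EXCLUDED_KEYS = {"article_body", "url", "namespace", "name", "in_language"}


def extract_metadata(dump_line: str) -> dict:
    """Return every top-level key of one dump record except the excluded ones.

    Hypothetical helper: assumes each line of the dump is one JSON object.
    """
    record = json.loads(dump_line)
    return {key: value for key, value in record.items() if key not in EXCLUDED_KEYS}


# Made-up sample record; the non-excluded field names are illustrative only.
sample = json.dumps(
    {
        "name": "Example",
        "identifier": 1,
        "date_modified": "2022-08-01",
        "article_body": {"html": "<p>...</p>"},
    }
)
print(extract_metadata(sample))  # {'identifier': 1, 'date_modified': '2022-08-01'}
```

Keeping the skipped keys in one named constant also gives an obvious place for a comment explaining why they are skipped.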

Closes #41

appledora commented 2 years ago

requested review from @martingerlach

appledora commented 2 years ago

added 4 commits


appledora commented 2 years ago

added 1 commit


appledora commented 2 years ago

added 1 commit


appledora commented 2 years ago

added 1 commit


appledora commented 2 years ago

In GitLab by @geohci on Aug 23, 2022, 21:09

Commented on src/parse/article.py line 26

I do think we'll want to remove this attribute. Right now I think it's just misleading ('en' is the language, which we already have, not the page_namespace). What's the reasoning for keeping this in? Also, the page_namespace_id implementation looks good to me -- thanks!

appledora commented 2 years ago

In GitLab by @geohci on Aug 23, 2022, 21:10

Commented on src/parse/utils.py line 248

let's add a quick comment explaining why we skip these

appledora commented 2 years ago

Well, this attribute is actually used in the wikilink namespace extraction task. The NAMESPACE dictionary is nested. The first key is a namespace (I used to think these were only language acronyms, but they also include simple); inside it are the actual wiki namespaces, like article and talk. To get to the actual namespace we have to do something like NAMESPACE[primary_namespace][secondary_namespace]. The primary namespace for the wikilinks comes from the page_namespace. I should actually call it something other than page_namespace. Suggestions?
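The two-level lookup described here could be sketched as follows. The contents of `NAMESPACE` below are a made-up, truncated stand-in (the real mapping in the library is not shown in this thread, and the idea that the inner values are numeric namespace ids is an assumption), and `lookup_namespace` is a hypothetical helper name.

```python
from typing import Optional

# Hypothetical, truncated stand-in for the library's nested NAMESPACE mapping.
# Outer key: the "primary" key taken from the page (e.g. "en", "simple").
# Inner key: a namespace prefix; inner value: assumed to be a namespace id.
NAMESPACE = {
    "en": {"Talk": 1, "Category": 14},
    "simple": {"Talk": 1, "Category": 14},
}


def lookup_namespace(primary: str, secondary: str) -> Optional[int]:
    """Resolve NAMESPACE[primary][secondary], returning None if either key is missing."""
    return NAMESPACE.get(primary, {}).get(secondary)


print(lookup_namespace("simple", "Category"))  # 14
print(lookup_namespace("en", "NoSuchPrefix"))  # None
```

Using chained `.get` calls avoids a `KeyError` when a wikilink uses a prefix that is not a namespace on that wiki.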

appledora commented 2 years ago

added 1 commit


appledora commented 2 years ago

donezo

appledora commented 2 years ago

I just checked and found that the page_namespace and the page_namespace_id do not correspond to each other, so we DEFINITELY have to change it. Good grief =_=

appledora commented 2 years ago

changed this line in version 8 of the diff

appledora commented 2 years ago

added 2 commits


appledora commented 2 years ago

In GitLab by @geohci on Aug 23, 2022, 21:27

Commented on src/parse/article.py line 26

ahhh I think I understand now. I had forgotten about our prior conversation about this (https://gitlab.wikimedia.org/repos/research/html-dumps/-/merge_requests/12#note_9977). sorry, let me try to clarify:

appledora commented 2 years ago

added 1 commit


appledora commented 2 years ago

made the renaming changes!

appledora commented 2 years ago

In GitLab by @geohci on Aug 23, 2022, 21:58

Commented on src/parse/article.py line 26

perfect, thanks!

appledora commented 2 years ago

In GitLab by @geohci on Aug 23, 2022, 21:58

resolved all threads

appledora commented 2 years ago

In GitLab by @geohci on Aug 24, 2022, 00:57

mentioned in commit 03cb911b3de5edd6c1a53784deb9db29940a5d49