appledora / mwparserfromhtml

An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, DO NOT PR.
https://pypi.org/project/mwparserfromhtml/
MIT License

Resolve "Create Documentation" - [merged] #60

Closed appledora closed 2 years ago

appledora commented 2 years ago

Merges 39-create-documentation -> main

I have started to create a basic README-based documentation structure. The Example Usage syntax will change slightly once we deploy to PyPI.

Closes #39

appledora commented 2 years ago

requested review from @martingerlach

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 16:13

Commented on README.md line 1

what about mwparserfromhtml? I liked that because it is even closer to mwparserfromhell.

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 16:13

Commented on README.md line 1

Also mwparserfromhtml is more clear in terms of what it does.

appledora commented 2 years ago

wait a minute, I think we had originally decided on mwparserfromhtml and I completely forgot about it when writing the documentation -_- Apologies!!

appledora commented 2 years ago

changed this line in version 3 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:20

resolved all threads

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:29

Commented on README.md line 3

typo

`mwparserfromhtml` is a Python library for parsing and mining metadata from the Enterprise HTML Dumps that have recently been made available by [Wikimedia Enterprise](https://enterprise.wikimedia.com/). The six most recent Enterprise HTML dumps can be accessed from [*this location*](https://dumps.wikimedia.org/other/enterprise_html/runs/). The aim of this library is to provide an interface to work with these HTML dumps and extract the most relevant features from an article.
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:40

Commented on README.md line 8

When rendering content, MediaWiki converts wikitext to HTML, expanding macros so that additional material is included. As a result, the HTML version of a Wikipedia page generally contains more information than the original source wikitext. So it is reasonable that anyone who wants to analyze Wikipedia's content as it appears to its readers would prefer to work with HTML rather than wikitext. Traditionally, only the wikitext version has been available in the [XML-dumps](https://dumps.wikimedia.org/backup-index.html). With the introduction of the Enterprise HTML dumps in 2021, anyone can now easily access and use HTML dumps (and they should).
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:49

Commented on README.md line 10

However, parsing HTML to extract the necessary information is not a simple process. A user may know how to work with HTML in general but not be familiar with the specific format of the dump files. Also, the HTML that the MediaWiki API generates from wikitext has many edge cases, and getting a grasp of its structure requires a close reading of the documentation. Identifying the features in this HTML is no trivial task! Because of all these hassles, it is likely that individuals would continue working with wikitext, as there are already excellent ready-to-use parsers for it (such as [mwparserfromhell](https://github.com/earwig/mwparserfromhell)).
Therefore, we wanted to write a Python library that can efficiently parse the HTML code of an article from the Wikimedia Enterprise dumps to extract relevant elements such as text, links, templates, etc. This will hopefully lower the technical barriers to working with the HTML dumps and empower researchers and others to take advantage of this beneficial resource.
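
To make "the specific format of the dump files" concrete, here is a minimal sketch of reading such a dump with the standard library only. It assumes (this is not confirmed anywhere in this thread) that a dump is a gzipped tar archive of newline-delimited JSON files and that each record carries `name` and `article_body.html` keys. The helper names `iter_articles` and `build_sample_dump` are hypothetical and are not part of `mwparserfromhtml`.

```python
import io
import json
import tarfile


def iter_articles(fileobj):
    """Yield (title, html) pairs from a gzipped tar of NDJSON files."""
    # Assumption: every tar member is a newline-delimited JSON file and every
    # record has a "name" key and an "article_body" -> "html" key.
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            for line in handle:
                if not line.strip():
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                yield record["name"], record["article_body"]["html"]


def build_sample_dump():
    """Build a tiny in-memory archive so the sketch runs without a download."""
    rows = [
        {"name": "Example", "article_body": {"html": "<p>Hello</p>"}},
        {"name": "Other", "article_body": {"html": "<p>World</p>"}},
    ]
    payload = "\n".join(json.dumps(row) for row in rows).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="sample_0.ndjson")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


articles = list(iter_articles(build_sample_dump()))
```

For a real dump, the same generator could be fed an opened file object instead of the in-memory sample; streaming record by record avoids loading the whole archive into memory.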
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:50

Commented on README.md line 16

* Generate summary statistics for the articles in the dump
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:50

Commented on README.md line 15

* Easily extract the content of an article from the HTML dump and customize the level of detail
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:51

Commented on README.md line 25

Question (I am very naive): Does this automatically resolve any dependencies on other packages such as BeautifulSoup?

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:55

Commented on README.md line 45

* Extract the plain text of an article from the dump, i.e. remove anything that is not text (e.g. a link is replaced by its [anchor text](Anchor_text)):
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 19:57

Commented on README.md line 71

* Generate summary statistics of the dump:
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 20:00

Commented on README.md line 63

* Parse HTML string of a Wikipedia article (in a file `FILE.html`) and extract features (such as templates) 
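
As a rough illustration of what "extract features (such as templates)" involves, the sketch below collects template names from an HTML string using only the standard library. It assumes that the HTML marks transcluded templates with `typeof="mw:Transclusion"` and a JSON `data-mw` attribute whose `parts` list holds `template.target.wt` entries. The class name `TemplateCollector` and the sample HTML are hypothetical; this is not the `mwparserfromhtml` API.

```python
import json
from html.parser import HTMLParser


class TemplateCollector(HTMLParser):
    """Collect template names from elements marked as transclusions."""

    def __init__(self):
        super().__init__()
        self.templates = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Assumption: transclusions carry typeof="mw:Transclusion".
        typeof = attributes.get("typeof") or ""
        if "mw:Transclusion" not in typeof.split():
            return
        # Assumption: data-mw is JSON with a "parts" list; some parts may be
        # plain strings, so only dict parts with a "template" key are used.
        data_mw = json.loads(attributes.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            if isinstance(part, dict) and "template" in part:
                self.templates.append(part["template"]["target"]["wt"])


# Hypothetical sample standing in for the contents of FILE.html.
SAMPLE_HTML = (
    '<p>Intro</p>'
    '<table typeof="mw:Transclusion" about="#mwt1" '
    'data-mw=\'{"parts":[{"template":{"target":{"wt":"Infobox person"},'
    '"params":{},"i":0}}]}\'></table>'
)

collector = TemplateCollector()
collector.feed(SAMPLE_HTML)
template_names = collector.templates
```

The number of special cases hidden behind those two assumptions is a large part of why a dedicated library is useful here.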
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 20:10

Commented on README.md line 87

This project was started as part of an [Outreachy](https://www.outreachy.org/) internship from May to August 2022. This project has benefited greatly from the work of Earwig ([mwparserfromhell](https://github.com/earwig/mwparserfromhell)) and Slavina Stefanova ([mwsql](https://github.com/mediawiki-utilities/python-mwsql)).
appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 20:11

Commented on README.md line 80

Question: will there be links to these items?

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 19, 2022, 20:24

Commented on README.md line 49

Should we explain some of these arguments? (either here or in the function in article.py)

appledora commented 2 years ago

changed this line in version 4 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

changed this line in version 5 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

In GitLab by @geohci on Aug 19, 2022, 23:49

Commented on README.md line 87

URLs here, I think, for mwparserfromhell and mwsql

appledora commented 2 years ago

Yes, I think once we build an appropriate setup.py script before packaging for pip, these issues would be handled internally? @geohci

appledora commented 2 years ago

Technically, we don't do any replacements in this version. We only skip printing the text from particular elements (e.g. skip_categories leaves the categories out of the generated plaintext).

appledora commented 2 years ago

changed this line in version 6 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

In GitLab by @geohci on Aug 20, 2022, 00:14

Commented on README.md line 25

yep -- writing that now :) but will be taken care of

appledora commented 2 years ago

changed this line in version 7 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

changed this line in version 8 of the diff

appledora commented 2 years ago

changed this line in version 8 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

I am not sure whether adding the argument explanations to the README.md is a sustainable idea. Instead, I was planning to create a tutorial notebook explaining the type of data we are handling and the functionality of the library in depth. Besides that, I would also add argument definitions in article.py. Or would you suggest that we add a heads-up here too?

appledora commented 2 years ago

changed this line in version 9 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

changed this line in version 10 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

changed this line in version 11 of the diff

appledora commented 2 years ago

added 1 commit

appledora commented 2 years ago

Like from a list of contents?

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 22, 2022, 14:55

Commented on README.md line 25

ok, then nothing to do here.

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 22, 2022, 15:07

Commented on README.md line 45

From the code in utils.df

elif tag_obj.name in ["Wikilink", "ExternalLink", "Category"]:
    if skip_categories and tag_obj.name == "Category":
        continue
    else:
        yield tag_obj.plaintext if len(
            tag_obj.plaintext
        ) > 0 else tag_obj.title

I get that a WikiLink-object is replaced by its plaintext (i.e. the anchor-text of the link). Is my understanding correct?
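
Under that reading, the branch behaves roughly like the self-contained sketch below. The `Tag` dataclass and the `iter_plaintext` wrapper are hypothetical stand-ins for the library's element objects and surrounding loop, and the final `else` branch is an assumption about how other elements are handled; only the link and category logic is taken from the snippet above.

```python
from dataclasses import dataclass


@dataclass
class Tag:
    # Hypothetical stand-in for the library's parsed element objects.
    name: str
    plaintext: str = ""
    title: str = ""


def iter_plaintext(tags, skip_categories=False):
    """Yield the text contributed by each element, mirroring the quoted branch."""
    for tag_obj in tags:
        if tag_obj.name in ["Wikilink", "ExternalLink", "Category"]:
            if skip_categories and tag_obj.name == "Category":
                continue
            # A link contributes its anchor text, or its title if it has none.
            yield tag_obj.plaintext if len(tag_obj.plaintext) > 0 else tag_obj.title
        else:
            # Assumption: every other element simply contributes its plaintext.
            yield tag_obj.plaintext


tags = [
    Tag("Text", "See "),
    Tag("Wikilink", "the anchor", "Target_page"),
    Tag("Text", " and "),
    Tag("Wikilink", "", "Bare_target"),
    Tag("Category", "", "Category:Examples"),
]
text = "".join(iter_plaintext(tags, skip_categories=True))
```

With `skip_categories=True` the category is dropped entirely, the first link is reduced to its anchor text, and the second link falls back to its title because its anchor text is empty.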

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 22, 2022, 15:10

Commented on README.md line 80

Currently it is only a bulleted list:

  • Licensing
  • Issue Tracker
  • Documentation
  • Contribution Guidelines

Will there be any additional information here about these items? For example, where is the issue tracker, or what are the contribution guidelines? Or is this only a placeholder for the future?

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 22, 2022, 15:12

Commented on README.md line 49

What you suggest sounds good:

appledora commented 2 years ago

added 7 commits

appledora commented 2 years ago

ah....yes, you are correct. For these particular elements, we have to do a bit more processing to get the plaintext. I am resolving this thread :smiley:

appledora commented 2 years ago

changed this line in version 13 of the diff