appledora / mwparserfromhtml

An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the HTML dumps. Since this is only a mirror, please do not open PRs here.
https://pypi.org/project/mwparserfromhtml/
MIT License

add functions to extract plaintexts to library #30

Closed (appledora closed 2 years ago)

appledora commented 2 years ago

In GitLab by @appledora on Jul 12, 2022, 15:45

appledora commented 2 years ago

In GitLab by @geohci on Jul 15, 2022, 22:23

a few thoughts based on https://public.paws.wmcloud.org/User:Appledora/plaintext_examples.ipynb:

appledora commented 2 years ago
  1. To address this, we can traverse all the tags inside the body (except for styles, meta, etc.), identify their types, and keep or ignore specific tags. Identifying tables is trivial because they always start with the <table> tag.
  2. Identifying the stubs is also trivial because we have identified the specific class associated with them.
  3. As a rule of thumb, we have observed so far that templates usually have an about attribute whose value has the form #mwtN (N being a number). I think this can be approached in two ways:
    • we can traverse each node/tag and check whether it is a template
    • right at the start, we can rip out all the templates the same way we remove all the useless tags like style and meta.
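
A minimal sketch of the three points above, assuming well-formed markup and using only Python's standard-library `html.parser` (not necessarily the parser the library itself uses). It checks each tag during a single traversal, which is the first option under point 3. The names `PlaintextExtractor`, `extract_plaintext`, `SKIP_TAGS`, `BLOCK_TAGS` and `STUB_CLASS` are hypothetical, and the value `"stub"` is only a placeholder because the actual stub class is not named in this thread.

```python
import re
from html.parser import HTMLParser

# Elements whose whole subtree is dropped (point 1).
SKIP_TAGS = {"style", "script", "table"}
# Void elements never get an end tag, so they must not affect the skip depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Block-level elements after which a separator is emitted.
BLOCK_TAGS = {"p", "div", "li", "section", "h1", "h2", "h3", "h4", "h5", "h6"}
# Placeholder: the real stub class is not given in this thread (point 2).
STUB_CLASS = "stub"
# Rule of thumb from point 3: templates carry about="#mwtN".
TEMPLATE_ABOUT = re.compile(r"#mwt\d+")


class PlaintextExtractor(HTMLParser):
    """Collects text while skipping tables, stubs and (optionally) templates."""

    def __init__(self, skip_templates=True):
        super().__init__(convert_charrefs=True)
        self.skip_templates = skip_templates
        self._skip_depth = 0  # > 0 while inside a skipped subtree
        self._chunks = []

    def _should_skip(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            return True
        if STUB_CLASS in (attrs.get("class") or "").split():
            return True
        about = attrs.get("about") or ""
        return self.skip_templates and TEMPLATE_ABOUT.fullmatch(about) is not None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a skipped subtree, every nested element deepens it.
        if self._skip_depth or self._should_skip(tag, attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def extract_plaintext(html, skip_templates=True):
    parser = PlaintextExtractor(skip_templates=skip_templates)
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_plaintext(html)` returns one flat string; passing `skip_templates=False` keeps template text in the result. Because the depth counter relies on matching start and end tags, unbalanced markup inside a skipped subtree would need extra handling.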

But overall, what remains more confusing to me is how we should structure the output of this method.

appledora commented 2 years ago

created branch 32-add-functions-to-extract-plaintexts-to-library to address this issue

appledora commented 2 years ago

In GitLab by @martingerlach on Aug 18, 2022, 14:00

mentioned in commit d48e18f787088ea8afd6d0d9b2ed0677c10300d1