add functions to extract plaintexts to library #30

appledora closed 2 years ago

In GitLab by @appledora on Jul 12, 2022, 15:45

In GitLab by @geohci on Jul 15, 2022, 22:23

Potentially interesting plaintext does appear outside of  tags though the vast majority of plaintext does seem to be found in  tags. Maybe  tags too? Tables seem to mostly contain facts/data but not fully-formed sentences.
However lots of non-interesting text appears within  tags too -- e.g., the stub template text -- so filtering to  tags alone is insufficient as a filter.
Knowing whether a  element came from a template or not is an obvious filter that would help reduce the redundant text without needing to build a database of sentences and how often they appear.

To address this, we can traverse all the tags inside the body (except style, meta, etc.), identify their types, and keep or ignore specific tags. Identifying tables is trivial because they always start with the <table> tag.
Identifying stubs is also trivial because we have found the specific class associated with them.
As a rule of thumb, we have observed so far that templates usually have an about attribute whose value has the form #mwtN (N being a number). I think this can be approached in two ways:
- we can traverse each node/tag and check if it's a template
- right at the start we can rip out all the templates the same way we remove all the useless tags like style and meta.
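
A minimal sketch of the first option (checking each tag during traversal), using only Python's standard-library html.parser. It is not the mwparserfromhtml API: the names PlaintextExtractor, extract_plaintext, SKIP_TAGS and STUB_CLASSES are hypothetical, and the "stub" class is a placeholder because the actual class name is not stated in this thread. It also assumes the input HTML has balanced open and close tags.

```python
import re
from html.parser import HTMLParser

# Assumption: template output carries about="#mwtN", as observed above.
MWT_ABOUT = re.compile(r"^#mwt\d+$")
# Void elements never get an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"style", "script", "table"}   # hypothetical ignore list
STUB_CLASSES = {"stub"}                    # placeholder, real class not given here
BLOCK_TAGS = {"p", "div", "li", "section", "h1", "h2", "h3", "h4", "h5", "h6"}


class PlaintextExtractor(HTMLParser):
    """Collects text while skipping tables, styles, stubs and templates."""

    def __init__(self, skip_templates=True):
        super().__init__(convert_charrefs=True)
        self.skip_templates = skip_templates
        self._skip_depth = 0   # > 0 while inside an ignored element
        self.chunks = []

    def _should_skip(self, tag, attrs):
        attrs = dict(attrs)
        if tag in SKIP_TAGS:
            return True
        if set((attrs.get("class") or "").split()) & STUB_CLASSES:
            return True
        about = attrs.get("about") or ""
        return self.skip_templates and bool(MWT_ABOUT.match(about))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an ignored element, every nested tag deepens the skip.
        if self._skip_depth or self._should_skip(tag, attrs):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._skip_depth:
            self._skip_depth -= 1
            return
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")   # keep block elements on separate lines

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_plaintext(html, skip_templates=True):
    parser = PlaintextExtractor(skip_templates)
    parser.feed(html)
    parser.close()
    lines = (re.sub(r"\s+", " ", ln).strip() for ln in "".join(parser.chunks).split("\n"))
    return "\n".join(ln for ln in lines if ln)


sample = ('<body><style>p{}</style><meta charset="utf-8"><p>Keep this.</p>'
          '<table><tr><td>1</td></tr></table><p about="#mwt3">From a template.</p>'
          '<div class="stub">Stub note.</div><p>And <b>this</b>.</p></body>')
print(extract_plaintext(sample))
```

The second option (removing templates up front) would amount to the same check applied in a separate pass before any text is collected; the skip_templates flag here only illustrates that the template filter can be toggled independently of the other filters.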

Overall, though, what remains more confusing to me is how we should structure the output of this method.

In GitLab by @martingerlach on Aug 18, 2022, 14:00

mentioned in commit d48e18f787088ea8afd6d0d9b2ed0677c10300d1

appledora / mwparserfromhtml