Closed appledora closed 2 years ago
In GitLab by @geohci on Jul 15, 2022, 22:23
a few thoughts based on https://public.paws.wmcloud.org/User:Appledora/plaintext_examples.ipynb:
<p> tags though the vast majority of plaintext does seem to be found in <p> tags. Maybe <span> tags too? Tables seem to mostly contain facts/data but not fully-formed sentences.
<p> tags too -- e.g., the stub template text -- so filtering to <p> tags alone is insufficient.
<p> element came from a template or not is an obvious filter that would help reduce the redundant text without needing to build a database of sentences and how often they appear.
<table> tag.
about attribute which has a value in the form #mwtN (N representing a number). This can be approached in two ways, I think:
But overall, what remains more confusing for me is how we should structure the output of this method.
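To make the template check discussed above concrete, here is a minimal sketch using only the Python standard library. It assumes, as described in this thread, that template-generated elements carry an about attribute of the form #mwtN. The names ParagraphExtractor, extract_paragraphs and TEMPLATE_ABOUT are hypothetical and not part of the library, and returning a flat list of strings is just one possible answer to the output-structure question.

```python
import re
from html.parser import HTMLParser

# Assumption: template output is marked with about="#mwtN" (N is a number).
TEMPLATE_ABOUT = re.compile(r"^#mwt\d+$")


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements and flag template-generated ones."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []          # list of (text, from_template) pairs
        self._in_p = False            # True while inside a <p> element
        self._chunks = []             # text pieces of the current <p>
        self._from_template = False   # whether the current <p> has about="#mwtN"

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self._in_p:
            about = dict(attrs).get("about") or ""
            self._from_template = bool(TEMPLATE_ABOUT.match(about))
            self._chunks = []
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            text = "".join(self._chunks).strip()
            if text:
                self.paragraphs.append((text, self._from_template))
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested inline tags (links, spans) is kept as well.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html, skip_templates=True):
    """Return paragraph texts, optionally dropping template-generated ones."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [
        text
        for text, from_template in parser.paragraphs
        if not (skip_templates and from_template)
    ]


# Made-up sample input, for illustration only.
sample = (
    '<p>A plain sentence with a <a href="./Link">link</a>.</p>'
    '<p about="#mwt3">This article is a stub.</p>'
    "<table><tr><td>Key</td><td>Value</td></tr></table>"
)
print(extract_paragraphs(sample))
```

This only inspects the about attribute on the <p> element itself, so a paragraph nested inside a template-generated container without its own about attribute would not be filtered. Text in tables is ignored because it is outside any <p>.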
created branch 32-add-functions-to-extract-plaintexts-to-library to address this issue
In GitLab by @martingerlach on Aug 18, 2022, 14:00
mentioned in commit d48e18f787088ea8afd6d0d9b2ed0677c10300d1
In GitLab by @appledora on Jul 12, 2022, 15:45