GOLEM-lab / fandom-wiki

Extraction of structured and unstructured information from fandom.com pages

Implement selective template parsing for efficient wikitext parsing #1

Closed txetxedeletxe closed 1 year ago

txetxedeletxe commented 1 year ago

As of today, Wikitext template parsing works as follows:

  1. All Wikitext templates (even nested instances) are identified and extracted.
  2. Templates are further parsed into template name and parameters.
  3. Templates are filtered by name.
  4. Parameters of the filtered templates are further filtered.

Step 1 is very time-consuming on long, complex documents, so it needs to be combined with step 3: filtering by name while templates are being identified reduces the amount of work done.
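
A minimal sketch of what combining steps 1 and 3 could look like. It is not the code from this repository: the function names (`find_templates`, `match_close`, `split_top_level`, `parse_params`) are hypothetical, and it assumes templates are delimited by plain `{{` and `}}` (no `{{{triple-brace}}}` arguments or `<nowiki>` sections). The idea is to read only the template name at each `{{`, and to pay for brace matching and parameter parsing only when that name is wanted.

```python
def match_close(text, start):
    """Return the index of the '}}' closing the template opened at `start`, or -1."""
    depth, i = 0, start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i - 2
        else:
            i += 1
    return -1


def split_top_level(body):
    """Split a template body on '|' that is not inside nested {{..}} or [[..]]."""
    parts, depth, last, i = [], 0, 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append(body[last:i])
            last = i + 1
            i += 1
        else:
            i += 1
    parts.append(body[last:])
    return parts


def parse_params(parts):
    """Turn 'key=value' parts into a dict; positional parts get keys '1', '2', ..."""
    params, positional = {}, 0
    for part in parts:
        key, sep, value = part.partition("=")
        # Treat as named only if the '=' is not inside a nested template or link.
        if sep and "{{" not in key and "[[" not in key:
            params[key.strip()] = value.strip()
        else:
            positional += 1
            params[str(positional)] = part.strip()
    return params


def find_templates(text, wanted):
    """Yield (name, params) only for templates whose name is in `wanted`."""
    wanted = {w.strip().lower() for w in wanted}
    pos = 0
    while True:
        start = text.find("{{", pos)
        if start == -1:
            return
        # Cheap step: read just the name, up to the first '|' or '}}'.
        i = start + 2
        while i < len(text) and text[i] != "|" and not text.startswith("}}", i):
            i += 1
        name = text[start + 2:i].strip()
        # Always resume just past the opening braces, so wanted templates
        # nested inside skipped (or matched) ones are still visited.
        pos = start + 2
        if name.lower() not in wanted:
            continue
        end = match_close(text, start)
        if end == -1:
            continue  # unbalanced braces: skip this occurrence
        # Expensive step, done only for wanted names.
        yield name, parse_params(split_top_level(text[start + 2:end])[1:])
```

With this shape, step 4 would reduce to filtering the keys of each returned dictionary, for example `{k: v for k, v in params.items() if k in wanted_params}`, where `wanted_params` is again a hypothetical name.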

txetxedeletxe commented 1 year ago

Implemented this enhancement, together with some additional improvements, in commit 0ac9a536115bf3fdd21d54ef66471ba953aee24a