InvisiblePlatform / rosetta

Invisible Voice Version 3.0 The Reckoning

Sane language-agnostic approach to filtering and info gathering from wikipedia/wikicard #100

Closed ixt closed 9 months ago

ixt commented 1 year ago

This issue will track progress on extracting and filtering data from Wikipedia info cards and from Wikipedia article text, replacing/succeeding #75 and #74.

We could likely extract data from DBpedia to cover most of the Wikipedia info card information, but this isn't super reliable, and we would still have the issue of translating tonnes of labels (if a source isn't found to generate them). Ideally we should produce something that filters based on the structure of the content (e.g. if a section is just links to other Wikipedia articles and is not in a chart, it's likely a "See also" or similar). Stripping some junk like removal warnings could hopefully be done by selectors, but there's much more to consider than previously thought.
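The structural heuristic above (a section whose visible text is mostly internal links, with no table, is probably a "See also"-style link list) could be sketched roughly like this; the 0.8 threshold and the stdlib parser are my assumptions, not anything the project has settled on:

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Tallies total visible text, text inside <a> tags, and table presence."""
    def __init__(self):
        super().__init__()
        self.total = 0       # chars of visible text overall
        self.linked = 0      # chars of visible text inside links
        self.in_link = 0     # nesting depth of <a> tags
        self.has_table = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1
        elif tag == "table":
            self.has_table = True

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self.in_link:
            self.linked += n

def looks_like_link_list(section_html, threshold=0.8):
    """True if most visible text sits inside links and there is no chart/table."""
    p = LinkDensityParser()
    p.feed(section_html)
    if p.has_table or p.total == 0:
        return False
    return p.linked / p.total >= threshold
```

This stays language-agnostic in the sense that it keys on markup shape rather than on label strings, so it needs no per-language translation tables.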

ixt commented 1 year ago

Theoretically we could switch to DBpedia's datasets for a lot of parts; it is far larger than Wikidata and will need some planning prior to ingestion. Marvin Bot should be good enough for this ticket, but the bigger ones include other sets outside the Wikimedia Foundation's control, including DNB #70
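For quick experiments before committing to full dataset ingestion, DBpedia's public SPARQL endpoint can serve the infobox-derived facts for a single resource. A minimal sketch, assuming the standard `https://dbpedia.org/sparql` endpoint and filtering to `dbo:` (ontology) properties; an actual pipeline might ingest Databus dumps instead:

```python
from urllib.parse import urlencode

# Public DBpedia SPARQL endpoint (assumption: we query live rather than ingest dumps)
DBPEDIA_SPARQL = "https://dbpedia.org/sparql"

def infobox_query(resource):
    """SPARQL selecting all dbo: property/value pairs for a resource —
    roughly the cleaned infobox facts DBpedia extracts from Wikipedia."""
    return (
        "SELECT ?p ?o WHERE { "
        f"<http://dbpedia.org/resource/{resource}> ?p ?o . "
        "FILTER(STRSTARTS(STR(?p), 'http://dbpedia.org/ontology/')) }"
    )

def request_url(resource, fmt="application/sparql-results+json"):
    """Build a GET URL for the endpoint; fetch it with urllib when online."""
    return DBPEDIA_SPARQL + "?" + urlencode(
        {"query": infobox_query(resource), "format": fmt}
    )
```

Because `dbo:` properties are language-independent URIs, this sidesteps some of the label-translation problem, though literal values still come back per-language.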

ixt commented 1 year ago

https://databus.dbpedia.org/

ixt commented 1 year ago

Turns out the DNB referenced for DBpedia is likely not the same DNB as previously seen :/