common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

WikiExtractor doesn't extract text for bn, hi #175

Closed arijitx closed 2 years ago

arijitx commented 2 years ago

Hi, I tried the WikiExtractor on the Wikisource dumps for bn, hi and es. For bn and hi it doesn't work: it extracts only one or two words per page, for example:

{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}

While for es it seems to be working.
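To gauge how many pages are affected, a minimal sketch like the one below can flag records whose `text` field holds nothing but the title and a `<pages .../>` tag, as in the bn record above. It assumes the extractor output is JSON with `title` and `text` keys, as in that sample. The helper name `is_transclusion_only` is hypothetical and not part of WikiExtractor. That the `<pages>` tag pulls the real text in from another page is an assumption based on the sample, not something confirmed in this thread.

```python
import json
import re

# Matches a self-closing <pages ... /> tag like the one in the sample record.
PAGES_TAG = re.compile(r"<pages\b[^>]*/>", re.IGNORECASE)


def is_transclusion_only(record: dict) -> bool:
    """Return True if the text is just the title plus <pages/> tag(s).

    Hypothetical helper: assumes `record` has "title" and "text" keys,
    as in the sample output shown in this issue.
    """
    text = record.get("text", "")
    if PAGES_TAG.search(text) is None:
        return False
    # Remove the tags, then the leading title, and see whether anything is left.
    remainder = PAGES_TAG.sub("", text)
    title = record.get("title", "")
    if title:
        remainder = remainder.replace(title, "", 1)
    return remainder.strip() == ""


# The bn record from this issue, kept as a raw JSON string.
sample = r'{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}'

print(is_transclusion_only(json.loads(sample)))
```

Running the same check over every line of the extractor output for bn, hi and es would show whether the empty pages are the rule or the exception in each language.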

MichaelKohler commented 2 years ago

Which version of WikiExtractor are you using locally? The extraction uses an older version. Can you update your version and try again locally? If the problem persists on the latest version, I would say the bug should be reported at https://github.com/attardi/wikiextractor/issues. If it works with the latest version, I will need to look into updating the version we use in the extraction process.