PedroArvela / sentido

SENse Through InDuctiOn

Scrape Portuguese Wikipedia #4

Closed PedroArvela closed 6 years ago

PedroArvela commented 6 years ago

In gitlab by @PedroArvela on Feb 20, 2017, 15:22

Extract articles from Portuguese Wikipedia.

Split articles into paragraphs, using the following as the separator:

 . fim-de-parágrafo . 
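
A minimal sketch of the splitting step, assuming each article is already available as plain text and that blank lines mark paragraph boundaries (neither is stated in this issue). The names `split_paragraphs` and `join_with_separator` are hypothetical, and the leading and trailing spaces in the separator are kept exactly as written above.

```python
import re

# Separator copied from the issue text, including its surrounding spaces.
SEPARATOR = " . fim-de-parágrafo . "


def split_paragraphs(article_text):
    """Split plain article text into paragraphs.

    Assumption: one or more blank lines separate paragraphs.
    """
    chunks = re.split(r"\n\s*\n", article_text)
    # Collapse line breaks inside a paragraph and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def join_with_separator(paragraphs):
    """Join paragraphs into one line using the separator token."""
    return SEPARATOR.join(paragraphs)


if __name__ == "__main__":
    text = "Primeiro parágrafo.\n\nSegundo\nparágrafo.\n"
    print(join_with_separator(split_paragraphs(text)))
```

Keeping the token in a single constant means the same value can be reused for any other corpus that has to be brought into this format.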
PedroArvela commented 6 years ago

In gitlab by @PedroArvela on Feb 23, 2017, 16:01

CETEMPúblico also has paragraph separations in their XML format.

PedroArvela commented 6 years ago

In gitlab by @PedroArvela on Feb 23, 2017, 16:01

Using the 20170201 snapshot of Wikipedia.
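
For reference, a small sketch of how the dump location for that snapshot could be built. It assumes the standard Wikimedia dump naming scheme and the `pages-articles` file; the issue does not say which file was actually used, and a snapshot this old may no longer be hosted on the main dump server. The helper `dump_url` is hypothetical.

```python
# Assumption: the usual Wikimedia layout, <wiki>/<date>/<wiki>-<date>-pages-articles.xml.bz2
DUMP_HOST = "https://dumps.wikimedia.org"


def dump_url(wiki, snapshot):
    """Build the pages-articles dump URL for a wiki and a snapshot date."""
    return f"{DUMP_HOST}/{wiki}/{snapshot}/{wiki}-{snapshot}-pages-articles.xml.bz2"


if __name__ == "__main__":
    # ptwiki is the database name of Portuguese Wikipedia.
    print(dump_url("ptwiki", "20170201"))
```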