chrisbra / wikipedia2text

A commandline tool for querying the Wikipedia

Marker splitting seems broken, at least in English or French #10

Open JulienPalard opened 1 month ago

JulienPalard commented 1 month ago

Currently I'm getting almost the same output from lynx -dump and wp2text (only the link numbers differ):

$ /usr/bin/lynx -dump http://en.wikipedia.org/wiki/Poitiers | head
   #[1]alternate [2]Edit this page [3]Wikipedia (en) [4]Wikipedia Atom
   feed

   [5]Jump to content

   [ ] Main menu
   Main menu
   (BUTTON) move to sidebar (BUTTON) hide
   Navigation
     * [6]Main page

$ wikipedia2text -l en Poitiers | head
   #alternate Edit this page Wikipedia (en) Wikipedia Atom
   feed

   Jump to content

   [ ] Main menu
   Main menu
   (BUTTON) move to sidebar (BUTTON) hide
   Navigation
     * Main page

It feels like the markers are not up to date.

Maybe it would be more convenient to use the Wikipedia API, something like:

curl 'https://fr.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Poitiers&format=json' | jq -r '.query.pages[.query.pages|keys[0]].extract' | html2text -utf8 | less

would look good to me.
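
For illustration, the same API call can be sketched in Python with only the standard library. This is a rough sketch under assumptions, not how wikipedia2text works today: the function names and the User-Agent string are made up for this example, and I'm assuming the `explaintext` and `redirects` parameters are wanted so that the API returns plain text directly and html2text is no longer needed.

```python
import json
import urllib.parse
import urllib.request


def build_extract_url(title, lang="en"):
    """Build an API URL asking for the extract of `title` (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # assumption: plain text instead of HTML
        "redirects": 1,    # assumption: follow redirects to the target page
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)


def extract_from_response(payload):
    """Return the extract of the first page in the response, or None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        # A missing page has no "extract" key, so this yields None.
        return page.get("extract")
    return None


def fetch_extract(title, lang="en"):
    """Download and return the extract for `title` (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(title, lang),
        # Illustrative User-Agent; a real tool should identify itself properly.
        headers={"User-Agent": "wikipedia2text-sketch/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_from_response(json.load(response))
```

Calling `fetch_extract("Poitiers", lang="fr")` should then return the article text as a string, or `None` if the page does not exist.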

chrisbra commented 1 month ago

can you create a PR please?