dijs / wiki

Wikipedia Interface for Node.js
MIT License

fixes article section headers regex (thanks to Wiktor Stribiżew from SO) #112

Closed mYnDstrEAm closed 5 years ago

mYnDstrEAm commented 5 years ago

I fixed the regex pattern for parsing article section headers with the help of Wiktor Stribiżew.

Previously it could not parse headings that contain certain special characters, such as commas.
Newlines are still excluded because a newline breaks the wiki syntax for a section header.
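
To illustrate the idea (this is not the pattern from this PR, just a minimal sketch): the `HEADING_PATTERN` and `findHeadings` names below are hypothetical, and the sketch assumes headers appear in the plain-text content as `== Title ==` on a single line, with `level` 0 meaning a top-level section.

```javascript
// Hypothetical pattern, NOT the one merged in this PR. It accepts any
// character except a newline inside the title, so commas etc. are fine.
const HEADING_PATTERN = /^(={2,6})[ \t]*([^\n]+?)[ \t]*\1[ \t]*$/gm

function findHeadings(text) {
  const headings = []
  for (const match of text.matchAll(HEADING_PATTERN)) {
    // Assumption: "==" is level 0, "===" is level 1, and so on.
    headings.push({ level: match[1].length - 2, title: match[2] })
  }
  return headings
}

const sample = [
  'Intro text.',
  '== Publication history ==',
  'Some text.',
  '=== Golden Age, 1939-1956 ===',
  'More text.'
].join('\n')

console.log(findHeadings(sample))
```

The backreference `\1` requires the closing run of `=` to match the opening one, and because the title cannot contain `\n`, a header broken across two lines is simply not matched.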

The documentation should also include a short demo of how to get and use the sections. I didn't add one because there is no folder for v5.0 yet.

I use it basically like this:

// Assumes the package is installed as `wikijs`
const wiki = require('wikijs').default

wiki({ apiUrl: 'https://en.wikipedia.org/w/api.php' })
    .page('Batman')
    .then(page => page.content())
    .then(content => {
      // Keep only the sections whose title mentions "history"
      const sections = content.filter(c =>
        c.title.toLowerCase().includes('history')
      )
      console.log(sections)
    })
    .catch(console.error)

The section parsing needs further improvement so that you can easily specify, for example, the parent section and the level(s) of the sections you're interested in. To make that possible, I would suggest not deleting section.level etc. in parseContent.
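
A rough sketch of what that could look like if `level` were kept on each section. The `subsectionsOf` helper and the flat, document-ordered list of `{ title, level, content }` objects are assumptions for illustration only, not the library's current output:

```javascript
// Hypothetical helper: returns the sections nested under `parentTitle`,
// at most `maxDepth` levels below it. Assumes a flat list in document order.
function subsectionsOf(sections, parentTitle, maxDepth = Infinity) {
  const start = sections.findIndex(s => s.title === parentTitle)
  if (start === -1) return []
  const parentLevel = sections[start].level
  const result = []
  for (const section of sections.slice(start + 1)) {
    if (section.level <= parentLevel) break // left the parent's subtree
    if (section.level - parentLevel <= maxDepth) result.push(section)
  }
  return result
}

// Made-up sample data, only to show the shape assumed above.
const sections = [
  { title: 'Publication history', level: 0, content: '...' },
  { title: 'Creation', level: 1, content: '...' },
  { title: 'Early years', level: 2, content: '...' },
  { title: 'Characterization', level: 0, content: '...' }
]

console.log(subsectionsOf(sections, 'Publication history', 1).map(s => s.title))
```

With `maxDepth` set to 1 only the direct children of the parent are returned; leaving it out returns the whole subtree.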

Also, this doesn't yet check for <code> tags, which can be found in the article "Relational operator".

dijs commented 5 years ago

Thank you. I do not think the test failures are because of this change. I will fix them and get this in soon.