Open lauralorenz opened 8 years ago
So after some research it looks like the page.sections function does not work in the wiki API. We are also at a disadvantage because you can return a specific section, but you cannot say 'give me all sections but section X', which is a big pain in the patoot.
A potential solution is to use a regex to rid the corpus of the last section. == HEADING NAME == is the pattern we will need to search for, removing everything after it.
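A minimal sketch of that regex idea, assuming the page text we get back is plain text in which each top-level section starts with a `== HEADING NAME ==` line. The function name and the heading list are made up for illustration and are not part of any library:

```python
import re

# Headings to cut at (taken from this issue). Matching is case-insensitive.
UNWANTED_HEADINGS = ("Notes", "References", "External links")

# Matches a level-2 heading line such as "== References ==" on its own line.
# "=== Notes ===" (a subsection) deliberately does not match.
_CUT_PATTERN = re.compile(
    r"^==[ \t]*(?:"
    + "|".join(re.escape(h) for h in UNWANTED_HEADINGS)
    + r")[ \t]*==[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def truncate_at_unwanted_heading(text: str) -> str:
    """Return the text up to (not including) the first unwanted heading."""
    match = _CUT_PATTERN.search(text)
    if match is None:
        return text  # nothing to cut
    return text[: match.start()].rstrip()
```

This only works if the unwanted sections sit at the end of the page. Any valuable section that comes after the first match is thrown away with the rest.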
There are some sections common to Wikipedia pages that are not valuable content to us because they contain primarily links or otherwise information that is not contextual human language. In particular the Notes, References, and External Links sections that are optionally included in Wikipedia page objects are not valuable to us. This issue is closed when our Wikipedia ingestion script returns all sections EXCEPT the Notes, References, and External Links sections when pulling data. You can see information on a Wikipedia Page object's `sections` property and `section()` method here for an idea of how to filter this information with the Python `wikipedia` API.
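One way the "all sections EXCEPT" behaviour could be sketched without relying on the `sections` property is to split the plain-text page content on its heading lines and drop the unwanted ones by title. This assumes the content marks top-level sections with `== Title ==` lines; `drop_sections` and `EXCLUDED_SECTIONS` are hypothetical names, not part of the `wikipedia` package:

```python
import re

# Section titles to drop, lower-cased for case-insensitive comparison.
EXCLUDED_SECTIONS = {"notes", "references", "external links"}

# Captures the title of a level-2 heading line, e.g. "== History ==".
# The (?!=) lookahead keeps "=== Subsection ===" lines from matching.
_HEADING = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", re.MULTILINE)


def drop_sections(text: str, excluded=EXCLUDED_SECTIONS) -> str:
    """Return text with every excluded top-level section removed."""
    # With one capture group, split() yields:
    # [lead, title1, body1, title2, body2, ...]
    parts = _HEADING.split(text)
    kept = [parts[0]]  # the lead text before the first heading
    for title, body in zip(parts[1::2], parts[2::2]):
        if title.strip().lower() not in excluded:
            kept.append(f"== {title} =={body}")
    return "".join(kept).strip()
```

Unlike cutting at the first unwanted heading, this keeps any wanted section that happens to come after an excluded one. Subsections of an excluded section are dropped with their parent.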