ayota / ddl_nlp

Repo for DDL research lab project.

Make wikipedia ingestion ignore Notes/References and External Links sections #33

Open lauralorenz opened 8 years ago

lauralorenz commented 8 years ago

Some sections common to Wikipedia pages are not valuable content to us because they contain primarily links or other information that is not contextual human language. In particular, the Notes, References, and External Links sections that are optionally included in Wikipedia page objects are not valuable to us. This issue is closed when our Wikipedia ingestion script returns all sections EXCEPT the Notes, References, and External Links sections when pulling data. See the documentation for a Wikipedia Page object's sections property and section() method for an idea of how to filter this information with the Python wikipedia API.
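A minimal sketch of the filtering idea described above, assuming a page object that exposes a sections list of titles and a section(title) method returning that section's text (or None). The function name wanted_sections and the EXCLUDED set are hypothetical, and whether sections is actually populated for a given page is not confirmed here.

```python
# Section titles to skip, taken from the issue description.
EXCLUDED = {"notes", "references", "external links"}


def wanted_sections(page, excluded=EXCLUDED):
    """Return {title: text} for every section whose title is not excluded.

    `page` is assumed to expose `sections` (a list of section titles) and
    `section(title)` (the text of that section, or None if it is missing).
    """
    result = {}
    for title in page.sections:
        if title.strip().lower() in excluded:
            continue  # skip Notes / References / External links
        text = page.section(title)
        if text:  # ignore empty or missing sections
            result[title] = text
    return result
```

With the python wikipedia package this would be called as wanted_sections(wikipedia.page("Some title")), which needs network access.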

dvetal commented 8 years ago

So after some research, it looks like the page.sections function does not work in the wiki API. We are also at a disadvantage: you can return a specific section, but you cannot say 'give me all sections but section X', which is a big pain in the patoot.

dvetal commented 8 years ago

A potential solution is to use a regex to rid the corpus of the last section. == HEADING NAME == is the pattern we will need to search for, and then remove everything after it.
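A rough sketch of the regex idea, assuming the page text marks top-level sections with lines of the form == HEADING NAME ==. This is a variant of the proposal: rather than cutting everything after one heading, it drops each excluded section up to the next top-level heading, so sections such as "See also" that follow "References" are kept. The names strip_sections, EXCLUDED, and HEADING are hypothetical.

```python
import re

# Section titles to drop, taken from the issue description.
EXCLUDED = {"notes", "references", "external links"}

# Matches a top-level heading line such as "== References ==".
# The (?!=) lookaheads keep "=== Subsection ===" lines from matching.
HEADING = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==(?!=)[ \t]*$", re.MULTILINE)


def strip_sections(content, excluded=EXCLUDED):
    """Return `content` without the excluded top-level sections."""
    kept = []
    pos = 0           # start of the chunk currently being considered
    skipping = False  # True while inside an excluded section
    for match in HEADING.finditer(content):
        if not skipping:
            kept.append(content[pos:match.start()])
        pos = match.start()
        skipping = match.group(1).strip().lower() in excluded
    if not skipping:
        kept.append(content[pos:])
    return "".join(kept).rstrip() + "\n"
```

Subsections (=== ... ===) nested under an excluded section are removed along with it, because only top-level headings end the skipped span.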