chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

added a named param to records() to allow stripping of section headings #224

Closed: ckot closed this 5 years ago

ckot commented 5 years ago

Rather than requiring the user to parse section headings out of the extracted page text, I added a keep_section_headings named param (default True) to records().

Description

Section headings are currently kept in the page text, requiring the user to manually strip them out.

Motivation and Context

Section headings remain in the page text, delimited by newlines. Unfortunately, once the markup is gone, it takes non-trivial logic to decide whether a given line is a section heading or simply a short sentence.
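To illustrate the ambiguity, here is a minimal sketch, not textacy's actual implementation. It assumes headings use the usual wiki markup form (== Heading ==) and shows that dropping them is easy while the markup is still present but ambiguous afterwards. The function name iter_text_lines and the sample text are hypothetical. The include_headers name follows the parameter name settled on later in this thread.

```python
import re
from typing import Iterator

# Matches a wiki-markup heading line such as "== History ==".
# Assumption: headings are written with 2-6 balanced "=" markers.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def iter_text_lines(wikitext: str, include_headers: bool = True) -> Iterator[str]:
    """Yield non-empty text lines, optionally dropping section headings.

    Hypothetical helper for illustration only.
    """
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            if include_headers:
                # Emit the heading text without its "==" markers.
                yield match.group(2)
            continue
        yield line


sample = """Intro sentence.

== History ==
It began.

=== Early years ===
Short one.
"""

with_headers = list(iter_text_lines(sample))
without_headers = list(iter_text_lines(sample, include_headers=False))
```

With include_headers left at its default, "Early years" and "Short one." come out as two equally short lines with nothing to tell them apart, which is the problem described above. Passing include_headers=False drops the headings while they can still be identified reliably.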

How Has This Been Tested?

I simply saved the page text to files and verified that the section headings (and page title) aren't present when I pass False for this parameter. Everything else is unchanged.

bdewilde commented 5 years ago

Hey @ckot , I see your new issues. I'd like to close out existing PRs before opening related new ones, if only to keep things neat. Would you be able to make the changes I mentioned above? If not, I'm happy to do this myself — it's a small change :) — but didn't want to take the credit for your good work. Let me know!

ckot commented 5 years ago

I changed the named param to 'include_headers' as you requested, and tested it via my own application that uses this code, since there is no unit test for it.

I agree about getting this PR merged before I push out any more. Although all my PRs have been quite simple, it's also easier for me not to have multiple outstanding PRs, which means lots of branches on my fork that I need to keep in sync.