Closed ckot closed 5 years ago
Hey @ckot , I see your new issues. I'd like to close out existing PRs before opening related new ones, if only to keep things neat. Would you be able to make the changes I mentioned above? If not, I'm happy to do this myself — it's a small change :) — but didn't want to take the credit for your good work. Let me know!
I changed the named param to 'include_headers' as you requested, and tested it via my own application, which uses it, since this has no unit test.
I agree about wanting to push this PR before I push out any more. Although all my PRs have been quite simple, it's easier for me as well not to have multiple outstanding PRs, which would mean lots of branches on my fork that I need to keep in sync.
Rather than requiring the user to parse out section headings from the extracted page text, I added a keep_section_headings named param (default True) to records()
Description
Section headings are currently kept in the page text, requiring the user to manually strip them out.
Motivation and Context
Section headings remain in the page text, delimited by newlines. Unfortunately, it takes intelligence to determine whether a given short line is a section heading or simply a short sentence.
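To illustrate the idea, here is a minimal sketch, not the code from this PR. It assumes a hypothetical parsed-page structure of (kind, text) pairs; the helper page_text() and this simplified records() are stand-ins invented for the example, and only the include_headers name comes from the discussion above. While the parsed structure is still available, dropping headings is a simple filter, whereas after flattening to text they can only be guessed at.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical parsed page: (kind, text) pairs, where kind is
# "heading" or "paragraph". The real parser's structure may differ.
Node = Tuple[str, str]


def page_text(nodes: Iterable[Node], include_headers: bool = True) -> str:
    """Join node texts, optionally skipping heading nodes."""
    parts = [text for kind, text in nodes
             if include_headers or kind != "heading"]
    return "\n\n".join(parts)


def records(pages: Iterable[Tuple[str, List[Node]]],
            include_headers: bool = True) -> Iterator[Dict[str, str]]:
    """Yield one record per page (simplified stand-in for records())."""
    for title, nodes in pages:
        yield {"title": title, "text": page_text(nodes, include_headers)}


# Made-up sample data for the example.
pages = [
    ("Cats", [
        ("heading", "Cats"),
        ("paragraph", "Cats are small."),
        ("heading", "Diet"),
        ("paragraph", "They eat meat."),
    ]),
]

with_headers = next(records(pages))
without_headers = next(records(pages, include_headers=False))
```

With the default, the text keeps "Cats" and "Diet" as their own short lines; with include_headers=False only the two paragraphs remain.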
How Has This Been Tested?
I simply saved the pages' text to files and verified that the section headings (and page title) aren't present in the page text when I pass False to this parameter. Everything else is the same.