Closed by bpdesigns 4 years ago
On the PDF side, pdfminer appends a newline whenever a span ends, which often means newlines land in the middle of a sentence. When that happens, I think it's simpler to do text.strip().replace('\n', '') and move on. But sometimes PDFs emit a NO-BREAK SPACE (U+00A0), which throws off a lot of textual analysis. There are likely other Unicode sequences lurking about.
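For reference, here is a minimal sketch of that cleanup using only the Python standard library. The function name normalize_extracted_text and the sample string are hypothetical, not taken from our codebase. It replaces line breaks with a space instead of an empty string, since an empty string would fuse the last word of one line with the first word of the next. It also assumes NFKC normalization is acceptable for findings text: NFKC turns U+00A0 into a plain space, but it rewrites other compatibility characters too (ligatures, superscripts), which may not always be wanted.

```python
import re
import unicodedata


def normalize_extracted_text(text: str) -> str:
    """Flatten PDF-extracted text into a single line (hypothetical helper)."""
    # NFKC folds compatibility characters; NO-BREAK SPACE (U+00A0)
    # becomes an ordinary space (U+0020).
    text = unicodedata.normalize("NFKC", text)
    # Collapse every run of whitespace, newlines included, into one space,
    # then trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample: a no-break space, a mid-sentence newline, a double space.
sample = "Finding 2019-001:\u00a0The auditee did\nnot reconcile  accounts.\n"
print(normalize_extracted_text(sample))
```

This does not rejoin words that were hyphenated across a line break, and it flattens real paragraph breaks along with the spurious ones, so it is only a starting point.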
Curious to see if SF-SAC text has similar issues.
@cantsin what questions do you have for the FAC? I will try again to get in touch with the lead of that team to see if they have any answers.
No questions. I think we're stuck with what we have, unfortunately, wrt findings text spacing, as these are artifacts of the PDF text extraction process. We can close.
Thanks @cantsin
Problem
When extracting findings text from a PDF or from an SF-SAC (the FAC data collection form), odd line breaks appear in the middle of the text.
User story
As a grants manager or auditor, I want to be able to copy and paste findings text with proper formatting so I don't have to manually retype the data into the agency audit systems.
Hypothesis
The fix for findings text formatting from the PDF and from the SF-SAC might be similar.
Definition of done
@danielnaab and @cantsin let's use this issue to post what we learn about fixing formatting in the text to see if there is a common solution.