18F / federal-grant-reporting

Improving the experience of federal grant reporting.

Explore fixing anomalies in the findings text spacing #194

Closed bpdesigns closed 4 years ago

bpdesigns commented 4 years ago

Problem

When extracting findings text from a PDF or from an SF-SAC (the FAC data collection form), the extracted text contains stray line breaks.

User story

As a grants manager or auditor, I want to be able to copy and paste findings text with proper formatting so I don't have to manually retype the data into the agency audit systems.

Hypothesis

The fix for findings text formatting from the PDF and the fix for text from the SF-SAC might be similar.

Definition of done

@danielnaab and @cantsin let's use this issue to post what we learn about fixing formatting in the text to see if there is a common solution.

cantsin commented 4 years ago

On the PDF side, pdfminer appends newlines when the span ends (which oftentimes means newlines inserted in the middle of a sentence). When that happens, I think it's simpler to do text.strip().replace('\n', '') and move on. But sometimes PDFs emit a NO-BREAK SPACE (U+00A0), which throws off a lot of textual analysis. Likely there are other Unicode sequences lurking about.
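A minimal sketch of that kind of cleanup, assuming the extracted text is already available as a Python string. The function name `normalize_findings_text` is hypothetical and is not part of pdfminer or this repository. Note that it swaps in a single space where the one-liner above uses an empty string, on the assumption that a mid-sentence line break usually falls between two words; it also flattens paragraph breaks, which may or may not be wanted.

```python
import re
import unicodedata


def normalize_findings_text(text: str) -> str:
    """Collapse common extraction artifacts in findings text (hypothetical helper)."""
    # NFKC folds compatibility characters into their plain equivalents;
    # in particular, NO-BREAK SPACE (U+00A0) becomes an ordinary space.
    text = unicodedata.normalize("NFKC", text)
    # Replace every run of whitespace, including newlines, with one space.
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing whitespace left over from the span boundaries.
    return text.strip()


if __name__ == "__main__":
    raw = "The auditee did\nnot\u00a0comply.\n"
    print(normalize_findings_text(raw))
```

NFKC also rewrites other compatibility characters (ligatures such as "ﬁ" become "fi", for example), so it is worth checking a sample of real findings text before relying on it.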

Curious to see if SF-SAC text has similar issues.

bpdesigns commented 4 years ago

@cantsin what questions do you have for the FAC? I will try again to get in touch with the lead of that team to see if they have any answers.

cantsin commented 4 years ago

No questions. I think we're stuck with what we have, unfortunately, wrt findings text spacing, as these are artifacts of the PDF text extraction process. We can close.

bpdesigns commented 4 years ago

Thanks @cantsin