Text extraction results in unexpected line breaks

The current method of extracting plain text from documents inserts incorrect line breaks into the text:

[Screenshot: screenshot_2019-02-22 openly]

The 'added' line breaks above do not actually exist in the document.

The problem is that we currently extract plain text in two steps: first Apache Tika's HTML extraction, then Nokogiri's HTML-to-text conversion. We do this because Tika's own text extraction includes unwanted artifacts, such as [image: ...] and [bookmark: ...]. See TIKA-2755.

The solution is to use the Tika server's /rmeta/text endpoint (as opposed to /tika) and parse the JSON response. The plain text content is in the "X-TIKA:content" field.
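A minimal sketch of that request, using only Ruby's standard library. The method names, the `TIKA_URL` constant, and the default port are assumptions for illustration and not existing code in this repository; the sketch also assumes the first entry of the JSON array describes the container document.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Assumption: a Tika server running locally on its default port.
TIKA_URL = 'http://localhost:9998'

# Pull the plain text out of an /rmeta/text JSON response body.
# The response is an array of metadata hashes; the first entry is
# assumed to describe the container document.
def plain_text_from_rmeta(json_body)
  entries = JSON.parse(json_body)
  (entries.first || {}).fetch('X-TIKA:content', '')
end

# PUT the file to /rmeta/text and return the extracted plain text.
def extract_plain_text(path, tika_url: TIKA_URL)
  uri = URI.join(tika_url, '/rmeta/text')
  request = Net::HTTP::Put.new(uri)
  request['Accept'] = 'application/json'
  request.body = File.binread(path)

  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  raise "Tika returned #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  plain_text_from_rmeta(response.body)
end
```

Keeping the JSON parsing separate from the HTTP call makes it possible to test the parsing without a running Tika server.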

This field starts with a number of line breaks that are not part of the document. These need to be stripped. One approach may be to convert the document to HTML with Tika, count the line breaks outside of the document body, and then remove that many line breaks from the start of the plain text. That way, only the non-document line breaks are removed. Refer to my comment in TIKA-2755 for context.
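A rough sketch of that stripping idea, under the untested assumption that the line breaks before the opening `<body>` tag in Tika's HTML output match the extra leading line breaks in the plain text. Both method names are hypothetical, and the sketch uses plain string matching rather than a real HTML parser.

```ruby
# Count the line breaks that appear before the opening <body> tag.
# Returns 0 when no <body> tag is found.
def line_breaks_before_body(html)
  head = html[/\A.*?(?=<body[\s>])/mi] || ''
  head.count("\n")
end

# Remove at most `count` line breaks from the start of the text,
# leaving any further leading line breaks (assumed to be real) intact.
def strip_leading_line_breaks(text, count)
  text.sub(/\A\n{0,#{Integer(count)}}/, '')
end
```

For example, if the HTML has four line breaks before `<body>` and the plain text starts with five, `strip_leading_line_breaks(text, line_breaks_before_body(html))` would remove four and keep the fifth as part of the document.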

OpenlyOne / openly-rails #342