Closed FinnWoelm closed 5 years ago
The new `/rmeta/text` endpoint works, but we need to `#strip` the text of any starting or ending line breaks. So we still have the problem of losing any starting or ending line breaks that are actually part of the document body.
One approach may be to also get the text via the `/tika` endpoint, count its beginning and ending line breaks, and re-add those to the text extracted via `/rmeta/text`.
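A minimal sketch of that idea, assuming both extractions are already available as Ruby strings and that line breaks are plain `\n` characters. The method names `edge_line_breaks` and `restore_edge_line_breaks` are hypothetical, and whether the `/tika` output's edge line breaks really match the document body is still an open question:

```ruby
# Count the line breaks at the very start and very end of a string.
# Returns [leading, trailing]. A string made only of line breaks is
# counted as leading, so the same characters are not counted twice.
def edge_line_breaks(text)
  leading = text[/\A\n*/].length
  rest = text[leading..-1]
  trailing = rest[/\n*\z/].length
  [leading, trailing]
end

# Re-add the edge line breaks found in reference_text (e.g. the /tika output)
# to stripped_text (e.g. the stripped /rmeta/text output).
def restore_edge_line_breaks(stripped_text, reference_text)
  leading, trailing = edge_line_breaks(reference_text)
  ("\n" * leading) + stripped_text + ("\n" * trailing)
end
```

Calling `restore_edge_line_breaks(rmeta_text.strip, tika_text)` would then give the `/rmeta/text` content the same number of leading and trailing line breaks as the `/tika` output.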
The current method of plain text extraction from documents is causing incorrect line breaks to be inserted into text:
The 'added' line breaks above do not actually exist in the document.
The problem is that we're currently extracting plain text by first using Apache Tika's HTML extraction and then using Nokogiri's HTML-to-text method. That's because Tika's text extraction includes unwanted artifacts, such as `[image: ...]` and `[bookmark: ...]`. See TIKA-2755.
The solution to this is to use the Tika server's `/rmeta/text` endpoint (as opposed to `/tika`) and then parse the JSON response. The plain text content is in the `"X-TIKA:content"` field.
This field includes a number of line breaks at the beginning of the document. These need to be stripped. One approach may be to convert the document to HTML with Tika, count the number of line breaks outside of the document body, and then remove that number of line breaks from the plain text. That way, you only remove the non-document line breaks. Refer to my comment in TIKA-2755 for context.