elastic / data-extraction-service

Send files to tikaserver in stream instead of multipart #17

Closed navarone-feekery closed 1 year ago

navarone-feekery commented 1 year ago

Related to https://github.com/elastic/enterprise-search-team/issues/5048

The current iteration of content extraction requires the sender to send each file as a multipart request. This is a problem for large files: because the data is not chunked, the whole file has to be loaded into memory.

These changes switch the proxy endpoints to forward requests to /tika/text instead of rmeta/*, because the rmeta/* endpoints unfortunately require multipart requests. Here is a /tika/text example from docs for reference.
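To illustrate what sending a file as a stream rather than as multipart can look like from a client, here is a minimal Python sketch. It assumes tikaserver is reachable at `localhost:9998` and accepts a `PUT` with a raw, chunked request body on `/tika/text`; the function names, host, port, and chunk size are hypothetical and are not taken from this PR.

```python
import http.client

CHUNK_SIZE = 64 * 1024  # hypothetical chunk size; tune as needed


def iter_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield the file's contents piece by piece so it is never fully in memory."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def extract_text(path, host="localhost", port=9998):
    """Stream a file to /tika/text and return (status, raw response body).

    Host, port, and the Accept header are assumptions for this sketch.
    """
    conn = http.client.HTTPConnection(host, port)
    try:
        with open(path, "rb") as f:
            # With an iterable body and no Content-Length header, http.client
            # sends the request using chunked transfer encoding.
            conn.request(
                "PUT",
                "/tika/text",
                body=iter_chunks(f),
                headers={"Accept": "application/json"},
            )
            response = conn.getresponse()
            return response.status, response.read()
    finally:
        conn.close()
```

Only one chunk is held in memory at a time, which is the property the multipart approach lacked for large files.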

This change has the following effects:

  1. We now get response status codes from tikaserver that can be passed back downstream
  2. We no longer get a verbose error from tikaserver, but that information is logged in /var/log/tikaserver.log, so anyone who wants more detail can check there
  3. The response from tikaserver is no longer an array, so some changes to response parsing were also necessary
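The parsing change in the list above can be sketched roughly as follows. This is an illustration in Python, not the code from this PR: it assumes the JSON body carries the extracted text under an `X-TIKA:content` key, and the function name, the `extracted_text` field, and the error message are hypothetical.

```python
import json


def parse_tika_response(status, body):
    """Turn a tikaserver response into (status, payload) for the caller.

    Assumes the extracted text lives under "X-TIKA:content" (an assumption
    for this sketch) and that non-200 statuses are passed back as-is.
    """
    if status != 200:
        # No verbose error body is expected here; details would be in
        # /var/log/tikaserver.log.
        return status, {"error": f"tikaserver returned status {status}"}

    payload = json.loads(body)
    if isinstance(payload, list):
        # Old rmeta/* shape: an array of objects; take the first one.
        payload = payload[0] if payload else {}

    text = payload.get("X-TIKA:content") or ""
    return status, {"extracted_text": text.strip()}
```

Accepting both shapes is a design choice for the sketch only; it makes the difference between the array response and the single-object response explicit.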

Related Pull Requests

https://github.com/elastic/connectors-python/pull/1158

navarone-feekery commented 1 year ago

@artem-shelkovnikov yes, that's correct: the rmeta endpoints don't support streaming, so I'm using a different endpoint that does. It returns the same extracted content; only the formatting of the response is different. (rmeta also returned a lot of auxiliary information like file type and size, but we aren't using that information, so it's okay to lose it.)