algolia / algoliasearch-zendesk

Integrate Algolia within your Zendesk Help Center in minutes.
https://community.algolia.com/zendesk

Optional Use of Body_Safe #110

Open HurricanePete opened 5 years ago

HurricanePete commented 5 years ago

Hello @Jerska, just wanted to ask about the possibilities of customizing or deactivating the character limit dictated by body_safe. Something like passing an option to algoliasearchZendeskHC for characterLimit or setting it to false in order to store the entire article body and override the default option here: https://github.com/algolia/algoliasearch-zendesk/blob/7ac413ebe83bf88fa1ac4935547899a9055bec35/crawler/item.rb#L103

It was changed here as a bug fix: https://github.com/algolia/algoliasearch-zendesk/blob/master/CHANGELOG.md#2173-2017-10-17

It does look like crawler options come from Algolia (not Zendesk frontend), but would love to get some extra context about why the change was made and what our options are.

We use the search and instant search for a small knowledge base for our application. Since we don't have too many documents to index, we'd like to try and include the entire document body as part of the searches - since currently a lot of the article is not included in these search functions.

Would be happy to open a PR for this if it makes sense and there's not some other reason not to have it.

Jerska commented 5 years ago

Hi @HurricanePete. Thanks for raising the issue.

The reason for this character limit is that Algolia has a size limit for records. It used to be 100KB, but changed over time to now be 10KB, and we need to leave some available room for the other attributes of the article. https://www.algolia.com/doc/faq/basics/is-there-a-size-limit-for-my-index-records/
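To make the truncation concrete, here is a minimal sketch of cutting an article body down to a byte budget without splitting a multi-byte character. It is not the code from `crawler/item.rb`; the function name and the 5,000-byte budget are assumptions chosen for illustration.

```python
# Hypothetical helper, not the integration's actual body_safe implementation.
BODY_BUDGET_BYTES = 5_000  # assumption: leaves room for the other attributes


def truncate_utf8(text: str, max_bytes: int = BODY_BUDGET_BYTES) -> str:
    """Return the longest prefix of `text` whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing bytes may cut a multi-byte character in half;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    body = "é" * 4_000  # 8,000 bytes once encoded
    safe = truncate_utf8(body)
    print(len(safe.encode("utf-8")))
```

Measuring in bytes rather than characters matters because the record limit applies to the stored record, and non-ASCII text takes more than one byte per character.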

If some articles don't show up for a search query because of the size limit, our suggestion is to add relevant keywords to the tags of those articles.

The long-term solution would be https://github.com/algolia/algoliasearch-zendesk/issues/54. The idea is to split each article into one record per paragraph instead of one record per article, and to use distinct at query time so that only one result is returned per article. This consumes more records, but scales to really long documents. As you can see from the creation date of the issue above, this is a topic we haven't tackled in a really long time.
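The paragraph-splitting idea could look roughly like the sketch below. The record shape, the field names (`article_id`, `position`), and the assumption that the body is plain text with blank lines between paragraphs are all hypothetical; the integration's real records use their own schema, and Zendesk bodies are HTML that would need to be converted first. `attributeForDistinct` and `distinct` are Algolia index settings, shown here only as a plain dictionary.

```python
import re
from typing import Dict, List


def split_article(article_id: str, title: str, body: str) -> List[Dict[str, object]]:
    """Build one record per non-empty paragraph of a plain-text article body."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        {
            # objectID must be unique per record, so append the paragraph index.
            "objectID": f"{article_id}-{position}",
            # Shared by every paragraph of the article; used for deduplication.
            "article_id": article_id,
            "title": title,
            "position": position,
            "body": paragraph,
        }
        for position, paragraph in enumerate(paragraphs)
    ]


# Settings that would make Algolia return one hit per article.
# The attribute name "article_id" is an assumption of this sketch.
INDEX_SETTINGS = {"attributeForDistinct": "article_id", "distinct": True}

if __name__ == "__main__":
    for record in split_article("42", "Getting started", "First.\n\nSecond.\n\nThird."):
        print(record["objectID"], record["body"])
```

With records shaped like this, each paragraph stays well under the size limit while the shared `article_id` lets the engine collapse matches back to a single article.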

Our Zendesk integration is currently in maintenance-only mode, and we do not plan to add any new features (which this would be). If you'd be interested in creating a PR for this, I'd be happy to review it, but it requires a bunch of non-trivial changes.

HurricanePete commented 5 years ago

Hello @Jerska - thank you for the reply. Our indexes average about 3kb each, so that shouldn't be a problem. Would you see any potential issues if we disabled the integration and uploaded the full article bodies from our end? We already (effectively) have a crawler in place, this would just be for reindexing on the Algolia side.

Jerska commented 5 years ago

If you're able to do the indexing on your part, by all means feel free to. The requirement is to match the extracted JSON our system indexes.

What I'm not sure I understand is how that would fix the issue. You'll be facing the same limit, and if some records are truncated today by our script, it means you already have articles above 5KB. While there is some room between 5 and 10KB, I guess we can safely assume some of them will be above the limit and fail to be indexed.
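One way to find out ahead of time which records would hit the limit is to measure their serialized size before uploading. The sketch below treats the 10KB figure from this thread as 10,000 bytes and uses the compact JSON encoding as an approximation; how Algolia measures record size exactly is an assumption here, so the result should be read as an estimate.

```python
import json
from typing import Dict, List

RECORD_LIMIT_BYTES = 10_000  # assumption: "10KB" read as 10,000 bytes


def record_size_bytes(record: Dict[str, object]) -> int:
    """Approximate a record's size as the UTF-8 length of its compact JSON form."""
    serialized = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
    return len(serialized.encode("utf-8"))


def oversized(records: List[Dict[str, object]],
              limit: int = RECORD_LIMIT_BYTES) -> List[Dict[str, object]]:
    """Return the records whose estimated size exceeds the limit."""
    return [r for r in records if record_size_bytes(r) > limit]


if __name__ == "__main__":
    records = [
        {"objectID": "1", "body": "short article"},
        {"objectID": "2", "body": "x" * 20_000},
    ]
    for record in oversized(records):
        print(record["objectID"], record_size_bytes(record))
```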

HurricanePete commented 4 years ago

Ah - bit of a mix-up there. I am planning on splitting by paragraph and then using the distinct feature within Algolia. This seems like the best option at this point, since I think you said that the ability to do that through the algoliasearch-zendesk integration hasn't been developed.

Jerska commented 4 years ago

It makes sense.

You are correct that the integration doesn't support this at this point in time. We're open to Pull Requests, so if you want to take our code as a base for the script and submit one, it could be integrated directly in the connector.