blackrez commented 1 year ago

Is your feature request related to a problem? Please describe. Use and maintain Opensearch cluster is hard and costly to maintain (the AWS serverless version cost 700.80$/month).

Describe the solution you'd like Use Full text search in postgresql

https://www.postgresql.org/docs/current/textsearch.html For example : https://amitosh.medium.com/full-text-search-fts-with-postgresql-and-sqlalchemy-edc436330a0c https://www.crunchydata.com/blog/postgres-full-text-search-a-search-engine-in-a-database

Additional context I can help to add this functionality, I'm very interested by the product but Opensearch/ES is clearly a blocking point.

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

chamiles commented 1 year ago

Curious if the Opensearch serverless version that was announced a re:invent would be more cost effective if that is the blocking point. what maintenance activities make it difficult to maintain?

blackrez commented 1 year ago

Sadly, the serverless version is not cost effective. The maintenance activities makes opensearch hard to maintain : the reindex, the mapping change, upgrading cluster... I know AWS had done a great jobs to make opensearch better but for little organisation, it is hard to operate.

dlpzx commented 1 year ago

Hi @blackrez, I agree that OpenSearch indexes and mappings are hard to update in OpenSearch. Also, OpenSearch is the service that takes the bigger portion of data.all costs

You are more than welcome to work on implementing the functionality :) Fork the code and open a PR when the code is ready. Keep in mind that to make it mergeable, you will need to make the "underlying catalog database" configurable, something like adding a parameter in the cdk.json that defines whether we use RDS or Opensearch.

Feel free to add here your questions, comments or designs and we can discuss them

dlpzx commented 1 year ago

Hi @blackrez we are exploring the possibility to implement OpenSearch serverless into data.all (@kukushking). For the case of data.all, where we hardly ever update mappings or have to re-index, do you see any limitations into using OpenSearch serverless as a cost-effectiveness solution? Could you give more detail about other alternatives? Your opinion is highly appreciated :)

blackrez commented 1 year ago

Hello @dlpzx,

From my point of view, OpenSearch Serverless is not at all cost-effective.

You can configure a maximum number of OCUs for your account in order to control costs. You're billed for a minimum of 4 OCUs allocated for your workloads when you create a collection.

It will cost 0,24$4OCU730h = 700$

My main alternative is the full text search from PostGreSQL. The creation of index will be simpler (https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX). I don't see any technical limitations to achieve this. Most text functions are present in and it works out of the box.

I found a lot of code about the index mapping and creation but not a lot about the search itself.

dlpzx commented 9 months ago

@InasK6, I believe your customer conducted a cost analysis on this component of data.all. Would it be possible to share the results?

InasK6 commented 9 months ago

hello @dlpzx @blackrez

Context

Indeed, the main findings from my cost analysis are that

OpenSearch drives the cost of the platform
from the documentation of Opensearch serverless here, we can see that:
- When you create your first Opensearch serverless collection, 4OCUs are instantiated. All subsequent collections can share these OCUs. But one collection can, at most, scale down to 2 OCUs for indexing and 2 OCUs for search. architecture reference
- 1 OCU = 6GB + 1vCPU + data transfer to S3
- At the moment, the estimated cost of using these 4 OCUs is around 660$/Month

Solutions

I have considered two options to reduce these cost:

Replacing the catalog search capability by RDS filters
Consider a managed Opensearch cluster with smaller instances

Implementing RDS search capabilites

Advantages

If we look at the filters currently performed from the catalog UI, we can see that they can be translated queries and do not necessitate heavy searches on indexed text
We already have the data stored in RDS, no extra cost would be incurred for storage
This would lead to a cost reduction of 100% by eliminating the Opensearch component

Challenges

The catalog UI implementation is heavily linked to OpenSearch. It uses UI components with appbaseio/reactivesearch library which generate Elasyic search queries from frontend -> to API Gateway -> Search handler Integrating RDS search capability means implementing the capability end to end from frontend to graphQL queries to RDS queries and integration with the api_handler. Therefore, extra developer time
Some features were implemented on top of the current implementation of the Catalog using the current Catalog API calls structure. Therefore, that would mean, needing to adapt every extra feature implemented. Example: Any feature that have been developped modifying the frontend, happened based on the appbaseio/reactivesearch library components syntax

Managed OpenSearch

Advantages

Developer effort: You only need to switch to deploy data.all with Managed OpenSearch by setting up enable_opensearch_serverless flag in cdk.json file to false + change two lines of code to specify the instance type of your master nodes and worker nodes Note You cannot chose M3, R3, T2 types. You will get an error saying they do not support encryption of data at rest for your Managed OpenSearch cluster
Cost reduction: We have considered multiple configurations for our workload (460 KB of data stored in our cluster)
1. 2 master 2 worker with T3.medium : $660/month to $145/month → 78% cost reduction
2. 3 master 3 worker with T3.medium : $660/month to $218/month → 67% cost reduction
3. 3 masters and 2 workers with T3.medium : $660/month to $194/month → 71% cost reduction

Challenges

You might need to perform load testing to estimate the right instance size
Operational Overhead: As you mentionned, since it is not serverless, you handle the reindex, the mapping change, upgrading cluster... It worth considering at what frequency do you need to do this
Rollback: If you have already deployed your platform with OpenSearch Serverless, you will need extra effort to migrate back to Managed OpenSearch (ex: copying the data etc...)

Conclusion

Tradeoffs need be considered regarding this choice. If you want to go for the RDS implementation, That would need more developer effort. I have already started implementing db functions and GraphQL objects to test on one of the filter options of the Catalog. So, I could help starting on that. If you consider the Managed OpenSearch option, extra operational or testing overhead can be needed to estimate the right size for your need, but it only needs two lines of code changes

dlpzx commented 1 month ago

We will close this issue as the moment as we are not planning to implement it. If interest grows we can reopen it and work on it.

data-dot-all / dataall

Use postgreSQL Full Text Search instead of ElasticSearch #235

Context

Solutions

Implementing RDS search capabilites

Managed OpenSearch

Conclusion