data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0
220 stars 77 forks source link

Use postgreSQL Full Text Search instead of ElasticSearch #235

Closed blackrez closed 1 month ago

blackrez commented 1 year ago

Is your feature request related to a problem? Please describe. Use and maintain Opensearch cluster is hard and costly to maintain (the AWS serverless version cost 700.80$/month).

Describe the solution you'd like Use Full text search in postgresql

https://www.postgresql.org/docs/current/textsearch.html For example : https://amitosh.medium.com/full-text-search-fts-with-postgresql-and-sqlalchemy-edc436330a0c https://www.crunchydata.com/blog/postgres-full-text-search-a-search-engine-in-a-database

Additional context I can help to add this functionality, I'm very interested by the product but Opensearch/ES is clearly a blocking point.

P.S. Please Don't attach files. Add code snippets directly in the message body instead.

chamiles commented 1 year ago

Curious if the Opensearch serverless version that was announced a re:invent would be more cost effective if that is the blocking point. what maintenance activities make it difficult to maintain?

blackrez commented 1 year ago

Sadly, the serverless version is not cost effective. The maintenance activities makes opensearch hard to maintain : the reindex, the mapping change, upgrading cluster... I know AWS had done a great jobs to make opensearch better but for little organisation, it is hard to operate.

dlpzx commented 1 year ago

Hi @blackrez, I agree that OpenSearch indexes and mappings are hard to update in OpenSearch. Also, OpenSearch is the service that takes the bigger portion of data.all costs

You are more than welcome to work on implementing the functionality :) Fork the code and open a PR when the code is ready. Keep in mind that to make it mergeable, you will need to make the "underlying catalog database" configurable, something like adding a parameter in the cdk.json that defines whether we use RDS or Opensearch.

Feel free to add here your questions, comments or designs and we can discuss them

dlpzx commented 1 year ago

Hi @blackrez we are exploring the possibility to implement OpenSearch serverless into data.all (@kukushking). For the case of data.all, where we hardly ever update mappings or have to re-index, do you see any limitations into using OpenSearch serverless as a cost-effectiveness solution? Could you give more detail about other alternatives? Your opinion is highly appreciated :)

blackrez commented 1 year ago

Hello @dlpzx,

From my point of view, OpenSearch Serverless is not at all cost-effective.

You can configure a maximum number of OCUs for your account in order to control costs. You're billed for a minimum of 4 OCUs allocated for your workloads when you create a collection.

It will cost 0,24$4OCU730h = 700$

My main alternative is the full text search from PostGreSQL. The creation of index will be simpler (https://www.postgresql.org/docs/current/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX). I don't see any technical limitations to achieve this. Most text functions are present in and it works out of the box.

I found a lot of code about the index mapping and creation but not a lot about the search itself.

dlpzx commented 9 months ago

@InasK6, I believe your customer conducted a cost analysis on this component of data.all. Would it be possible to share the results?

InasK6 commented 9 months ago

hello @dlpzx @blackrez

Context

Indeed, the main findings from my cost analysis are that

  1. OpenSearch drives the cost of the platform
  2. from the documentation of Opensearch serverless here, we can see that:
    • When you create your first Opensearch serverless collection, 4OCUs are instantiated. All subsequent collections can share these OCUs. But one collection can, at most, scale down to 2 OCUs for indexing and 2 OCUs for search. architecture reference
    • 1 OCU = 6GB + 1vCPU + data transfer to S3
    • At the moment, the estimated cost of using these 4 OCUs is around 660$/Month

Solutions

I have considered two options to reduce these cost:

  1. Replacing the catalog search capability by RDS filters
  2. Consider a managed Opensearch cluster with smaller instances

Implementing RDS search capabilites

Advantages

Challenges

Managed OpenSearch

Advantages

Challenges

Conclusion

Tradeoffs need be considered regarding this choice. If you want to go for the RDS implementation, That would need more developer effort. I have already started implementing db functions and GraphQL objects to test on one of the filter options of the Catalog. So, I could help starting on that. If you consider the Managed OpenSearch option, extra operational or testing overhead can be needed to estimate the right size for your need, but it only needs two lines of code changes

dlpzx commented 1 month ago

We will close this issue as the moment as we are not planning to implement it. If interest grows we can reopen it and work on it.