cc-archive / open-ledger

Prototype code and examples for work on the Creative Commons "CC Search" project
MIT License
48 stars 23 forks source link

Evaluate PostgreSQL vs Elasticsearch for our use case; come up with better replication scheme if we stick with Elasticsearch #237

Closed aldenstpage closed 6 years ago

aldenstpage commented 6 years ago
aldenstpage commented 6 years ago

I set up PostgreSQL FTS with our full dataset (10MM images) and got promising results. I was able to achieve good performance searching the entire database on my laptop. It's not as fast as Elasticsearch, but it is fast enough (10 to 50ms queries) and there is no separate indexing step, as opposed to Elasticsearch's 3 day indexing stage. See the psql-fulltext branch for details on how this is implemented. It would definitely be a big win (cost savings on Elasticsearch Service + simpler operation + simpler architecture) if we can use Postgres for everything.

The biggest risk is that our database will grow and Postgres FTS won't be able to handle it, at which point we'll just have to switch Elasticsearch back on. This would be a really lousy outcome, so I'm doing some load testing to figure out Postgres FTS limits. I am working on a script that fills up our development database with 300MM mock image records.

Depending on the results of the load test, I will either migrate our old Elasticsearch queries to PostgreSQL or start implementing some of my ideas to improve our Elasticsearch replication pipeline.

aldenstpage commented 6 years ago

Postgres FTS works well on hundreds of millions of records. We could definitely make the switch if needed. However, I am concerned about how well this will scale to many concurrent users in the future. Sticking with Elasticsearch will give us more flexibility with scaling. It's also a lot more responsive when we're returning large numbers of search results at once, which the API may need to support. Issues with indexing speed can be addressed by using their bulk insert API properly and tuning configuration. If AWS Elasticsearch Service gets too expensive in the future, there are some other avenues to reduce costs by tuning cluster size, or even self-hosting it on EC2.

TL;DR Postgres FTS may not perform adequately when traffic spikes, as we have seen in the past when media outlets or social media pick up our projects. Let's stick with Elasticsearch.