Open sailerinteractive opened 4 years ago
@sailerinteractive as you noticed, we dropped it in Search 2.0; we didn't find a way to maintain this feature in 2.0. We agree it's an important feature and we might add it back in the future.
BTW, why are you keeping the documents in an SQL DB and not in Redis?
Thanks a lot for getting back to me! Good to know that I understood the implications of the Search 2.0 change correctly.
I have some other code interacting with the documents through SQL, and storing a large number of documents persistently seems like the better option in terms of data integrity and safety.
I'll stick with version 1 for a while and postpone the decision until I know which direction you are taking with the feature.
Just ran into this. While a 16GB Redis server was enough for our dataset with version 1 and NOSAVE, version 2 just crashed due to lack of memory.
@lizdeika Can you give some details about the context of this crash?
Here is the kind of information that could be useful:
@emmanuelkeller
- number of documents: ~3,000,000
- average size of a document: not sure, maybe ~3KB
- index schema (a rough `FT.CREATE` mapping follows the block):
```rb
{
  id: { numeric: { sortable: true } },
  relation_id: :text,
  anything: :text, # concatenated from several fields in the database to be able to do full-text search
  state: :text,
  email: :tag,
  relation2_id: :text,
  relation3_id: :text,
  sum: :numeric,
  bool_value: :text,
  another_state: :text,
  created_at_unix_timestamp: { numeric: { sortable: true } }
}
```
Thanks for the details, it makes more sense now.
The main difference with v2 is that the documents have to be stored in memory as hashes in Redis. In your case, doing the math (3,000,000 × 3KB), there is a memory overhead of about 8GB.
You may use `INFO memory` to get an idea of the memory cost of storing the hashes.
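As a rough way to put numbers on that, a sketch like the following (redis-py; the sample key `doc:1` and the document count are placeholders taken from this thread) reads `INFO memory` and samples the per-hash cost with `MEMORY USAGE`:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Overall memory picture, the same data as the INFO memory dump below.
mem = r.info("memory")
print("used_memory_human:", mem["used_memory_human"])
print("used_memory_rss_human:", mem["used_memory_rss_human"])

# Sample the cost of a single stored document hash (key name is a placeholder).
per_doc_bytes = r.memory_usage("doc:1")
if per_doc_bytes is not None:
    num_docs = 3_000_000  # from the numbers quoted in this thread
    total_gib = per_doc_bytes * num_docs / 2**30
    print(f"~{per_doc_bytes} bytes/doc -> ~{total_gib:.1f} GiB for {num_docs:,} docs")
```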
Yup, that's why NOSAVE is so good: for search, only the index is in Redis and the payloads are loaded from MySQL :) So the unused payload data is not stored in Redis itself.
Current `INFO memory` looks like this:
```
# Memory
used_memory:6786340680
used_memory_human:6.32G
used_memory_rss:7068696576
used_memory_rss_human:6.58G
used_memory_peak:6928918384
used_memory_peak_human:6.45G
used_memory_peak_perc:97.94%
used_memory_overhead:1084534196
used_memory_startup:822632
used_memory_dataset:5701806484
used_memory_dataset_perc:84.03%
allocator_allocated:6786607720
allocator_active:6986178560
allocator_resident:7124312064
total_system_memory:33677918208
total_system_memory_human:31.37G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.03
allocator_frag_bytes:199570840
allocator_rss_ratio:1.02
allocator_rss_bytes:138133504
rss_overhead_ratio:0.99
rss_overhead_bytes:-55615488
mem_fragmentation_ratio:1.04
mem_fragmentation_bytes:282396920
mem_not_counted_for_evict:2964
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:201992
mem_aof_buffer:2964
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
```
This means that for V2, `used_memory_rss_human` would look something like 6.58 + 8 = 14.58G?
P.S. `total_system_memory_human` was 16G when I was trying out V2.
I also have a use case for this. I was hoping to evaluate Redisearch to replace my Elasticsearch cluster, but the lack of NOSAVE is a non-starter for me.
I have hundreds of millions of documents averaging about a KB in size, which see a very heavy mutation rate because they are user-controlled data (a good analog is JIRA cards, for example).
My indexing pipeline is designed to tolerate latency, and I'm comfortable with indexing updates getting behind indefinitely. The content, if it's stored in the search server, is nonsense to me* because it might be behind the source of authority indefinitely. So I use the search server to get the docids of the search results, and then ask the doc store for the actual, up-to-date content. This works really well in production today, with Elasticsearch (`_source` set to `enabled: false`) as the search server.
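Assuming RediSearch as the search server, that read path might look roughly like the sketch below. The index name, key scheme, and `fetch_from_primary_store` are hypothetical: search returns only doc ids (via `NOCONTENT`), and the authoritative content comes from the primary store.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fetch_from_primary_store(doc_ids):
    # Placeholder for the authoritative doc store (SQL, etc.);
    # in this pattern the search server never serves document bodies.
    return {doc_id: {"id": doc_id} for doc_id in doc_ids}

def search(query, limit=10):
    # NOCONTENT: ask RediSearch for matching doc ids only.
    reply = r.execute_command(
        "FT.SEARCH", "docs_idx", query, "NOCONTENT", "LIMIT", 0, limit
    )
    total, doc_ids = reply[0], reply[1:]
    # Hydrate the results from the up-to-date source of authority.
    return total, fetch_from_primary_store(doc_ids)
```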
RAM is already very expensive compared to disk, but it's much easier to justify the cost of changing to redisearch and away from the terribly unstable ES system if we're talking ~200 bytes of RAM per doc, as opposed to ~2KB per doc. To produce those numbers I indexed a million subject docs into redisearch, with and without NOSAVE, and observed the RSS memory usage.
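I haven't reproduced that exact experiment, but a measurement along those lines could be sketched like this on RediSearch 1.x (the index name, document contents, and counts are placeholders): index N docs with or without NOSAVE and compare `used_memory_rss` before and after.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Assumes a 1.x index created beforehand, e.g.:
#   FT.CREATE bench_idx SCHEMA anything TEXT

def rss_bytes():
    return r.info("memory")["used_memory_rss"]

def index_docs(n, nosave):
    before = rss_bytes()
    for i in range(n):
        args = ["FT.ADD", "bench_idx", f"doc:{i}", 1.0]
        if nosave:
            args.append("NOSAVE")          # index only, don't store the document
        args += ["FIELDS", "anything", f"sample text for document {i}"]
        r.execute_command(*args)
    return (rss_bytes() - before) / n      # rough RSS growth per doc, in bytes

# e.g. compare index_docs(1_000_000, nosave=True) vs index_docs(1_000_000, nosave=False)
```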
*Let's say you have a piece of data "Foo bar". You update it to say "bar baz". If you then search for "foo" in an autocomplete textbox and it shows up, you will understand -- the update takes time to propagate. But if that autocomplete entry SAYS "Foo bar" instead of "bar baz", then that's an experience issue that causes a lack of trust in the system. "I updated it to 'bar baz' ALREADY!" frustrated users say, or worse, they go back to the data to double-check whether they actually changed it to "bar baz" or not, and get very confused when it says "bar baz" but the autocomplete says "Foo bar". Customer support tells them search is behind, and they pretend to understand. But now they lose trust and attrite. Indexing might get behind, but we do not consider an experience design that contradicts our user, in this case, to be tenable. Therefore, storing _source (in ES terms) or not NOSAVEing (in redisearch terms) is pointless resource wastage for our use case. We wouldn't care that much if it weren't 10x more expensive to run a redisearch instance without NOSAVE, but it is, as RAM costs tend to dominate all other VM costs [for machines that are appropriate for running redisearch].
@arexsutton we are considering putting this feature back in future versions, but as @emmanuelkeller mentioned above, the RediSearch 2.0 internal implementation was refactored compared to 1.6, so it might take time before we manage to get it back.
Meanwhile, you can keep using RediSearch 1.6 or consider using RediSearch on Redis Enterprise/Cloud. With Redis Enterprise/Cloud we support the Redis on Flash extension, which practically means that most of the document memory (but not the indexes) will be stored on much cheaper flash.
Related: Since we're stuck with 1.6 for the foreseeable future, are there any plans to release 1.6.16+ and push the docker tags for 1.6.16+? Is there any long term maintenance plan?
Also related: Today at work (2k+ devs, in the job seeker market) I was talking with our search engine experts. I asked about indexing _source and one of them said, "I have never once indexed _source in my career, nor would I be caught dead indexing _source in a search system". The others agreed that was the most sound approach; they said indexing _source is for toy projects and non-critical systems (such as log analytics). Food for thought.
If NOSAVE functionality is not planned to be restored, then what use cases is Redisearch aimed at? Like, what is it for, if it's not a general-purpose production search server?
I think the answer is "RediSearch on Redis Enterprise/Cloud" :)
When should we expect Redisearch to support the NOSAVE option again?
So far I used FT.ADD with NOSAVE, as my documents are stored in an external SQL database and I only require the doc id when searching. This seemed like a smart option since it does not require additional Redis memory for the documents. I understand the design decision to couple indexing to the standard Redis commands, but I think it takes away a great use case for the module, especially for large document sets. Would it be possible to keep FT.ADD as another way to interact with the index, or is this just impossible with the new internal design?
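For readers comparing the two models, here is a rough sketch of the difference (index and key names are made up): in 1.x, `FT.ADD ... NOSAVE` indexes the fields without storing them, while in 2.0 the index is declared over a key prefix and follows hashes written with `HSET`, so the document fields necessarily live in Redis.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# RediSearch 1.x: index the fields but do not keep the document in Redis.
r.execute_command(
    "FT.ADD", "docs_idx", "doc:42", 1.0, "NOSAVE",
    "FIELDS", "anything", "full text pulled from the SQL row",
)

# RediSearch 2.0: the index follows hashes under a prefix, e.g.
#   FT.CREATE docs_idx2 ON HASH PREFIX 1 doc: SCHEMA anything TEXT
# so the fields must be stored in Redis as a hash to be indexed.
r.hset("doc:42", mapping={"anything": "full text pulled from the SQL row"})
```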