Support: How to filter/search results?

alvaroloes commented 5 years ago

Hi,

I've been reading all the Noms documentation available because it could fit for a project in my company. I love the concept and I think it has huge potential. So far I have confirmed all the characteristics we need except one: querying and filtering list, maps and sets.

Imagine I have a list of persons stored in Noms. The person data would be:

type Person struct {
    Name string
    Age int
    IsFemale bool
}

Let's say I have a dataset called persons which contains a list of a million persons.

How would I find "All women with an age between 20 and 30 called María"?

Here are my guesses with my concerns. Could you please let me know which one would be the most suitable approach?:

Use the iterAll method with an anonymous function that will check if the current person matches the query and then store it in a "results" slice. Concern: This is pretty inefficient as I'm traversing the whole List element by element.
Load the list into my Go application memory and do the search more efficiently. I could index the results as I load the list and, eventually, do a binary search. Concern: I would say this option is even less efficient than the first one and doesn't feel right.
Store several maps in the dataset that serve as indexes for the list. I could store maps that index the Persons by age, by name or both. Concern: This would effectively increase the efficiency, as it is the same thing an index in other DB engines does, but:
- Is there a way to avoid duplicating Person data in every map (maybe using List<Ref<Person>> as map value)?
- How do I keep the index up to date? Should I do it manually? (Imagine I change a person data)
- What if I insert a new Person? Should I insert it in the List and in all the maps I have as indexes?
Use Ngql. It seems that there is support of GraphQL to query data in a Noms database, but I'm a bit confused about this.
- On the one hand, there are no examples about how to use it with a Noms database and I have checked the repo and it seems it is based in a very old version of GraphQL.
- On the other hand, I think using GraphQL won't avoid us to implement an efficient search by ourselves (using any of the points I talked above), as we need to provide the resolvers.

Could you please point me in the right direction? Thank you very much!

aboodman commented 5 years ago

Short answer:

You can do a full scan (e.g. using iterAll) or manually implement indexes using Maps and Sets.
Noms doesn't have automatic index maintenance yet
Look at the nomdex sample program for an example of building and maintaining indices

Long answer:

You've done a great job understanding Noms from somewhat poor documentation. Nice.

1, 2, and 4 are all basically equivalent from an efficiency perspective. They are the same as "full table scans" in a relational database. 4 is barely implemented, so I don't recommend it.

3 would be the right way to do it more efficiently than a full scan. Within 3, you are also correct that there are several options:

Map<K, Person> - This is equivalent to a covered index. It is the fastest because it does the fewest reads, but uses the most space because the data is copied into the index. Note that if Person is a large struct, only about 4kb of it on average per-value will be copied into the index. The rest will be shared with other instances of the same Person elsewhere in the db automatically.
Map<K, PersonID> - Assuming that you have another map somewhere like Map<PersonID, Person>, this is analogous to a traditional secondary index in a database. It uses only a little extra space. "Join" can also be efficiently implemented so that this will end up only do a small amount of extra reads.
Map<K, Ref<Person>> - Assuming you also have another map somewhere like Map<PersonID, Ref<Person>> this is most space efficient but slowest because every person fetched from either the index or primary table will be its own read.

alvaroloes commented 5 years ago

Thank you very much for answering so quickly and in detail.

I don't know how I missed the nomdex example! It is exactly what I was looking for before submitting the issue.

So here are some more questions:

As shown in the nomdex example, manual indexes cover searches and I can infer that also basic sorting and pagination. However, it seems that, for example, to get the "first 10 people called Ted" I would need to fetch all the people called Ted (which can be a lot) and then take the first 10 at the application level. Auto-suggestion: Maybe I could avoid fetching all the records and get just the amount I want by doing the following:
- Search the index (which is already sorted by name) and get the first person matching the criteria
- Use IterAt or IterRange checking again the criteria in the callback function and stopping when I have the amount I wanted.
Is there any plan to implement a query system that supports at least filtering, sorting and pagination?
Is there any plan to add automatic creation/maintenance for indexes? From your previous comment "Noms doesn't have automatic index maintenance yet", it seems so. If this is correct, any idea of when it would be available?

Don't get me wrong, I don't want to put any kind of pressure. I just love the concept and the project and want to have some idea of the status and the roadmap. It would be very useful for several projects in my company too.

I'm a developer and am willing to book some spare time to help with any of the above features 😉

aboodman commented 5 years ago

As shown in the nomdex example, manual indexes cover searches and I can infer that also basic sorting and pagination. However, it seems that, for example, to get the "first 10 persons called Tedd" I would need to fetch all the persons called Tedd (which can be a lot) and then take the first 10 at the application level. Auto-suggestion: Maybe I could avoid fetching all the records and get just the amount I want by doing the following: Search the index (which is already sorted by name) and get the first person matching the criteria Use IterAt or IterRange checking again the criteria in the callback function and stopping when I have the amount I wanted.

Yep this is how you would do it. (this is exactly how it would be implemented by a query system too).

Is there any plan to implement a query system that supports at least filtering, sorting and pagination?

No, just vague ideas.

Is there any plan to add automatic creation/maintenance for indexes? From your previous comment "Noms doesn't have automatic index maintenance yet", it seems so. If this is correct, any idea of when it would be available?

Again, not really.

Typically applications that use Noms handle this at the application layer because their data model is much more restricted than Noms' so it's quite a lot easier.

I'm open to it, I just don't know what the general design looks like right now.

Can you say more about your use case and what you're looking for? Perhaps I can help you build something specific for that use case, and it can also feed into what a general system might look like.

alvaroloes commented 5 years ago

Thank you!

Can you say more about your use case and what you're looking for? Perhaps I can help you build something specifically for that use case, and it can also feed into what a general system might look like.

We are creating a project that follows an event-driven architecture similar to CQRS. There are many parts in the system, but one of them is the "events store": A record of every fact that has happened in the system.

When those facts occur, they are emitted as events (e.g. "user-created") and are stored in the "events store". It acts as the real "source of truth" and must be event-sourced. What I mean with this is that we should be able to traverse the flow of events and stop at any moment in time to know the status of the system at that moment.

There are several approaches to achieve this, but Noms seems like a good fit in this scenario (that's what I want to confirm) as we can go back to any previous commit and already have the system status ready (instead of "reducing" all the events from the latest snapshot to the moment we are interested in).

One more question: Which is the kind of projects where Noms naturally fits as it is right now? I'm especially interested in the use-cases you had in mind that motivated the creation of Noms.

Thanks for being so responsive!

aboodman commented 5 years ago

There are several approaches to achieve this, but Noms seems like a good fit in this scenario (that's what I want to confirm) as we can go back to any previous commit and already have the system status ready (instead of "reducing" all the events from the latest snapshot to the moment we are interested in).

Exactly.

Which is the kind of projects where Noms naturally fits as it is right now? I'm especially interested in the use-cases you had in mind that motivated the creation of Noms.

Noms was originally intended to support a particular application which never saw the light of day (something sort of similar to Perkeep).

I've lately come to learn about this 'event sourcing' pattern, and I do think there are some natural places for Noms to fit there.

attic-labs / noms

Support: How to filter/search results? #3845