bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Explore options for fuzzy-match and search suggestions #605

Open danielballan opened 10 months ago

danielballan commented 10 months ago

The built-in MapAdapter and the external databroker.mongo_normalized adapter both support the FullText query. We will add support for FullText in the built-in SQL-backed Catalog Adapter in #456 and #457, for SQLite and PostgreSQL respectively.

Next, we should consider fuzzy match and search suggestions. This has often been done with the ELK stack, but that is a heavy stack to take on for the sake of just one of its features. What are our options?

@Kezzsim highlighted the project typesense, which is exactly targeted at serving this use case without taking on the weight of ELK.

Also, I believe there is some functionality in this space available in SQLite and PostgreSQL. While not at the level of ELK, it would be good to understand precisely how far we can get with the tech stack we already have, and what its limitations are.
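
As one data point on the SQLite side, here is a minimal sketch of prefix-based search suggestions using the FTS5 extension through Python's standard sqlite3 module. It assumes the local SQLite build has FTS5 compiled in; the table name, column names, and sample rows are made up for illustration. Prefix queries cover suggest-as-you-type, but as far as I know FTS5 does not tolerate misspellings by itself, so this illustrates a baseline rather than true fuzzy matching.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index over (run_id, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # run_id is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(run_id UNINDEXED, body)")
    conn.executemany("INSERT INTO docs (run_id, body) VALUES (?, ?)", rows)
    return conn


def suggest(conn, prefix, limit=5):
    """Return run_ids of rows containing a token that starts with prefix."""
    # Quote the user input and append '*' to form an FTS5 prefix query.
    query = '"{}"*'.format(prefix.replace('"', '""'))
    cursor = conn.execute(
        "SELECT run_id FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]


# Hypothetical sample data, for illustration only.
conn = build_index(
    [
        ("run-1", "detectors pilatus temperature"),
        ("run-2", "motor scan"),
        ("run-3", "detuned laser"),
    ]
)
matches = suggest(conn, "det")
```

Here `matches` contains the rows with a token beginning with "det" ("detectors" and "detuned"), ordered by FTS5's built-in rank.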

danielballan commented 8 months ago

In discussions with @Kezzsim, we are going ahead with TypeSense, as an optional add-on in the same way that Prometheus is an optional add-on.

I think that this will involve:

  1. Adding a new optional argument typesense to the Catalog constructors, which takes None (the default, meaning no Typesense) or a config dict like
{
  'api_key': 'Hu52dwsas2AdxdE',
  'nodes': [{
    'host': 'localhost',
    'port': '8108',
    'protocol': 'http'
  }],
  'connection_timeout_seconds': 2
}

https://github.com/bluesky/tiled/blob/c76d1b3bf0468df8497568dfd9d6580207479a40/tiled/catalog/adapter.py#L1135-L1169

Tiled config like:

trees:
 - tree: catalog
   args:
     uri: postgresql+asyncpg://...
     typesense:
       api_key: $TYPESENSE_API_KEY
       nodes:
         - host: localhost
           port: 8108
           protocol: http
       connection_timeout_seconds: 2

will just work, with no code changes to the config parser.
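
The `$TYPESENSE_API_KEY` reference presumably relies on environment variables being expanded when the config is loaded. A rough sketch of that kind of expansion over an already-parsed config, using only the standard library; `expand_env` is a hypothetical helper for illustration, not Tiled's actual parser code:

```python
import os


def expand_env(value):
    """Recursively expand $VAR references in the strings of a parsed config."""
    if isinstance(value, str):
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # ints, None, etc. pass through unchanged


# The parsed form of the `typesense:` section above.
typesense_config = {
    "api_key": "$TYPESENSE_API_KEY",
    "nodes": [{"host": "localhost", "port": 8108, "protocol": "http"}],
    "connection_timeout_seconds": 2,
}
```

Calling `expand_env(typesense_config)` returns a new dict with the API key filled in from the environment, which could then be handed to the Catalog constructor as the typesense argument.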

  2. Passing that config dict into Context.__init__ and creating an instance of typesense.Client, held as self.typesense_client on the Context.

https://github.com/bluesky/tiled/blob/c76d1b3bf0468df8497568dfd9d6580207479a40/tiled/catalog/adapter.py#L111-L119

  3. Also in Context.__init__, registering after_insert (https://docs.sqlalchemy.org/en/20/orm/events.html#sqlalchemy.orm.MapperEvents.after_insert) and after_update SQLAlchemy events that make the relevant calls on self.typesense_client. (I am still not entirely clear on what these hooks give you access to, but the docs look promising.)

  4. Adding a new module tiled.commandline._typesense and updating tiled.commandline.main to add a tiled typesense subcommand to the CLI. I imagine we will need:

tiled typesense init TYPESENSE_URL [ANOTHER_TYPESENSE_URL] # define schemas
tiled typesense rebuild TYPESENSE_URL [ANOTHER_TYPESENSE_URL]  # drop data (if any) and rebuild

The utility urllib.parse.urlparse can be used to get from a CLI-friendly string like http://localhost:8108?api_key=Hu52dwsas2AdxdE into the structure:

{
  'api_key': 'Hu52dwsas2AdxdE',
  'nodes': [{
    'host': 'localhost',
    'port': '8108',
    'protocol': 'http'
  }],
  'connection_timeout_seconds': 2
}
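
A minimal sketch of that conversion, carrying the API key over from the query string. The function name `typesense_config_from_url` and the default timeout are assumptions made for illustration; only urllib.parse from the standard library is used.

```python
from urllib.parse import parse_qs, urlparse


def typesense_config_from_url(url, timeout_seconds=2):
    """Convert 'http://host:port?api_key=...' into a Typesense-style config dict."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    if "api_key" not in params:
        raise ValueError("URL must include an api_key query parameter")
    if parsed.hostname is None or parsed.port is None:
        raise ValueError("URL must include an explicit host and port")
    return {
        "api_key": params["api_key"][0],
        "nodes": [
            {
                "host": parsed.hostname,
                # Port kept as a string, as in the structure above.
                "port": str(parsed.port),
                "protocol": parsed.scheme,
            }
        ],
        "connection_timeout_seconds": timeout_seconds,
    }


config = typesense_config_from_url("http://localhost:8108?api_key=Hu52dwsas2AdxdE")
```

Passing the key in the URL keeps the CLI to a single argument per server, at the cost of the key showing up in shell history; reading it from an environment variable would be an alternative.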

danielballan commented 8 months ago

All of the above is up for a rethink. It is just meant as a quick sketch to highlight the relevant sections of the Tiled code that I can see will need to be touched.

danielballan commented 7 months ago

From discussion on 20 Feb:

typesense_ingestion:
 - spec: BlueskyRun
   fields:
   - name: detectors  # field name in TypeSense
     path: "start.detectors"  # path into Tiled JSON metadata
     # Also type?
 - spec: SomeOtherThing
   ...
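
To make the intent of this mapping concrete, here is a rough sketch of how such rules might turn a node's metadata into a flat document for the search index. The helper names, the rule layout as Python dicts, and the use of the node key as the document `id` are all assumptions for illustration; nothing here is settled design.

```python
def get_path(metadata, dotted_path):
    """Follow a dotted path such as 'start.detectors' into nested dicts."""
    node = metadata
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None  # path is absent in this node's metadata
        node = node[part]
    return node


def build_document(key, specs, metadata, ingestion_rules):
    """Build a flat document from the first rule whose spec matches, else None."""
    for rule in ingestion_rules:
        if rule["spec"] in specs:
            doc = {"id": key}
            for field in rule["fields"]:
                value = get_path(metadata, field["path"])
                if value is not None:
                    doc[field["name"]] = value
            return doc
    return None  # no rule applies, so the node is not indexed


# Parsed form of the config above (first entry only).
rules = [
    {
        "spec": "BlueskyRun",
        "fields": [{"name": "detectors", "path": "start.detectors"}],
    }
]
metadata = {"start": {"detectors": ["pilatus"], "uid": "abc"}}
doc = build_document("abc", ["BlueskyRun"], metadata, rules)
```

Nodes whose specs match no rule produce None and would simply be skipped during ingestion.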

danielballan commented 7 months ago

https://github.com/bluesky/event-model/blob/main/event_model/schemas/run_start.json

danielballan commented 7 months ago

# config.yml
authentication:
  # The default is false. Set to true to enable any HTTP client that can
  # connect to _read_. An API key is still required to write.
  allow_anonymous_access: false
  single_user_api_key: "secret"  # for dev
trees:
  - path: /
    tree: catalog
    args:
      uri: "sqlite+aiosqlite:///:memory:"
      # or, uri: "sqlite+aiosqlite:////catalog.db"
      # or, "postgresql+asyncpg://..."
      writable_storage: "data/"
      init_if_not_exists: true
      typesense_client:
        schema:
        connection_info:

$ tiled serve config config.yml

danielballan commented 2 months ago

image