ar-io / ar-io-node

A scalable and modular gateway built for the permaweb atop the Arweave permanent data storage network.
https://ar.io
GNU Affero General Public License v3.0

Implement Fuzzy Search Using PostgreSQL #137

Open foxoffire33 opened 4 months ago

foxoffire33 commented 4 months ago

As part of the project I'm working on, I need to implement fuzzy search functionality. For example, if a user searches for an artist and makes a typo, I still want to return the correct result. To achieve this, I plan to use a built-in feature of PostgreSQL, and I am currently testing against a PostgreSQL database.

Additionally, I have noticed that sorting is currently broken, and I might need some help resolving this issue. The database structure will follow in a future update.

I would appreciate any feedback you might have on this approach.
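For context, one common way to get typo-tolerant matching in Postgres is the `pg_trgm` extension. The sketch below illustrates that approach under stated assumptions; the `artists` table and `name` column are hypothetical placeholders, and this is not necessarily the exact mechanism the PR uses:

```sql
-- Enable trigram support (ships with standard Postgres contrib modules).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index keeps similarity searches fast on large tables.
CREATE INDEX artists_name_trgm_idx ON artists USING gin (name gin_trgm_ops);

-- "%" is the trigram similarity operator: it matches rows whose name is
-- similar to the search term above pg_trgm.similarity_threshold
-- (default 0.3), so a typo like 'Beatels' can still find 'Beatles'.
SELECT name, similarity(name, 'Beatels') AS score
FROM artists
WHERE name % 'Beatels'
ORDER BY score DESC
LIMIT 10;
```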

coderabbitai[bot] commented 4 months ago

Walkthrough

The recent changes introduce several new SQL scripts, configuration files, and TypeScript updates to enhance the database operations for a project. These updates include creating views and indexes for efficient searches, adding dependencies, configuring data loading from SQLite to PostgreSQL, and defining database-related constants. Additionally, new SQL queries for data retrieval, deletion, insertion, and updates are added, along with TypeScript functions and types for handling Postgres database operations.

Changes

| Files/Paths | Change Summary |
| --- | --- |
| `migrations/...createViewForWildcardSearch.sql` | Added SQL migration script to create view `transactionsTagsDecoded` and indexes on `tag_names` and `tag_values`. |
| `migrations/down/...createViewForWildcardSearch.sql` | Added down migration commands to drop view `transactionsTagsDecoded` and indexes `tagNamesIndex` and `tagValuesIndex`. |
| `package.json` | Added `@types/pg` and `@types/yesql` dependencies, updated `pg` dependency. |
| `pgloader.conf` | Introduced configuration for loading data from SQLite to PostgreSQL. |
| `src/config.ts` | Added database-related constants using environment variables. |
| `src/database/pgql/bundles/...` | Introduced multiple SQL files with queries for various operations like data retrieval, deletion, and insertion. |
| `src/database/pgql/core/...` | Added SQL files with queries for core database operations, including accessors, imports, and stats. |
| `src/database/pgql/data/...` | Added SQL queries for handling content attributes and nested content. |
| `src/database/pgql/moderation/...` | Introduced SQL queries for moderation checks and blocking data. |
| `src/database/postgress/PostgressDatabaseHelpers.ts` | Added functions for handling transactions and data items in Postgres. |
| `src/database/postgress/PostgressDatabaseTypes.ts` | Introduced types and interfaces for worker pools, debug information, and database operations. |
| `src/database/postgress/StandalonePostgresDatabase.ts` | Added `StandalonePostgresDatabase` class for various database operations, including worker management and data handling. |
| `src/database/sql/core/accessors.sql` | Added `FOR SHARE` clause to a subquery, affecting locking behavior in SQL queries. |
| `src/database/standalone-postgres.ts` | Implemented Postgres database worker for various operations like fetching data, saving transactions, and handling errors. |

Sequence Diagram(s) (Beta)

```mermaid
sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to insert data
    API->>Database: Execute upsertNewDataItem SQL query
    Database-->>API: Data inserted/updated
    API-->>User: Response with success/failure
```

```mermaid
sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to fetch data attributes
    API->>Database: Execute selectDataAttributes SQL query
    Database-->>API: Return data attributes
    API-->>User: Response with data attributes
```

djwhitt commented 4 months ago

Hey @foxoffire33, cool work! I'm happy to talk through issues with this. I'm not sure whether we'll want to integrate it into develop/main in the near term, but we could maintain a separate branch for those who want to play around with it. That way we could continue to iterate on development and get some feedback on the level of interest from the community.

I'll follow up later today or tomorrow with more thoughts about the approach and other potential fuzzy search options. Thanks again for all the work you put into this!

djwhitt commented 4 months ago

@foxoffire33 I've been thinking about this more today. What I'm initially curious about is what kind of fuzzy search you want to provide and which specific Postgres features you intend to use for it. (Apologies if this is visible in the diffs; I probably won't be able to review those in detail till next week.) I'm also curious whether you considered using the webhook support to populate some external index (e.g., OpenSearch or similar), and if not, why.

Regarding potential integration - how much effort are you willing to put into maintaining this? We would like to add Postgres support in develop/main at some point, but envisioned using a somewhat different schema from SQLite when we eventually implemented it. If you're interested in pursuing that, I could give you some info about the schema design we had in mind.

foxoffire33 commented 3 months ago

I tried to do it with Meilisearch first, but it doesn't fit the case perfectly. Implementing the functionality I need would be similar in cost to changing the gateway itself.

Since this is for a project I'm working on, I'm happy to maintain a Postgres solution for fuzzy search as long as we need it.

I'm also interested in the database design you already have in place, and I might be able to contribute some code for that.

djwhitt commented 3 months ago

> I tried to do it with Meilisearch first, but it doesn't fit the case perfectly. Implementing the functionality I need would be similar in cost to changing the gateway itself.

Gotcha, just wanted to make sure you were aware of the webhook option.

> Since this is for a project I'm working on, I'm happy to maintain a Postgres solution for fuzzy search as long as we need it.

Great! What's the current status of the PR? Is it working for your use case now or are you still sorting through issues?

> I'm also interested in the database design you already have in place, and I might be able to contribute some code for that.

Also great! I don't think I'll get to it till next week at this point, but I'll write up some details soon.

foxoffire33 commented 3 months ago

I'm still working on it; I don't have a lot of time.

There is an issue with sorting in queries. I will provide an update when I know more.

djwhitt commented 2 months ago

Long overdue update from our side: we're working on a version of the schema I'd recommend (on an internal legacy database in a different code base). It's similar to the SQLite schema but uses a single height-partitioned table in Postgres as opposed to separate 'new' and 'stable' tables. We represent null heights with a very large positive integer, since partition keys can't be nullable. The advantage of this approach is that all the new/stable flushing is handled automatically by Postgres.
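A minimal sketch of that partitioning idea (the table name, partition bounds, and sentinel value below are illustrative assumptions, not the exact values used internally):

```sql
-- Height-partitioned table; "new" rows that have no height yet use a
-- sentinel value because the partition key cannot be NULL.
-- 2147483647 (max int4) is an assumed sentinel for "not yet stable".
CREATE TABLE txs (
    height_partition_key integer NOT NULL,
    id character(43) NOT NULL,
    PRIMARY KEY (height_partition_key, id)
) PARTITION BY RANGE (height_partition_key);

-- Stable rows live in ordinary range partitions...
CREATE TABLE txs_0_1000000 PARTITION OF txs
    FOR VALUES FROM (0) TO (1000000);

-- ...and new rows land in the sentinel partition. Setting a real height
-- is just an UPDATE; Postgres moves the row to the matching partition,
-- replacing the explicit new/stable flushing done with SQLite.
CREATE TABLE txs_new PARTITION OF txs
    FOR VALUES FROM (2147483647) TO (MAXVALUE);
```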

foxoffire33 commented 1 month ago

Is it possible to share the SQL schema?

djwhitt commented 1 month ago

Sure. This should give you the general idea, but keep in mind this is from a different DB with some legacy decisions (strings instead of binary fields). It's also under active development, so some details might still change.

```sql
CREATE TABLE data_items (
    height integer,
    height_partition_key integer,
    id character(43) NOT NULL,
    raw_offset bigint,
    raw_size bigint,
    signature character varying, -- only here to support ANS102
    signature_type bigint,
    signature_offset bigint,
    signature_size bigint,
    owner_address character(43),
    owner_offset bigint,
    owner_size bigint,
    data_offset bigint,
    data_size bigint,
    target character(43),
    quantity character varying,
    reward character varying,
    anchor character varying,
    content_type character varying,
    format smallint,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    parent character(43),
    parent_root_tx_offset bigint, -- offset of the parent relative to the root TX
    root_tx_id character(43), -- same as parent when it's not a nested data-item
    PRIMARY KEY (height_partition_key, id)
) PARTITION BY RANGE (height_partition_key);

CREATE INDEX data_item_id_idx ON data_items (id);
CREATE INDEX data_item_height_part_key_desc_id_idx ON data_items (height_partition_key DESC, id);
CREATE INDEX data_item_owner_address_height_part_key_id_idx ON data_items (owner_address, height_partition_key, id);
CREATE INDEX data_item_owner_address_height_part_key_desc_id_idx ON data_items (owner_address, height_partition_key DESC, id);
CREATE INDEX data_item_target_height_part_key_id_idx ON data_items (target, height_partition_key, id);
CREATE INDEX data_item_target_height_part_key_desc_id_idx ON data_items (target, height_partition_key DESC, id);

CREATE TABLE data_item_tags (
    height integer,
    height_partition_key integer,
    tx_id character(43) NOT NULL,
    index integer NOT NULL,
    name_hash bytea NOT NULL,
    value_hash bytea NOT NULL,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    PRIMARY KEY (height_partition_key, tx_id, index)
) PARTITION BY RANGE (height_partition_key);

CREATE INDEX data_item_tags_name_value_height_part_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key, tx_id);
CREATE INDEX data_item_tags_name_value_height_part_key_desc_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key DESC, tx_id);

CREATE TABLE data_item_tag_names (
    name_hash bytea NOT NULL PRIMARY KEY,
    name character varying NOT NULL
);

CREATE INDEX data_item_tag_names_name_idx ON data_item_tag_names (name);

CREATE TABLE data_item_tag_values (
    value_hash bytea NOT NULL PRIMARY KEY,
    value character varying NOT NULL
);

CREATE INDEX data_item_tag_values_value_idx ON data_item_tag_values (value);
```
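To illustrate how the hashed tag tables fit together, a tag lookup joins the dictionary tables back to `data_item_tags` and `data_items` through the hash columns, roughly like this. The specific tag name/value pair is only an example, and the schema above does not specify which hash function the ingest code uses:

```sql
-- Find data items carrying a given tag name/value pair. Tag names and
-- values are stored once in the dictionary tables and referenced by hash,
-- so the query resolves the strings to hashes via joins.
SELECT di.id, di.owner_address, di.content_type
FROM data_item_tags t
JOIN data_item_tag_names n ON n.name_hash = t.name_hash
JOIN data_item_tag_values v ON v.value_hash = t.value_hash
JOIN data_items di
  ON di.height_partition_key = t.height_partition_key
 AND di.id = t.tx_id
WHERE n.name = 'App-Name'      -- example tag name
  AND v.value = 'ArDrive'      -- example tag value
ORDER BY t.height_partition_key DESC
LIMIT 100;
```

The descending `(name_hash, value_hash, height_partition_key DESC, tx_id)` index supports exactly this shape of query: filter on the hashes, then walk results newest-first.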