chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[PERF]: remove mutex around tokenizer #2735

Closed. codetheweb closed this 1 month ago

codetheweb commented 1 month ago

Description of changes

Removing the mutex around the tokenizer saves around 1,300 ms when ingesting 15k documents (averaging 1,000 characters each).
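For readers without the diff at hand, here is a minimal sketch of the general pattern, not the code changed in this PR. It assumes the tokenizer's method takes `&self`, so it can be shared across threads without a lock. `WhitespaceTokenizer`, `tokenize_all_locked`, `tokenize_all_shared`, and `run` are hypothetical names used only for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for a tokenizer; NOT Chroma's actual type.
/// `tokenize` takes `&self`, so concurrent calls need no exclusive access.
pub struct WhitespaceTokenizer;

impl WhitespaceTokenizer {
    pub fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_lowercase).collect()
    }
}

/// Before: every call takes the lock, so workers tokenize one at a time.
pub fn tokenize_all_locked(docs: &[String], workers: usize) -> Vec<Vec<String>> {
    let tokenizer = Arc::new(Mutex::new(WhitespaceTokenizer));
    run(docs, workers, move |doc| tokenizer.lock().unwrap().tokenize(doc))
}

/// After: the tokenizer is shared immutably, so workers never wait on a lock.
pub fn tokenize_all_shared(docs: &[String], workers: usize) -> Vec<Vec<String>> {
    let tokenizer = Arc::new(WhitespaceTokenizer);
    run(docs, workers, move |doc| tokenizer.tokenize(doc))
}

/// Splits `docs` into contiguous chunks, tokenizes each chunk on its own
/// scoped thread, and returns the results in the original document order.
fn run<F>(docs: &[String], workers: usize, tokenize: F) -> Vec<Vec<String>>
where
    F: Fn(&str) -> Vec<String> + Send + Sync,
{
    let workers = workers.max(1);
    // `chunks` panics on a size of zero, so clamp to at least 1.
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);
    let tokenize = &tokenize;
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|doc| tokenize(doc.as_str()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the output aligned with `docs`.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().unwrap())
            .collect()
    })
}
```

Both functions return the same tokens; the only difference is whether the worker threads contend on a lock. How much time that saves depends on the real tokenizer and workload.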

We still spend about 1.5x the tokenization time just cloning the tokens. That should be avoidable, but it is not addressed in this PR: the lifetimes get hairy, and tokenization currently accounts for a fairly small share of total compaction time.
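To illustrate the cloning trade-off in isolation, here is a small sketch, again not Chroma's code. It compares owned tokens with tokens borrowed from the input. `tokenize_owned` and `tokenize_borrowed` are hypothetical helpers, and a whitespace split stands in for the real tokenizer.

```rust
/// Owned tokens: each token is copied into its own heap-allocated `String`.
pub fn tokenize_owned(text: &str) -> Vec<String> {
    text.split_whitespace().map(str::to_owned).collect()
}

/// Borrowed tokens: each token is a slice into `text`, so no token bytes are
/// copied, but the returned vector cannot outlive `text`.
pub fn tokenize_borrowed<'a>(text: &'a str) -> Vec<&'a str> {
    text.split_whitespace().collect()
}
```

The borrowed version ties every token to the lifetime `'a` of the input. That lifetime then has to be threaded through every struct and function that holds the tokens, which is one way lifetimes can get hairy in a larger pipeline.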


Test plan

How are these changes tested?

Covered by existing tests.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

n/a

github-actions[bot] commented 1 month ago

Reviewer Checklist

Please use this checklist to ensure your code review is thorough before approving.

Testing, Bugs, Errors, Logs, Documentation