Open p3nGu1nZz opened 6 months ago
i want to follow along with this but dont know how much i can help ^^
You could make a simple Python script that tokenizes a string of words (using Hugging Face `transformers`), no more than 1000 characters, and track how long the tokenization takes as accurately as possible.
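A minimal timing harness for that idea might look like the sketch below. The whitespace `tokenize` function is a hypothetical stand-in; with `transformers` installed you would swap in something like `AutoTokenizer.from_pretrained("bert-base-uncased").encode(text)` (model name is just an example):

```python
import time


def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer; replace with a Hugging Face tokenizer call.
    return text.split()


def benchmark(text: str, runs: int = 1000) -> float:
    """Average seconds per tokenization of `text`, over `runs` repetitions."""
    assert len(text) <= 1000, "keep the input under 1000 characters"
    start = time.perf_counter()
    for _ in range(runs):
        tokenize(text)
    elapsed = time.perf_counter() - start
    return elapsed / runs


sample = "the quick brown fox jumps over the lazy dog " * 20  # ~880 chars
avg = benchmark(sample)
print(f"avg time per run: {avg * 1e6:.2f} microseconds")
```

Averaging over many runs and using `time.perf_counter()` (a monotonic, high-resolution clock) keeps the measurement reasonably accurate for sub-millisecond operations.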
Benchmarking Lexer Against Hugging Face Transformer
Objective:
To evaluate the performance and effectiveness of our custom Lexer against the Hugging Face `transformers` tokenizer, we will create a benchmark that measures speed, memory usage, and the quality of context representation.
Tasks:
- Install the `transformers` library and dependencies.

Acceptance Criteria:
This ticket will guide the development of a comprehensive benchmarking suite that will inform our decision-making process regarding text processing tools within our project.
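For the memory-usage portion of the benchmark, a starting point could be the standard library's `tracemalloc`. This is a sketch only; the `tokenize` function below is a hypothetical stand-in for either the custom Lexer or the Hugging Face tokenizer under test:

```python
import tracemalloc


def tokenize(text: str) -> list[str]:
    # Stand-in; swap in the custom Lexer or a Hugging Face tokenizer.
    return text.split()


def peak_memory_bytes(text: str) -> int:
    """Peak Python-level memory allocated while tokenizing `text`."""
    tracemalloc.start()
    tokenize(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


print(peak_memory_bytes("the quick brown fox " * 40), "bytes at peak")
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so a tokenizer backed by native code (as the fast Hugging Face tokenizers are) would need an OS-level measurement such as resident set size instead.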