Open-Book-Genome-Project / sequencer

A toolchain of tasks for sequencing and fingerprinting book fulltext
https://bookgenomeproject.org

Classification using vector embeddings #96

Open finnless opened 1 year ago

High-level approach: add a module that creates sentence embeddings for every book. This could enable semantic search, clustering, recommendations, anomaly detection, diversity measurement, and classification using a distance function, and it could be a first step toward a “talk to books” or “talk to library” feature.
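
To make "classification using a distance function" concrete, here is a minimal sketch of nearest-centroid classification with cosine similarity. It is only an illustration, not a proposed implementation: the three-dimensional vectors are hand-written stand-ins for real sentence embeddings (which would come from an embedding model and have hundreds of dimensions), and the labels and the function names `cosine_similarity`, `centroid`, and `classify` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify(embedding, centroids):
    """Return the label whose centroid is most similar to the embedding."""
    return max(
        centroids,
        key=lambda label: cosine_similarity(embedding, centroids[label]),
    )


# Toy data (assumption): two labeled books per category, 3-d "embeddings".
labeled = {
    "cooking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "astronomy": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
centroids = {label: centroid(vecs) for label, vecs in labeled.items()}

# Classify an unlabeled book by its (toy) embedding.
predicted = classify([0.7, 0.3, 0.1], centroids)
print(predicted)
```

With one centroid per category, classifying a new book costs one similarity computation per category rather than one per book in the collection.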

Disadvantage: distance computations over high-dimensional embeddings can be computationally expensive, especially for large-scale book datasets.
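
As a rough back-of-the-envelope sketch of that cost, the snippet below counts the work for a brute-force comparison of every pair of book embeddings. The collection size and the embedding dimension are made-up figures for illustration, not measurements of any real dataset or model, and `brute_force_cost` is a hypothetical helper.

```python
def brute_force_cost(num_books, dim):
    """Return (pairs, multiply_adds) for comparing every pair of embeddings.

    Assumes one embedding per book and one multiply-add per dimension
    for each dot-product-style distance computation.
    """
    pairs = num_books * (num_books - 1) // 2
    return pairs, pairs * dim


# Hypothetical figures: 1,000,000 books, 384-dimensional embeddings.
pairs, multiply_adds = brute_force_cost(1_000_000, 384)
print(f"{pairs:,} pairs, {multiply_adds:,} multiply-adds")
```

The number of pairs grows quadratically with the number of books, so doubling the collection roughly quadruples the all-pairs cost.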