Why (Motivation) | What (Action/Method) | How (Justification/Outcome) |
---|---|---|
Concern over privacy threats in vector databases, questioning if original text can be reproduced from embeddings. | Developed Vec2Text, a multi-step method for embedding inversion, aiming to reconstruct input text from embeddings. | Demonstrated that text embeddings reveal substantial information about the original text, recovering 92% of 32-token inputs exactly, thus highlighting privacy concerns. |
Need to understand if embeddings, used widely for efficiency in large language models (LLMs), compromise the privacy of the original text. | Investigation into the capability of embedding inversion to recover full texts, employing controlled generation for accurate reconstruction. | Found that embeddings can indeed compromise privacy by allowing the recovery of sensitive information like full names from clinical notes, necessitating the same privacy measures as for raw data. |
The growing popularity of vector databases for storing dense embeddings, without a comprehensive exploration of their privacy risks. | Examined whether dense text embeddings can be inverted back to the original text, testing the common assumption that inverting neural-network outputs is intractable (see the sketch below). | Showed that, given enough input-output pairs from a network, it is possible to approximate the network's inverse and thereby recover private information embedded within. |
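A minimal sketch of the pair-collection step implied by the last row, assuming only black-box query access to the embedding model (the `embed` callable and the `collect_inversion_pairs` helper below are illustrative placeholders, not the paper's code):

```python
from typing import Callable, List, Tuple

def collect_inversion_pairs(
    texts: List[str],
    embed: Callable[[str], List[float]],
) -> List[Tuple[List[float], str]]:
    """Query a black-box embedder on known texts to build (embedding, text)
    supervision for training an approximate inverse model."""
    pairs = []
    for text in texts:
        e = embed(text)          # one black-box query to the target embedding model
        pairs.append((e, text))  # the inverter learns to map e back to text
    return pairs
```

With enough such pairs, a conditional generation model can be trained to approximate the embedder's inverse, which is the starting point for the iterative refinement described next.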
In addition to the primary optimization objective for recovering text from an embedding, the paper presents a recursive model formulation for Vec2Text that iteratively refines hypotheses about the original text. The key expressions for this model are:
$$ p(x^{(0)} | e) = p(x^{(0)} | e, \emptyset, \phi(\emptyset)) $$
$$ p(x^{(t+1)} | e) = \sum_{x^{(t)}} p(x^{(t)} | e) \cdot p(x^{(t+1)} | e, x^{(t)}, \phi(x^{(t)})) $$
$$ \text{EmbToSeq}(e) = W_2 \cdot \sigma(W_1 \cdot e) $$
These expressions form the theoretical foundation of Vec2Text's approach to recovering the original text from its embedding: generate an initial hypothesis from the embedding alone, apply recursive corrections, and use the EmbToSeq projection to feed the embedding vectors into the sequence model at each refinement step.
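A minimal sketch of how these pieces fit together, in PyTorch, assuming a frozen embedder `phi` and a seq2seq `corrector` with hypothetical `generate_initial` / `generate` methods (these names, the choice of GELU for σ, and the dimensions are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbToSeq(nn.Module):
    """EmbToSeq(e) = W2 · σ(W1 · e): project an embedding into s pseudo-token
    embeddings of width d, so a seq2seq model can attend over it."""
    def __init__(self, emb_dim: int, model_dim: int, num_tokens: int):
        super().__init__()
        self.w1 = nn.Linear(emb_dim, model_dim)
        self.w2 = nn.Linear(model_dim, num_tokens * model_dim)
        self.num_tokens = num_tokens
        self.model_dim = model_dim

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        h = self.w2(F.gelu(self.w1(e)))                     # σ assumed to be GELU here
        return h.view(-1, self.num_tokens, self.model_dim)  # (batch, s, d)

def invert_embedding(e, phi, corrector, emb_to_seq, num_steps: int = 5):
    """Iteratively refine a text hypothesis x_t so that phi(x_t) moves toward e."""
    x_t = corrector.generate_initial(emb_to_seq(e))  # x^(0): generated from e alone
    for _ in range(num_steps):
        e_t = phi(x_t)                               # re-embed the current hypothesis
        # x^(t+1): condition on the target e, the hypothesis x^(t), and its embedding
        x_t = corrector.generate(emb_to_seq(e), x_t, emb_to_seq(e_t))
    return x_t
```

The key design choice is that the corrector sees both the target embedding and the embedding of its own current guess, so each step can work on closing the gap between $\phi(x^{(t)})$ and $e$ rather than decoding from scratch.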