
Text Embeddings Reveal (Almost) As Much As Text #56


YeonwooSung commented 9 months ago

paper, code

Abstract

How much private information do text embeddings reveal about the original text? We investigate the problem of embedding *inversion*, reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes.
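The core idea (Vec2Text) frames inversion as iterative correction: re-embed the current hypothesis text and condition a correction model on the target embedding, the hypothesis, and its embedding. Below is a minimal sketch of that loop, where `embed` and `correct` are hypothetical placeholders for the target embedding model and a trained corrector; it illustrates the control flow, not the paper's actual code.

```python
import numpy as np

def invert_embedding(target_emb, embed, correct, init_text="", steps=10):
    """Iteratively refine a hypothesis text so that its embedding
    approaches target_emb. `embed` and `correct` are assumed callables."""
    text = init_text
    best_text, best_dist = text, float("inf")
    for _ in range(steps):
        emb = embed(text)                        # re-embed the current hypothesis
        dist = np.linalg.norm(emb - target_emb)  # distance to the fixed target point
        if dist < best_dist:                     # keep the closest hypothesis so far
            best_text, best_dist = text, dist
        # propose a corrected hypothesis conditioned on the target embedding,
        # the current hypothesis text, and its embedding
        text = correct(target_emb, text, emb)
    return best_text
```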

Personal Thoughts

Reconstructing the original text from its embedding could be considered an AI-based vulnerability, one that causes unintended privacy leakage.

*(Chart from the paper: retrieval performance and Vec2Text recovery rate as a function of the Gaussian noise level added to embeddings.)*

The paper states that the recovery rate of Vec2Text can be decreased by adding Gaussian noise directly to each embedding.
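A minimal sketch of that defense, assuming embeddings are NumPy arrays; the re-normalization to the unit sphere is my assumption (common for retrieval embeddings), not necessarily what the paper does.

```python
import numpy as np

def add_gaussian_noise(embeddings: np.ndarray, noise_scale: float) -> np.ndarray:
    """Perturb each embedding with isotropic Gaussian noise of standard
    deviation `noise_scale`, then re-normalize to unit length (assumption)."""
    noisy = embeddings + noise_scale * np.random.randn(*embeddings.shape)
    return noisy / np.linalg.norm(noisy, axis=-1, keepdims=True)
```

The single `noise_scale` knob is what the chart sweeps: larger values hurt both inversion and retrieval, so the goal is to find the level where only inversion degrades.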

As the chart above shows, there is a noise level that maximizes the gap between the Vec2Text recovery rate and retrieval performance: retrieval performance is largely preserved while the recovery rate drops drastically.

It reminds me that adding the "proper" amount of noise to embeddings can actually improve AI-based systems!