Closed bigattichouse closed 4 months ago
Note: please don't break this with a fix, unless you allow it to be "unfixed" with a command-line option. I think this could be very important for RAG-style self-verification of output, if we can figure out how the mappings work.
That one is normal, and you can do it even outside llama-zip: using something like llama.cpp, just use whitespace or an empty string as the prompt with a fixed seed. On some versions of llama.cpp, I found the model would output part of its training data verbatim.
You never encounter an EOS token, so it goes on endlessly (or until you run out of RAM/VRAM because of the context size).
No worries, I have no plans to attempt to "fix" this. I see this behavior as a necessary consequence of the way compression works with an arithmetic coder. An input like "a" contains very little entropy and does not guide the LLM much, so generations like this are to be expected.
As for finding the "key" to unlock an article, well, that's precisely what the compressor does in llama-zip! It effectively finds the shortest key that generates the given input. Compress a Wikipedia article with llama-zip and you'll get the key you're looking for.
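That narrowing-down process can be sketched with a toy arithmetic coder. Everything below is an illustrative stand-in, not llama-zip's actual code: a fixed three-symbol distribution plays the role of the LLM's next-token probabilities.

```python
from fractions import Fraction

# Fixed toy distribution standing in for the LLM's next-token probabilities.
# "$" is an end-of-text marker so the decoder knows where to stop.
PROBS = {"a": Fraction(8, 10), "b": Fraction(1, 10), "$": Fraction(1, 10)}

def cum_range(sym):
    """Cumulative [low, high) sub-interval assigned to a symbol."""
    lo = Fraction(0)
    for s, p in PROBS.items():
        if s == sym:
            return lo, lo + p
        lo += p
    raise KeyError(sym)

def encode(text):
    """Narrow [0, 1) symbol by symbol, then return the shortest bit
    string (the 'key') whose value lands inside the final interval."""
    lo, hi = Fraction(0), Fraction(1)
    for sym in text + "$":
        s_lo, s_hi = cum_range(sym)
        width = hi - lo
        lo, hi = lo + width * s_lo, lo + width * s_hi
    bits, v, p = "", Fraction(0), Fraction(1, 2)
    while v < lo:  # greedily add bits until v enters [lo, hi)
        if v + p < hi:
            v += p
            bits += "1"
        else:
            bits += "0"
        p /= 2
    return bits

def decode(bits):
    """Replay the same narrowing process to recover the text from the key."""
    v = sum(Fraction(1, 2 ** (i + 1)) for i, b in enumerate(bits) if b == "1")
    lo, hi, out = Fraction(0), Fraction(1), ""
    while True:
        width = hi - lo
        for sym in PROBS:
            s_lo, s_hi = cum_range(sym)
            if lo + width * s_lo <= v < lo + width * s_hi:
                if sym == "$":
                    return out
                out += sym
                lo, hi = lo + width * s_lo, lo + width * s_hi
                break
```

The point: highly probable symbols barely shrink the interval, so predictable text gets a short key. With a strong LLM supplying the probabilities instead of `PROBS`, text the model finds very predictable (such as memorized material) compresses the same way.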
I'm an idiot. It was hallucinating an article... but I suppose you could still key to an existing article in training:
After training, with access to the source data, you could create summaries and embeddings from the source articles and attach them to the zip key. This would let you look the original article back up based on context in the current conversation. Not exactly ideal, but the idea would hold.
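That lookup side could be sketched like this. The embedding and index below are toy stand-ins (a real system would pair a proper sentence-embedding model with the actual llama-zip keys):

```python
import math
from collections import Counter

index = []  # (summary embedding, llama-zip key) pairs built from the source data

def embed(text):
    # Toy bag-of-words embedding; a real setup would use a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def add_article(summary, zip_key):
    index.append((embed(summary), zip_key))

def lookup(context):
    # Return the stored key whose summary best matches the current conversation.
    return max(index, key=lambda pair: cosine(embed(context), pair[0]))[1]
```

Decompressing the returned key with llama-zip would then regenerate the full article, with no external document store beyond the (embedding, key) index itself.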
So, now for experimentation: Do articles that were likely used in training have shorter keys than arbitrary text?
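One way to run that experiment: compare key length per character across texts. The sketch below uses zlib purely as a stand-in compressor (an assumption for illustration); the real test would measure the length of llama-zip's compressed output, where likely-memorized articles should behave like the highly predictable string here.

```python
import zlib

def bits_per_char(text, compress=lambda s: zlib.compress(s.encode(), 9)):
    # Key length in bits divided by input length. For the real experiment,
    # swap in llama-zip's model-based compressor for `compress`.
    return 8 * len(compress(text)) / len(text)

# Highly predictable text should yield a much shorter key per character
# than arbitrary text the compressor has never seen patterns for.
predictable = "the quick brown fox jumps over the lazy dog " * 20
arbitrary = "Qx7#kz Jv2@pm Wd9$rt Bn4%hs Lc8^fg Ty3&qw Zm6*xe Ka1!uv"
```

If training-set articles consistently score lower bits-per-character under the model-based compressor than matched arbitrary text, that would support the memorization hypothesis.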
This isn't necessarily a problem, but it might prove an interesting way to have the model dump/recover large portions of its training data. Putting in arbitrary values:
> llama-zip ../../gguf/Meta-Llama-3-8B-Instruct-Q8_0.gguf -d "a"
(and also decompressing with the first letter removed) seems to let you pull entire passages of apparent training data out of the LLM. It will just repeat what looks like an entire source article.
I see this as a feature. It would be great to try and figure out what the "key" is to unlock an article, as it would be an amazing "wikipedia on disk" sort of thing... or even a modified version of RAG without actually requiring an external database.
How do I get back from the Sam Smith output to "a" and query it? That could make hallucinations a thing of the past, if the model could recover entire memories from within itself.
See the output below: