Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

BeyondFederated - truly decentralised learning at the edge #7254

Closed synctext closed 3 months ago

synctext commented 1 year ago

Started full-time thesis around April/May 2023.

Track DST, Q3/4 start. Still has the "seminar course" ToDo. Has superapp/MusicDAO experience. Discussed topics as diverse as the digital Euro and a Web3 search engine (unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance). Completed the Machine Learning I class. (Background: Samsung solution, ONE (On-device Neural Engine): a high-performance, on-device neural network inference framework.)

Recommendation or semantic search? Alternative direction. Some overlap with the G-Rank follow-up project. Essential problem to solve: learning valid Creative Commons BitTorrent swarms.

from torch import nn
from d2l import torch as d2l  # Dive into Deep Learning helper library

class Seq2SeqEncoder(d2l.Encoder):
    """The RNN encoder for sequence to sequence learning."""
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers,
                 dropout=0, **kwargs):
        super(Seq2SeqEncoder, self).__init__(**kwargs)
        # Embedding layer: maps token ids to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, num_layers,
                          dropout=dropout)

    def forward(self, X, *args):
        # After embedding, `X` shape: (`batch_size`, `num_steps`, `embed_size`)
        X = self.embedding(X)
        # In RNN models, the first axis corresponds to time steps
        X = X.permute(1, 0, 2)
        # When state is not given, it defaults to zeros
        output, state = self.rnn(X)
        # `output` shape: (`num_steps`, `batch_size`, `num_hiddens`)
        # `state` shape: (`num_layers`, `batch_size`, `num_hiddens`)
        return output, state

Second sprint (strictly exploratory):

Taking the Information Retrieval MSc course to prepare for this thesis.

Literature survey initial idea: "nobody is doing autonomous AI" {unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance}.

synctext commented 1 year ago

ToDo: register https://mare.ewi.tudelft.nl/project

Latest work by TUDelft: MoDeST: Bridging the Gap between Federated and Decentralized Learning with Decentralized Sampling

quintene commented 1 year ago

To create a suggestion model with neural hashes using metadata as input to find songs in Creative Commons BitTorrent swarms:

  1. Collect (Scrape) a dataset of songs and their corresponding metadata from Creative Commons BitTorrent swarms.
  2. Use the neural network to generate neural hashes for each song in the dataset. These neural hashes would represent each song as a high-dimensional vector that captures its features and characteristics.
  3. DOING: Research into the distribution of hashes; possible directions:
     3.a Each node would have a copy of the neural network and the neural hashes for some subset of the songs. The distribution of songs could be based on criteria such as proximity or similarity of neural hashes.
     3.b Use of a multi-index hashing scheme.
  4. When a user wants to search for songs based on a given input genre, the query would be broadcasted to all nodes. Each node would then perform a nearest neighbor search on its own subset of the neural hashes to find the songs that are most similar to the input metadata.
  5. The results from each node would be collected and combined to generate a list of suggested songs. This could be done by taking the top N results from each node, and then combining them based on their relevance or popularity.

Optionally, improve the model over time: track which songs are actually downloaded or listened to by users, and use this data to train the model to improve its suggestions.
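A minimal sketch of steps 2, 4 and 5 above, assuming unit-normalised neural hash vectors; all function names are illustrative, not thesis code:

import heapq
import numpy as np

def local_top_n(query_vec, node_hashes, node_song_ids, n=5):
    # Step 4: nearest-neighbour search on this node's own subset of
    # neural hashes (dot product == cosine for unit-normalised vectors).
    scores = node_hashes @ query_vec
    best = np.argsort(-scores)[:n]
    return [(float(scores[i]), node_song_ids[i]) for i in best]

def merge_suggestions(per_node_results, n=5):
    # Step 5: combine the top-N lists collected from all nodes into
    # one global suggestion list, ranked by similarity score.
    return heapq.nlargest(n, (hit for result in per_node_results for hit in result))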

synctext commented 1 year ago

Proposal: a dedicated sprint for implementing a basic search engine.

quintene commented 1 year ago

https://colab.research.google.com/drive/1j_voFtr6j0gEStsMfcafi9FV5XJOLxjj?usp=sharing

1) Scraping metadata: [ "Cullah Firebird electronic folk soul", "Serious Mastering Ego electronic", "Serious Mastering La chaleur du soleil electronic", "Oxidant Deconstruct hardcore.punk powerviolence punk", ... ]

2) Translate into embeddings

3) Compare embeddings using cosine similarity:

query: ['Firebird']
similarity score: 0.6434079439455619 Cullah Firebird electronicfolksoul

query: ['electronic']
similarity score: 0.4482832368649311 Serious Mastering Ego electronic
similarity score: 0.3406708597897247 Serious Mastering La chaleur du soleil electronic
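A small sketch of steps 2 and 3; the Colab may use a different encoder, so the model name here is an assumption:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Encode scraped metadata strings into embeddings and rank them
# against a query by cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
corpus = ["Cullah Firebird electronic folk soul",
          "Serious Mastering Ego electronic",
          "Serious Mastering La chaleur du soleil electronic",
          "Oxidant Deconstruct hardcore.punk powerviolence punk"]
corpus_emb = model.encode(corpus)
query_emb = model.encode(["Firebird"])
for text, score in sorted(zip(corpus, cosine_similarity(query_emb, corpus_emb)[0]),
                          key=lambda pair: -pair[1]):
    print(f"{score:.4f}  {text}")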

synctext commented 1 year ago
quintene commented 1 year ago

APK including:

wetransfer link (118mb) https://we.tl/t-pnugzyNiRV

synctext commented 1 year ago

Question: how impressed/intimidated/confused are you about the recent ML/LLM/Diffusion explosion?
@quintene answer: innovation speed is fast/sophisticated because everybody is building on top of each other.
Johan note: What does a leaked Google memo reveal about the future of AI?
Question: how to identify and follow a long-enduring winner? 1) Alpaca on Pixel 7, or 2) MLC Android, or 3) https://github.com/BlinkDL/RWKV-LM
@quintene answer: Nobody has solved the magic architecture of decentralised learning! Personalised model, how to partition, can we re-use the "decentralisation layer" across the whole ML domain? Current limited approach: one dataset, one application. "Dynamic distributed learning." Johan note, goal: non-i.i.d.?

quintene commented 1 year ago

Have not found resources that do not rely on a central server; does one exist?

https://we.tl/t-0ffNeOjjJO

{ "artist": "Cullah", "title": "Firebird", "author_image": "https://images.pandacontent.com/artist/12/250x250/2-cullah.jpeg?ts=1675708015", "author_description": "MC Cullah is a producer/singer/songwriter/rapper from Milwaukee, Wisconsin. His music is lost somewhere in between Rock -n- Roll, Electronica and Hip Hop with a pinch of psychedelic melodies. With an arsenal of synthesizers and a library of forgotten sounds he manages to create something that sparks imagination and wonder.", "author_upcoming": [ { "context": "https://schema.org", "type": "MusicEvent", "startDate": "2023-06-15T00:00:00+00:00", "offers": "https://www.songkick.com/concerts/41175136-cullah-at-radio-milwaukee-889-fm", "name": "Radio Milwaukee 88.9 FM", "location": { "type": "PostalAddress", "addressLocality": "Milwaukee, WI, US" } ], "year": "2022", "tags": [ "electronic", "folk", "soul" ], "artwork": "https://images.pandacontent.com/release/779/250x250/1-firebird.jpeg?ts=1675708399", "magnet": "magnet:?xt=urn:btih:O2NCAP26N63U7VK6LSCXNVR3VV3ODILA&tr=udp%3A//tracker.pandacd.io%3A2710&dn=Cullah%20-%20Firebird%20%282022%29%20-%20MP3", "songs": [ "The Feather", "Firebird Credits", "The Golden Apple", "The Anima", "The King" ] },

synctext commented 1 year ago
quintene commented 1 year ago

thesisproblem_iteration_qvaneijs.pdf

synctext commented 1 year ago

Ideal sprint outcome for 15 Aug: an operational PeerAI .APK with a minimal viable "TFLite Searcher model" with PandaCD and FMA. Focus: genre similarity. Adding new items, fine-tuning, and exchanging models are still out of scope. Let's get this operational first!

quintene commented 1 year ago

Modelling "TFLite Searcher model" using subset of dataset using only: Title, Artist, Genre/Tag, Album.

Training goal; creating vectors with similar atrributes having smaller distances.

Porting everything into Kotlin.
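One common way to realise that training goal is a triplet objective; a hedged sketch (the actual loss used here is not specified in this thread):

import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive share attributes (e.g. same genre); negative does not.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Pull same-attribute vectors together until they are at least
    # `margin` closer than different-attribute vectors.
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))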

synctext commented 1 year ago
synctext commented 1 year ago

Please ensure you cite this work in your thesis: AI Benchmark: All About Deep Learning on Smartphones in 2019. Website of the ETH-Z on-device AI benchmarking, includes S23 results.

UPDATE: YouTube contains more content than FMA and PandaCD. Great datasets exist. See the YouTube player you could connect to your thesis focus of BeyondFederated content search with actual playable content.

Please load this [```URL_Youtube``` into Kaggle and check it out](https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube). 20,230 unique music videos to recommend by 2,079 artists! This would impact your work and disconnect it further from the MusicDAO code. {brainstorm input: any YouTube & magnet playback of both video or music. [116,098 "music video URLs" inside this Youtube-8M dataset](https://research.google.com/youtube8m/explore.html) with annotations from a diverse vocabulary of 3,800+ visual entities for semantic search}
quintene commented 1 year ago

Neural Instant Search for Music and Podcast

Finished the model design, where the final .tflite model will consist of an embedder model and a ScaNN layer. The dataset key + metadata setup would transform a user's query input (title, genre, author) into a vector and search for the closest vectors available in the network.

The model output consists of the closest neighbors, including all the metadata of the dataset. Also exploring replacing the sentence-encoder model with a (song-)object embedding model. Model updates are available on the user's end device; the next goal would be to distribute model changes in fully decentralized FL. https://blog.tensorflow.org/2021/11/on-device-training-in-tensorflow-lite.html

However, currently stuck implementing the TFLite Model Maker library for on-device ML applications while creating the first version of the model. It is only needed to translate the collected datasets into a .tflite model; can't get it running, currently looking for other solutions... https://pypi.org/project/tflite-model-maker/
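For reference, the Model Maker searcher pipeline from the official TFLite tutorial looks roughly like this; file names, column names, and the ScaNN option values below are placeholders:

from tflite_model_maker import searcher

# Embedder (a TFLite Universal Sentence Encoder) + dataset of song metadata.
data_loader = searcher.TextDataLoader.create(
    "universal_sentence_encoder.tflite", l2_normalize=True)
data_loader.load_from_csv(
    "songs.csv", text_column="title", metadata_column="metadata")

# ScaNN layer: K-Means tree + asymmetric hashing, baked into the model.
scann_options = searcher.ScaNNOptions(
    distance_measure="dot_product",
    tree=searcher.Tree(num_leaves=140, num_leaves_to_search=4),
    score_ah=searcher.ScoreAH(dimensions_per_block=2,
                              anisotropic_quantization_threshold=0.2))
model = searcher.Searcher.create_from_data(data_loader, scann_options)
model.export(export_filename="searcher.tflite", userinfo="",
             export_format=searcher.ExportFormat.TFLITE)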

Goal: a self-learned semantic network with 100k items.

ToDo: mention this research in the paper: https://arxiv.org/abs/1908.10396

synctext commented 1 year ago
quintene commented 1 year ago

TFLite Model Maker: was not able to build tflite_model_maker since a lot of dependencies were conflicting. Resolved with a custom Dockerfile with manual build steps, including other libraries. (Also a repo update after 2 weeks.)

image

Model metadata of the ScaNN layer:

{ "associated_files": [ { "name": "on_device_scann_index.ldb", "description": "On-device Scann Index file with LevelDB format.", "type": "SCANN_INDEX_FILE" } ] }

Key learning decision: determine what exactly is learned by connected clients (a rebuilt custom index vs. ClickLog gradients/recommendations within search, where items are ranked higher based on the ClickLog, a.k.a. popular audio ranked higher).

Decentralized learning ToDo:

synctext commented 1 year ago
quintene commented 1 year ago

image

image

Goal for upcoming days: scale Scalable Nearest Neighbors. Create the first ScaNN indexer that does not need to rebuild a new index based on the whole dataset, using only the ScaNN library running in some Python code. The indexer will:

synctext commented 1 year ago
quintene commented 1 year ago

Realizing this will be a big contribution towards tflite-support.

Fat Android build for multiple architectures: x86, x86_64, arm64-v8a, armeabi-v7a. Successfully builds/compiles target /java/src/java/org/tensorflow/lite/task/text/task-library-text.aar, which includes custom API tasks such as: extend index, convert to buffer and replace in model metadata, plus pack the associated index files.

synctext commented 1 year ago
quintene commented 1 year ago

Extending TFLite Support with custom API calls (On-Device Scann C++).

Currently focussing on training ScaNN: a single-layer K-Means tree is used to partition the database (index), which I am now able to modify. The model is trained on forming partition centroids (as a way to reduce the search space). In the current setup new entries are pushed into the vector space, but determining in which partition they should appear (closest to certain partition centroids) is hard.

Job to be done: rebuilding partitions.

INDEX_FILE E_X: an actual partition including compressed vectors
INDEX_CONFIG: config of embedding dimensions etc.
M_Y: metadata entry

For a dataset of N items there should be around SQRT(N) partitions to optimize performance.

No train method is exposed in the current model setup, so another API call needs to be exposed; either

Non-perfect insert works until approximately 100k items; new embeddings are inserted into the partitions with the closest centroids.

Older nearest-neighbor paper by Google.

Given the On-Device limitation of recreating the whole index including the new partitions and centroids, an interesting research direction is Fast Distributed k-Means with a Small Number of Rounds.

Research question shift towards: "How can the efficiency and effectiveness of SCANN be enhanced through novel strategies for dynamically adding entries, specifically focusing on the adaptive generation of K-Means tree partitions to accommodate evolving datasets while maintaining optimal search performance?"

This research question addresses the challenge of adapting SCANN, a scalable nearest neighbors search algorithm, to handle dynamic datasets. The focus is on developing innovative approaches for adding new entries in a way that optimizes the generation of K-Means tree partitions, ensuring efficient search operations as the dataset evolves.

"Evolving datasets" key in a fully decentralized (On-Device) vector space, no central entity to re-calculate all the necessary partioning/indexing.

ToDo for next sprint: focus on frozen centroids and imperfect inserts. Keep it simple!
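A minimal numpy sketch of the frozen-centroid imperfect insert; names are illustrative, not the on-device implementation:

import numpy as np

def imperfect_insert(embedding, centroids, partitions):
    """Insert a new embedding without retraining the K-Means tree.

    Centroids stay frozen; the vector simply joins the partition with
    the closest centroid, so partitions slowly drift away from their
    centroids as the dataset evolves (the 'imperfect' part).
    """
    nearest = int(np.argmin(np.linalg.norm(centroids - embedding, axis=1)))
    partitions[nearest].append(embedding)
    return nearest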

Also implement the recommendation model; the main objective of this model is to efficiently weed out all candidates that the user is not interested in. In TensorFlow Recommenders, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the titles of the top entries for that user.

For searching, querying the vector space with a given query will retrieve the top-k results. Next, we not only use this data for retrieving top items but also to train our User-Song recommendation model.

We then train with a loss based on: {Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}

import tensorflow as tf
import tensorflow_recommenders as tfrs
from typing import Dict, Text

# Completed into a runnable TFRS model; the class wrapper and the
# song_model assignment are filled in here (the original snippet
# showed only the highlighted lines).
class SongRecommenderModel(tfrs.Model):
    def __init__(self, user_model, song_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.song_model: tf.keras.Model = song_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # Pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # Pick out the song features and pass them into the song model,
        # getting embeddings back.
        querylog_song_embedding = self.song_model(features["Youtube-clicked-title"])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, querylog_song_embedding)
synctext commented 1 year ago

update: idea for experimental results. Show exactly how insert/lookup starts to degrade as you insert 100k or 10 million items. Do clusters become unbalanced, too big, too distorted from their centroid?

quintene commented 11 months ago

Goal:

  1. .TFLite model is initialized in Kotlin.
  2. Model metadata (FlatBuffers) is parsed and exposed through the underlying C++ bindings.
  3. Partitions + index_config.txt + metadata are parsed into LevelDB (C++).
  4. A new item is shared in the decentralized network; the item should be:
     4.a Embedded/vectorized (CHECK)
     4.b AH-quantized (CHECK)
     4.c Matched to the closest centroid (CHECK)
     4.d Added as an embedding into the closest partition's LevelDB key-value entry (STUCK HERE; see the sketch after this list)
     4.e Metadata added into a key-value entry
  5. Overwrite (or append) metadata model -> FlatBuffers.
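A hedged Python sketch of step 4.d using plyvel (a LevelDB binding); the E_X key layout mirrors the INDEX_FILE naming above but is an assumption about the real on-device format:

import numpy as np
import plyvel

db = plyvel.DB("on_device_scann_index.ldb", create_if_missing=True)

def append_to_partition(partition_id, quantized_vector):
    # Hypothetical key layout: one LevelDB entry per partition, E_<id>.
    key = f"E_{partition_id}".encode()
    blob = db.get(key) or b""
    # Step 4.d: append the AH-quantized embedding bytes to the partition.
    db.put(key, blob + np.asarray(quantized_vector, dtype=np.uint8).tobytes())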

Slowly progressing due to complexity: it is not just appending a new item to a partition array, plus C++ and a tough development environment. For now, focus the last development sprint on indexing new embeddings; otherwise come up with other alternatives.

YouTube: iterate through the music category. Analysis of a dataset of millions of songs (150 MB? -> device-ready!) https://developers.google.com/youtube/v3/docs

synctext commented 11 months ago
quintene commented 10 months ago

Target of past weeks (including some time off on holiday): non-perfect insert.

Only 2.4 MB for a 20K-item trained cluster config!! Seeing valuable possibilities here, such as sharing configs with peers: dynamic/sharable vector spaces in a distributed context. Self-learned, or also keep sharing configs.

image

- [x] Searching still works under the new custom-built library.

ezgif-5-3136367c23

- [x] Gossip of new items/ClickLog is also possible.

https://we.tl/t-hIE6pLWXDU

Different encoder layers are possible within the On-Device model; the current implementation uses embeddings based on the Universal Sentence Encoder.

Meaning: encodings are matched based on semantics, not typos!
"Red Red Wine" will return UB40 - Red Red Wine.
"Red Red Wyne" will not return UB40 - Red Red Wine.
But "Blue Wine" will return UB40 - Red Red Wine.
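This behaviour can be reproduced with the public Universal Sentence Encoder from TF-Hub; a sketch (exact scores depend on the model version):

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
queries = ["Red Red Wine", "Red Red Wyne", "Blue Wine"]
catalog = embed(["UB40 - Red Red Wine"]).numpy()
q = embed(queries).numpy()
# Cosine similarity of each query against the catalogue entry:
sims = (q @ catalog.T)[:, 0] / (np.linalg.norm(q, axis=1) * np.linalg.norm(catalog))
for text, s in zip(queries, sims):
    print(f"{text!r}: {s:.3f}")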

synctext commented 10 months ago
quintene commented 10 months ago

Potential extended gossip design: JSON gossip replaced by a gossiped C++ vector/embedding?? {Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature} -> std::vector
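A rough Python sketch of that idea: the current JSON ClickLog record versus a packed float vector that a C++ peer could read straight into a std::vector<float>; all field values are placeholders:

import json
import struct

# Current design: the ClickLog record is gossiped as JSON text.
record = {"Query": "red red wine",
          "Youtube-clicked-views": 1200000,
          "date": "2024-01-01"}  # remaining fields omitted for brevity
json_bytes = json.dumps(record).encode()

# Proposed design: gossip the query embedding as packed floats,
# directly usable for on-device vector search without parsing.
embedding = [0.12, -0.53, 0.91]  # placeholder query embedding
packed = struct.pack(f"{len(embedding)}f", *embedding)
print(f"JSON: {len(json_bytes)} bytes, packed embedding: {len(packed)} bytes")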

Experiment on a large TikTok dataset -> https://developers.tiktok.com/products/research-api/

synctext commented 10 months ago
| Date | YouTube new-video upload rate |
| --- | --- |
| January 2009 | 15 hours of video / min |
| 2019 | 500 hours / min |
quintene commented 9 months ago

Results of top 5 items:

id: 1255 with distance: -2.3841
id: 17372 with distance: -0.777172
id: 1077 with distance: -0.761045
id: 1078 with distance: -0.748582
id: 7886 with distance: -0.740518
synctext commented 9 months ago

update: fun fact, DeepMind also uses the library you use :smile: Improving language models by retrieving from trillions of tokens

quintene commented 8 months ago

image

3. Working custom index with new data inserted in the closest partition!

image

storyline.graduation.pdf

synctext commented 8 months ago
quintene commented 7 months ago
synctext commented 7 months ago
quintene commented 7 months ago

image

image Results fluctuate a lot between independent runs.

main.pdf

image

synctext commented 7 months ago
quintene commented 6 months ago

2,063,066 Dataset (ezgif-7-f6743efe39)

Search term -> Query:

synctext commented 6 months ago
quintene commented 6 months ago

main.pdf

https://research.google/blog/soar-new-algorithms-for-even-faster-vector-search-with-scann/

synctext commented 6 months ago
quintene commented 5 months ago

main.pdf

TODO:
- [ ] Figures still need fixed positions (floating hell)
- [ ] Streamline terms used: Entries = New Songs; don't call the test dataset for insert/remove "fantasy" songs
- [ ] Replace the default ScaNN picture with one showing how a song -> embedding -> quantization -> hashing -> closest neighbor
- [ ] Conclusion -> merge all results together; potentially ScaNN, but not with non-perfect insert
- [ ] Abstract
- [ ] Small section about future work, not too heavy
- [ ] The network experiment needs a bit more context on what the results show for BeyondFederated

synctext commented 5 months ago

update:

comments:

quintene commented 5 months ago

Presentation_structure.pptx