ToDo: register https://mare.ewi.tudelft.nl/project
Latest work by TUDelft: MoDeST: Bridging the Gap between Federated and Decentralized Learning with Decentralized Sampling
To create a suggestion model with neural hashes using metadata as input to find songs in Creative Commons BitTorrent swarms:
Optionally, improve the model over time: track which songs are actually downloaded or listened to by users, and use this data to train the model to improve its suggestions.
Proposal: a dedicated sprint to implementing a basic search engine.
https://colab.research.google.com/drive/1j_voFtr6j0gEStsMfcafi9FV5XJOLxjj?usp=sharing
1) Scraping metadata:
```json
[
  "Cullah Firebird electronic folk soul",
  "Serious Mastering Ego electronic",
  "Serious Mastering La chaleur du soleil electronic",
  "Oxidant Deconstruct hardcore.punk powerviolence punk",
  ...
]
```
2) Translate into embeddings.
3) Compare embeddings using cosine similarity:
```
query: ['Firebird']
similarity score: 0.6434079439455619  Cullah Firebird electronicfolksoul

query: ['electronic']
similarity score: 0.4482832368649311  Serious Mastering Ego electronic
similarity score: 0.3406708597897247  Serious Mastering La chaleur du soleil electronic
```
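For reference, a minimal sketch of steps 2 and 3, assuming the Universal Sentence Encoder from TF Hub as the embedding model (the Colab may use a different encoder):

```python
import numpy as np
import tensorflow_hub as hub

# Assumption: Universal Sentence Encoder as the embedding model.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

corpus = [
    "Cullah Firebird electronic folk soul",
    "Serious Mastering Ego electronic",
    "Serious Mastering La chaleur du soleil electronic",
]
corpus_vecs = encoder(corpus).numpy()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the corpus against a query, highest similarity first.
query_vec = encoder(["Firebird"]).numpy()[0]
ranked = sorted(zip(corpus, corpus_vecs),
                key=lambda pair: -cosine_similarity(query_vec, pair[1]))
for text, vec in ranked:
    print(f"similarity score: {cosine_similarity(query_vec, vec)}  {text}")
```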
sample.json
APK including: WeTransfer link (118 MB) https://we.tl/t-pnugzyNiRV
Question: how impressed/intimidated/confused are you by the recent ML/LLM/Diffusion explosion?
@quintene answer: innovation speed is fast/sophisticated due to everybody building on top of each other.
Johan note: What does a leaked Google memo reveal about the future of AI?
Question: how to identify and follow a long-enduring winner? 1) Alpaca on Pixel7, or 2) MLC Android or 3) https://github.com/BlinkDL/RWKV-LM
@quintene answer: Nobody has solved the magic architecture of decentralised learning! Personalised models, how to partition, can we re-use the "decentralisation layer" across the whole ML domain? Current limited approach: one dataset, one application. "Dynamic distributed learning". Johan note: goal: non-i.i.d.?
`electronics` != `electronics `: a single trailing space leads to duplicates. "Vectorization of scraped PandaCD." Have not found related work that does without a central server; does any exist?
Related work:
Cleaning/extending dataset: https://github.com/quintene/trustchain-superapp/blob/master/peerai/src/main/assets/scraped_data_02.json
Adding improved vectorization of albums, including songs, metadata, author, and images; improved search.
{ "artist": "Cullah", "title": "Firebird", "author_image": "https://images.pandacontent.com/artist/12/250x250/2-cullah.jpeg?ts=1675708015", "author_description": "MC Cullah is a producer/singer/songwriter/rapper from Milwaukee, Wisconsin. His music is lost somewhere in between Rock -n- Roll, Electronica and Hip Hop with a pinch of psychedelic melodies. With an arsenal of synthesizers and a library of forgotten sounds he manages to create something that sparks imagination and wonder.", "author_upcoming": [ { "context": "https://schema.org", "type": "MusicEvent", "startDate": "2023-06-15T00:00:00+00:00", "offers": "https://www.songkick.com/concerts/41175136-cullah-at-radio-milwaukee-889-fm", "name": "Radio Milwaukee 88.9 FM", "location": { "type": "PostalAddress", "addressLocality": "Milwaukee, WI, US" } ], "year": "2022", "tags": [ "electronic", "folk", "soul" ], "artwork": "https://images.pandacontent.com/release/779/250x250/1-firebird.jpeg?ts=1675708399", "magnet": "magnet:?xt=urn:btih:O2NCAP26N63U7VK6LSCXNVR3VV3ODILA&tr=udp%3A//tracker.pandacd.io%3A2710&dn=Cullah%20-%20Firebird%20%282022%29%20-%20MP3", "songs": [ "The Feather", "Firebird Credits", "The Golden Apple", "The Anima", "The King" ] },
```kotlin
private fun pickRandomNodeToSongEdgesToGossip(): List<NodeRecEdge> {
    // ...
}
```
"Peer AI": Refactoring on earlier work "Vectorization from scratch in Kotlin" creating a Searcher model using ScaNN within tensorflow (mobile).
Research goal: Train/share the above model within a P2P environment, considering significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers.
Research into related work (fully federated learning approaches)
Seminar on distributed ML systems: Working on a project applying differential privacy within federated learning where attacks are executed. paper
Writing on Problem Description.
BeyondFederated
Nice title! Developing a music search engine within a peer-to-peer (P2P) network presents significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers. These factors add complexity to the task of building an efficient and reliable music search engine within a decentralized environment.
Suggestion: a whole storyline on "each peer only has a partial view of the network. No central viewpoint exists with the complete overview. This severely impacts the possible solutions. None of the traditional mechanisms are able to function in this leaderless environment. Traditional solutions all assume a client/server or single-ownership entity. We need self-organisation." FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
Ideal sprint outcome for 15 Aug: an operational PeerAI .APK with a minimal viable "TFLite Searcher model" with PandaCD and FMA. Focus: genre similarity. Adding new items, fine-tuning, and exchanging models is still out of scope. Let's get this operational first!
Modelling the "TFLite Searcher model" on a subset of the dataset, using only: Title, Artist, Genre/Tag, Album.
Training goal: creating vectors where items with similar attributes have smaller distances.
Porting everything into Kotlin.
Please ensure to cite this work in your thesis: AI Benchmark: All About Deep Learning on Smartphones in 2019. Website of the ETH-Z on-device AI benchmarking; includes S23 results.
UPDATE: YouTube contains more content than FMA and PandaCD. Great datasets exist. See the YouTube player; you could connect it to your thesis focus of BeyondFederated content search with actual playable content.
Please load this [```URL_Youtube``` into Kaggle and check it out](https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube). 20230 unique music videos to recommend by 2079 artists! This would impact your work and disconnect it more from the MusicDAO code. {Brainstorm input: any YouTube & magnet playback of both video or music. [116098 "music video URLs" inside this Youtube-8M dataset](https://research.google.com/youtube8m/explore.html) with annotations from a diverse vocabulary of 3,800+ visual entities for semantic search.}
Neural Instant Search for Music and Podcast
Finished the model design: the final .tflite model will consist of an embedder model and a ScaNN layer. The Dataset Key + metadata setup transforms a user's query input (title, genre, author) into a vector and searches for the closest vectors available in the network.
The model output consists of the closest neighbors, including all the metadata of the dataset. Also exploring replacing the sentence encoder model with a (song-)object embedding model. Model updates are available on the user's end device; the next goal would be to distribute model changes in fully decentralized FL. https://blog.tensorflow.org/2021/11/on-device-training-in-tensorflow-lite.html
However, currently stuck implementing the TFLite Model Maker library for on-device ML applications while creating the first version of the model. It is only needed to translate the collected datasets into a .tflite model, but I can't get it running; currently looking for other solutions... https://pypi.org/project/tflite-model-maker/
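For the record, the intended flow with the Model Maker Searcher API is roughly the following sketch, based on the library's text-searcher tutorial; the file names and ScaNN parameters here are placeholders, not the final configuration:

```python
from tflite_model_maker import searcher

# Embed each dataset row with an on-device Universal Sentence Encoder.
data = searcher.TextDataLoader.create("universal_sentence_encoder.tflite",
                                      l2_normalize=True)
data.load_from_csv("pandacd_fma.csv", text_column="text",
                   metadata_column="metadata")

# Attach a ScaNN layer: a k-means tree over the embeddings plus
# asymmetric hashing for fast scoring.
scann_options = searcher.ScaNNOptions(
    distance_measure="dot_product",
    tree=searcher.Tree(num_leaves=140, num_leaves_to_search=4),
    score_ah=searcher.ScoreAH(2, anisotropic_quantization_threshold=0.2),
)
model = searcher.Searcher.create_from_data(data, scann_options)
model.export(export_filename="searcher.tflite",
             userinfo="",
             export_format=searcher.ExportFormat.TFLITE)
```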
Goal: Self learned semantic network with 100k items
Todo: mention research in paper: https://arxiv.org/abs/1908.10396
TFLite Model Maker: Not able to build tflite-model-maker since a lot of dependencies were conflicting. Resolved by a custom Dockerfile with manual build steps including other libraries. (Also a repo update since 2 weeks.)
Model image metadata of the ScaNN layer:
```json
{
  "associated_files": [
    {
      "name": "on_device_scann_index.ldb",
      "description": "On-device Scann Index file with LevelDB format.",
      "type": "SCANN_INDEX_FILE"
    }
  ]
}
```
Key decision, learning: determine what exactly is learned by connected clients (a rebuilt custom index vs. clicklog gradients/recommendations, where items within search are ranked higher based on the ClickLog (a.k.a. popular audio ranked higher)).
Decentralized learning todo:
Goal for upcoming days: scale Scalable Nearest Neighbors. Create the first ScaNN indexer that does not need to rebuild a new index based on the whole dataset, using only the ScaNN library running in some Python code. The indexer will:
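As a baseline, a minimal sketch of building and querying a ScaNN index in plain Python with the scann library; all values are illustrative, with num_leaves following the √N heuristic noted further down:

```python
import numpy as np
import scann

# Toy dataset standing in for the song embeddings (20K x 512 floats).
dataset = np.random.rand(20000, 512).astype(np.float32)

# Single-layer k-means tree + asymmetric hashing, as in the TFLite searcher.
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=int(np.sqrt(len(dataset))),  # ~sqrt(N) partitions
          num_leaves_to_search=10,
          training_sample_size=20000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search(dataset[0], final_num_neighbors=5)
print(neighbors, distances)
```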
Realizing this will be a big contribution towards TFLite Support.
Fat Android build for multiple architectures: x86, x86_64, arm64-v8a, armeabi-v7a. Successfully builds/compiles target /java/src/java/org/tensorflow/lite/task/text/task-library-text.aar, which includes custom API tasks such as: extend index, convert to buffer and replace in model metadata, and pack the associated index files.
`{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}`
Extending TFLite Support with custom API calls (On-Device ScaNN C++).
Currently focusing on training ScaNN: a single-layer k-means tree is used to partition the database (index), which I am now able to modify. The model is trained on forming partition centroids (as a way to reduce the search space). In the current setup, new entries are pushed into the vector space, but determining in which partition they should appear (closest to certain partition centroids) is hard.
Job to be done: rebuilding partitions.
- `INDEX_FILE E_X`: an actual partition including compressed vectors
- `INDEX_CONFIG`: config of embedding dimensions etc.
- `M_Y`: metadata entry
For a dataset of N items, there should be around √N partitions to optimize performance.
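In plain Python, that rebuild step could look like the following hedged sketch, using scikit-learn's KMeans as a stand-in for ScaNN's internal partition trainer:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for all embeddings currently in the index.
embeddings = np.random.rand(10000, 512).astype(np.float32)

# Rebuild: re-run k-means over everything, with ~sqrt(N) centroids.
num_partitions = int(np.sqrt(len(embeddings)))
kmeans = KMeans(n_clusters=num_partitions, n_init=10).fit(embeddings)

centroids = kmeans.cluster_centers_   # new partition centroids
assignments = kmeans.labels_          # partition id per embedding
```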
No train method is exposed in the current model setup, so either another API call is needed to expose one, or we use non-perfect insert, where new embeddings are inserted into the closest partition centroid; this works until approximately 100k items.
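A hedged sketch of what such a non-perfect insert boils down to: centroids stay frozen, and a new embedding simply joins its nearest partition:

```python
import numpy as np

def non_perfect_insert(centroids: np.ndarray, embedding: np.ndarray) -> int:
    """Return the partition id of the frozen centroid closest to `embedding`.

    Centroids are never re-trained, so partitions slowly drift away from
    optimal as more items are inserted.
    """
    distances = np.linalg.norm(centroids - embedding, axis=1)
    return int(np.argmin(distances))
```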
Older nearest-neighbor paper by Google.
Given the on-device limitation of recreating the whole index including the new partitions and centroids, an interesting research direction is Fast Distributed k-Means with a Small Number of Rounds.
Research question shift towards: "How can the efficiency and effectiveness of SCANN be enhanced through novel strategies for dynamically adding entries, specifically focusing on the adaptive generation of K-Means tree partitions to accommodate evolving datasets while maintaining optimal search performance?"
This research question addresses the challenge of adapting SCANN, a scalable nearest neighbors search algorithm, to handle dynamic datasets. The focus is on developing innovative approaches for adding new entries in a way that optimizes the generation of K-Means tree partitions, ensuring efficient search operations as the dataset evolves.
"Evolving datasets" key in a fully decentralized (On-Device) vector space, no central entity to re-calculate all the necessary partioning/indexing.
TODO for next sprint: focus on frozen centroids and imperfect inserts. Keep it simple!
Also implement the recommendation model. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. In TensorFlow Recommenders, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the titles of the top entries for that user.
For search, querying the vector space with a given query retrieves the top-k results. Next, we not only use this data for retrieving top items but also to train our user-song recommendation model.
We then train our loss function based on: `{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}`
```python
import tensorflow as tf
import tensorflow_recommenders as tfrs
from typing import Dict, Text

class SongRecommenderModel(tfrs.Model):
    def __init__(self, user_model, song_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.song_model: tf.keras.Model = song_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the clicked-song features and pass them into the song
        # model, getting embeddings back.
        querylog_song_embedding = self.song_model(features["Youtube-clicked-title"])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, querylog_song_embedding)
```
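Hypothetical wiring for the model above, assuming `unique_user_ids` and `unique_titles` vocabularies extracted from the gossiped ClickLog:

```python
# `unique_user_ids` / `unique_titles` are assumed to be string arrays
# built from the ClickLog records.
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
])
song_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_titles, mask_token=None),
    tf.keras.layers.Embedding(len(unique_titles) + 1, 32),
])
model = SongRecommenderModel(user_model, song_model, task=tfrs.tasks.Retrieval())
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
```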
({guorq, sunphil, erikml, qgeng, dsimcha, fchern, sanjivk}@google.com). It's the core of a scientific paper to expand on-device machine learning with unbounded item inserts and dynamic re-clustering :rocket: :wrench: :rocket: So not a 3-week sprint. Update: idea for experimental results: show exactly how insert/lookup starts to degrade as you insert 100k or 10 million items. Do clusters become unbalanced, too big, too distorted from their centroid?
Goal:
Slowly progressing due to complexity: it is not just appending a new item to a partition array, plus C++ and a tough development environment... For now, focus the last development sprint on "indexing new embeddings"; otherwise come up with other alternatives.
YouTube: iterate through the music category. Analysis of a dataset of millions of songs (150 MB? -> device-ready!) https://developers.google.com/youtube/v3/docs
4.d Add embedding into the closest partition's LevelDB key-value entry (STUCK HERE)
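A hedged sketch of what step 4.d is attempting, using the plyvel LevelDB bindings; the `E_<partition>` key layout is an assumption based on the index file naming above:

```python
import numpy as np
import plyvel

db = plyvel.DB("/tmp/on_device_scann_index.ldb", create_if_missing=True)

def append_to_partition(partition_id: int, embedding: np.ndarray) -> None:
    # Hypothetical key layout: one LevelDB value per partition, holding the
    # concatenated float32 bytes of all embeddings in that partition.
    key = f"E_{partition_id}".encode()
    existing = db.get(key) or b""
    db.put(key, existing + embedding.astype(np.float32).tobytes())
```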
ClickLog gossip layer within IPv8 community programming. Target of past weeks (including some time off on holiday): non-perfect insert.
Only 2.4 MB for a 20K-item trained cluster config!! Seeing valuable possibilities here, such as sharing configs with peers?? Dynamic/shareable vector spaces in a distributed context. Self-learned, or also keep sharing configs.
Bazel builds including tests do succeed!
The build delivers custom libraries, currently being integrated in the Super App; it requires custom API calls and is now facing random crashes due to unsupported hardware (emulator only). Current state: debugging on older Android devices.
- [x] Searching still works under the new custom-built library.
- [x] Gossip of new items/ClickLog also possible.
Different encoder layers are possible within the on-device model; the current implementation includes embeddings based on the Universal Sentence Encoder.
This means encodings are placed based on semantics, but not typos:
"Red Red Wine" will result in UB40 - Red Red Wine
"Red Red Wyne" will not result in UB40 - Red Red Wine
But then "Blue Wine" will result in UB40 - Red Red Wine
A solid milestone to improve upon. Potential extended gossip design: JSON gossip replaced by a gossiped C++ vector/embedding??
`{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}` -> `std::vector`
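A quick sketch of the size argument for gossiping raw embedding bytes instead of JSON (the 512-float dimension is an assumption):

```python
import numpy as np

embedding = np.random.rand(512).astype(np.float32)
payload = embedding.tobytes()                 # 512 * 4 = 2048 bytes on the wire
restored = np.frombuffer(payload, dtype=np.float32)
assert np.array_equal(embedding, restored)
```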
Experiment on a large TikTok dataset -> https://developers.tiktok.com/products/research-api/
Date | Youtube new videos upload rate |
---|---|
January 2009 | 15 hours of video / min |
2019 | 500 hours / min |
```
output(j, i): -0.46781   i: 172 and w/offset 1249
output(j, i): -0.671608  i: 173 and w/offset 1250
output(j, i): -0.711928  i: 174 and w/offset 1251
output(j, i): -0.601231  i: 175 and w/offset 1252
output(j, i): -0.533054  i: 176 and w/offset 1253
output(j, i): -0.644484  i: 177 and w/offset 1254
output(j, i): -2.3841    i: 178 and w/offset 1255
```
Results of top 5 items:
```
id: 1255  with distance: -2.3841
id: 17372 with distance: -0.777172
id: 1077  with distance: -0.761045
id: 1078  with distance: -0.748582
id: 7886  with distance: -0.740518
```
update: fun fact, DeepMind also uses the library you use :smile: Improving language models by retrieving from trillions of tokens
3. Working custom index with new data inserted into the closest partition!
Working towards: setup for thesis experiments. Setting up performance tests.
TODO: WRITING! Much uncertainty about the experiments resulted in a confusing direction towards the conclusion, and thus in the overall structure of the paper in progress.
GOAL: aiming for delivery in 6 weeks (26 April), including:
A new large dataset for the pretrained model (the 8M dataset has no title/author, only labels per timecode, which is not required, and it is also way too big). Therefore: YouTube-Commons, a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.
The collection comprises 15,112,121 original and automatically translated transcripts from 2,063,066 videos (411,432 individual channels). 15M video transcripts could be indexed inside the model, only the Universal Sentence Encoder is more effective on English texts.
Trained a model which includes 2M unique videos, but we could create a collection of 15M with the same videos (translated transcripts).
It will cost around a full day to train a model that big: running inference with the universal_sentence_encoder to create 2M embeddings is not that fast on CPU only, looping over a 500 MB csv. Currently still waiting for some new pre-trained models to finish, to analyze performance and wrap up the experiments section including all graphs.
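Instead of a per-row loop over the 500 MB csv, batching the encoder helps; a hedged sketch, where the file path and `title` column name are assumptions:

```python
import csv
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed_csv(path: str, batch_size: int = 1024) -> np.ndarray:
    # Encode the text column in batches rather than one row at a time.
    chunks, batch = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            batch.append(row["title"])
            if len(batch) == batch_size:
                chunks.append(encoder(batch).numpy())
                batch = []
    if batch:
        chunks.append(encoder(batch).numpy())
    return np.vstack(chunks)
```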
Interesting result during the overflow experiment (using the same embedding/metadata to overflow a specific partition): growing the same bucket from 156 items to 300K (identical inserts) only raises CPU time from ±445 ms to ±1100 ms (about 2.5x), which is still quite fast. Model size increased from 4 MB to 8 MB; 14 MB for 700K items in a single partition, all on CPU (11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz). 2,630,718 items = 37 MB. Experiment: compare the 2M dataset against the self "inefficient insert".
Results fluctuate a lot across independent runs.
Working on the experiment of the 2M pretrained model versus the NPI (non-perfect insert) model, in terms of size and speed. Accuracy is a tough nut to crack to evaluate; move in the direction of ANN-Benchmarks metrics such as queries per second vs. recall?
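For the accuracy side, the usual ANN-Benchmarks-style metric is recall@k against brute-force ground truth; a minimal sketch:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, true_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the ANN search also returned.

    `approx_ids` and `true_ids` are (num_queries, >=k) arrays of neighbor ids,
    with `true_ids` computed by exact brute-force search.
    """
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids))
    return hits / (k * len(true_ids))
```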
Next steps also include Gossip/clicklog experiment.
Crash fix for Android API 34+ solved: APK https://we.tl/t-A4rw7naT1a
Meanwhile, focus is on writing, but I am a bit confused about the results: a blown-up partition still results in decent performance, therefore the current status is a bit chaotic and raw.
Nothing more is needed. 2,063,066 dataset
Search term -> Query:
This band also has 10 YouTube videos learned. The queries include
Please turn this bullet list into a table. TODO:
- [ ] Figures still need fixed positions (floating HELL)
- [ ] Streamline terms used: Entries = New Songs; don't call the test dataset for insert/remove "fantasy" songs..
- [ ] Replace the default ScaNN picture by how song -> embedding -> quantization -> hashing -> closest neighbor.
- [ ] Conclusion -> merge all results together. Potentially ScaNN, but not with non-perfect insert.
- [ ] Abstract
- [ ] Small section about future work, not too heavy
- [ ] Network experiment needs a bit more context on what the results show for Beyond Federated.
update:
comments:
Started full-time thesis around April/May 2023.
Track DST, Q3/4 start. Still "seminar course" todo. Has superapp/MusicDAO experience. Discussed topics as diverse as the digital Euro and a Web3 search engine (unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance). Done
Machine Learning I class. (Background: Samsung solution, ONE (On-device Neural Engine): a high-performance, on-device neural network inference framework.) Recommendation or semantic search? Alternative direction. Some overlap with the G-Rank follow-up project. Essential problem to solve: learning valid Creative Commons BitTorrent swarms.
Second sprint (strictly exploratory):
Doing the Information Retrieval MSc course to prepare for this thesis.
Literature survey initial idea: "nobody is doing autonomous AI" {unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance}.