ToDo: register https://mare.ewi.tudelft.nl/project
Latest work by TUDelft: MoDeST: Bridging the Gap between Federated and Decentralized Learning with Decentralized Sampling
To create a suggestion model with neural hashes using metadata as input to find songs in Creative Commons BitTorrent swarms:
Optionally, improve the model over time: track which songs are actually downloaded or listened to by users, and use this data to train the model to improve its suggestions.
Proposal: a dedicated sprint to implementing a basic search engine.
https://colab.research.google.com/drive/1j_voFtr6j0gEStsMfcafi9FV5XJOLxjj?usp=sharing
1) Scraping metadata:
```json
[
  "Cullah Firebird electronic folk soul",
  "Serious Mastering Ego electronic",
  "Serious Mastering La chaleur du soleil electronic",
  "Oxidant Deconstruct hardcore.punk powerviolence punk",
  ...
]
```
2) Translate into embeddings.
3) Compare embeddings using cosine similarity:
```
query: ['Firebird']
similarity score: 0.6434079439455619  Cullah Firebird electronicfolksoul

query: ['electronic']
similarity score: 0.4482832368649311  Serious Mastering Ego electronic
similarity score: 0.3406708597897247  Serious Mastering La chaleur du soleil electronic
```
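For reference, a minimal sketch of steps 2 and 3, assuming the Universal Sentence Encoder from TF Hub as the embedding model (the Colab may use a different encoder):

```python
import numpy as np
import tensorflow_hub as hub

# Assumption: Universal Sentence Encoder as the embedding model.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

corpus = [
    "Cullah Firebird electronic folk soul",
    "Serious Mastering Ego electronic",
    "Serious Mastering La chaleur du soleil electronic",
]
corpus_vecs = encoder(corpus).numpy()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the corpus against a query, highest similarity first.
query_vec = encoder(["Firebird"]).numpy()[0]
ranked = sorted(zip(corpus, corpus_vecs),
                key=lambda pair: -cosine_similarity(query_vec, pair[1]))
for text, vec in ranked:
    print(f"similarity score: {cosine_similarity(query_vec, vec)}  {text}")
```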
sample.json
APK including: WeTransfer link (118 MB) https://we.tl/t-pnugzyNiRV
Question: how impressed/intimidated/confused are you by the recent ML/LLM/Diffusion explosion?
@quintene answer: innovation speed is fast/sophisticated due to everybody building on top of each other.
Johan note: What does a leaked Google memo reveal about the future of AI?
Question: how to identify and follow a long-enduring winner? 1) Alpaca on Pixel7, or 2) MLC Android or 3) https://github.com/BlinkDL/RWKV-LM
@quintene answer: Nobody has solved the magic architecture of decentralised learning! Personalised models, how to partition, can we re-use the "decentralisation layer" across the whole ML domain? Current limited approach: one dataset, one application. "Dynamic distributed learning". Johan note: goal: non-i.i.d.?
`electronics` != `electronics `: a single trailing space leads to duplicates. "Vectorization of scraped PandaCD." Have not found related work that does without a central server; does any exist?
Related work:
Cleaning/extending dataset: https://github.com/quintene/trustchain-superapp/blob/master/peerai/src/main/assets/scraped_data_02.json
Adding improved vectorization of albums, including songs, metadata, author, and images; improved search.
{ "artist": "Cullah", "title": "Firebird", "author_image": "https://images.pandacontent.com/artist/12/250x250/2-cullah.jpeg?ts=1675708015", "author_description": "MC Cullah is a producer/singer/songwriter/rapper from Milwaukee, Wisconsin. His music is lost somewhere in between Rock -n- Roll, Electronica and Hip Hop with a pinch of psychedelic melodies. With an arsenal of synthesizers and a library of forgotten sounds he manages to create something that sparks imagination and wonder.", "author_upcoming": [ { "context": "https://schema.org", "type": "MusicEvent", "startDate": "2023-06-15T00:00:00+00:00", "offers": "https://www.songkick.com/concerts/41175136-cullah-at-radio-milwaukee-889-fm", "name": "Radio Milwaukee 88.9 FM", "location": { "type": "PostalAddress", "addressLocality": "Milwaukee, WI, US" } ], "year": "2022", "tags": [ "electronic", "folk", "soul" ], "artwork": "https://images.pandacontent.com/release/779/250x250/1-firebird.jpeg?ts=1675708399", "magnet": "magnet:?xt=urn:btih:O2NCAP26N63U7VK6LSCXNVR3VV3ODILA&tr=udp%3A//tracker.pandacd.io%3A2710&dn=Cullah%20-%20Firebird%20%282022%29%20-%20MP3", "songs": [ "The Feather", "Firebird Credits", "The Golden Apple", "The Anima", "The King" ] },
```kotlin
private fun pickRandomNodeToSongEdgesToGossip(): List<NodeRecEdge> {
    // ...
}
```
"Peer AI": Refactoring on earlier work "Vectorization from scratch in Kotlin" creating a Searcher model using ScaNN within tensorflow (mobile).
Research goal: Train/share the above model within a P2P environment, considering significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers.
Research into related work (fully federated learning approaches)
Seminar on distributed ML systems: Working on a project applying differential privacy within federated learning where attacks are executed. paper
Writing on Problem Description.
BeyondFederated
Nice title! Developing a music search engine within a peer-to-peer (P2P) network presents significant challenges due to the limited availability of peers, lack of trust, and dynamic identities of peers. These factors add complexity to the task of building an efficient and reliable music search engine within a decentralized environment.
Suggestion: a whole storyline on "each peer only has a partial view of the network. No central viewpoint exists with the complete overview. This severely impacts the possible solutions. None of the traditional mechanisms are able to function in this leaderless environment. Traditional solutions all assume a client/server or single-ownership entity. We need self-organisation." FMA aims to overcome this hurdle by providing 917 GiB and 343 days of Creative Commons-licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres.
Ideal sprint outcome for 15 Aug: an operational PeerAI .APK with a minimal viable "TFLite Searcher model" with PandaCD and FMA. Focus: genre similarity. Adding new items, fine-tuning, and exchanging models is still out of scope. Let's get this operational first!
Modelling the "TFLite Searcher model" on a subset of the dataset, using only: Title, Artist, Genre/Tag, Album.
Training goal: creating vectors where items with similar attributes have smaller distances.
Porting everything into Kotlin.
Please ensure to cite this work in your thesis: AI Benchmark: All About Deep Learning on Smartphones in 2019. Website of the ETH-Z on-device AI benchmarking; includes S23 results.
UPDATE: YouTube contains more content than FMA and PandaCD. Great datasets exist. See the YouTube player; you could connect it to your thesis focus of BeyondFederated content search with actual playable content.
Please load this [```URL_Youtube``` into Kaggle and check it out](https://www.kaggle.com/datasets/salvatorerastelli/spotify-and-youtube). 20230 unique music videos to recommend by 2079 artists! This would impact your work and disconnect it more from the MusicDAO code. {Brainstorm input: any YouTube & magnet playback of both video or music. [116098 "music video URLs" inside this Youtube-8M dataset](https://research.google.com/youtube8m/explore.html) with annotations from a diverse vocabulary of 3,800+ visual entities for semantic search.}
Neural Instant Search for Music and Podcast
Finished the model design: the final .tflite model will consist of an embedder model and a ScaNN layer. The Dataset Key + metadata setup transforms a user's query input (title, genre, author) into a vector and searches for the closest vectors available in the network.
The model output consists of the closest neighbors, including all the metadata of the dataset. Also exploring replacing the sentence encoder model with a (song-)object embedding model. Model updates are available on the user's end device; the next goal would be to distribute model changes in fully decentralized FL. https://blog.tensorflow.org/2021/11/on-device-training-in-tensorflow-lite.html
However, currently stuck implementing the TFLite Model Maker library for on-device ML applications while creating the first version of the model. It is only needed to translate the collected datasets into a .tflite model, but I can't get it running; currently looking for other solutions... https://pypi.org/project/tflite-model-maker/
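For the record, the intended flow with the Model Maker Searcher API is roughly the following sketch, based on the library's text-searcher tutorial; the file names and ScaNN parameters here are placeholders, not the final configuration:

```python
from tflite_model_maker import searcher

# Embed each dataset row with an on-device Universal Sentence Encoder.
data = searcher.TextDataLoader.create("universal_sentence_encoder.tflite",
                                      l2_normalize=True)
data.load_from_csv("pandacd_fma.csv", text_column="text",
                   metadata_column="metadata")

# Attach a ScaNN layer: a k-means tree over the embeddings plus
# asymmetric hashing for fast scoring.
scann_options = searcher.ScaNNOptions(
    distance_measure="dot_product",
    tree=searcher.Tree(num_leaves=140, num_leaves_to_search=4),
    score_ah=searcher.ScoreAH(2, anisotropic_quantization_threshold=0.2),
)
model = searcher.Searcher.create_from_data(data, scann_options)
model.export(export_filename="searcher.tflite",
             userinfo="",
             export_format=searcher.ExportFormat.TFLITE)
```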
Goal: Self learned semantic network with 100k items
Todo: mention research in paper: https://arxiv.org/abs/1908.10396
TFLite Model Maker: Not able to build tflite-model-maker since a lot of dependencies were conflicting. Resolved by a custom Dockerfile with manual build steps including other libraries. (Also a repo update since 2 weeks.)
Model image metadata of the ScaNN layer:
```json
{
  "associated_files": [
    {
      "name": "on_device_scann_index.ldb",
      "description": "On-device Scann Index file with LevelDB format.",
      "type": "SCANN_INDEX_FILE"
    }
  ]
}
```
Key decision, learning: determine what exactly is learned by connected clients (a rebuilt custom index vs. clicklog gradients/recommendations, where items within search are ranked higher based on the ClickLog (a.k.a. popular audio ranked higher)).
Decentralized learning todo:
Goal for upcoming days: scale Scalable Nearest Neighbors. Create the first ScaNN indexer that does not need to rebuild a new index based on the whole dataset, using only the ScaNN library running in some Python code. The indexer will:
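As a baseline, a minimal sketch of building and querying a ScaNN index in plain Python with the scann library; all values are illustrative, with num_leaves following the √N heuristic noted further down:

```python
import numpy as np
import scann

# Toy dataset standing in for the song embeddings (20K x 512 floats).
dataset = np.random.rand(20000, 512).astype(np.float32)

# Single-layer k-means tree + asymmetric hashing, as in the TFLite searcher.
searcher = (
    scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
    .tree(num_leaves=int(np.sqrt(len(dataset))),  # ~sqrt(N) partitions
          num_leaves_to_search=10,
          training_sample_size=20000)
    .score_ah(2, anisotropic_quantization_threshold=0.2)
    .reorder(100)
    .build()
)

neighbors, distances = searcher.search(dataset[0], final_num_neighbors=5)
print(neighbors, distances)
```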
Realizing this will be a big contribution towards TFLite Support.
Fat Android build for multiple architectures: x86, x86_64, arm64-v8a, armeabi-v7a. Successfully builds/compiles target /java/src/java/org/tensorflow/lite/task/text/task-library-text.aar, which includes custom API tasks such as: extend index, convert to buffer and replace in model metadata, and pack the associated index files.
`{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}`
Extending TFLite Support with custom API calls (On-Device ScaNN C++).
Currently focusing on training ScaNN: a single-layer k-means tree is used to partition the database (index), which I am now able to modify. The model is trained on forming partition centroids (as a way to reduce the search space). In the current setup, new entries are pushed into the vector space, but determining in which partition they should appear (closest to certain partition centroids) is hard.
Job to be done: rebuilding partitions.
- `INDEX_FILE E_X`: an actual partition including compressed vectors
- `INDEX_CONFIG`: config of embedding dimensions etc.
- `M_Y`: metadata entry
For a dataset of N items, there should be around √N partitions to optimize performance.
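In plain Python, that rebuild step could look like the following hedged sketch, using scikit-learn's KMeans as a stand-in for ScaNN's internal partition trainer:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for all embeddings currently in the index.
embeddings = np.random.rand(10000, 512).astype(np.float32)

# Rebuild: re-run k-means over everything, with ~sqrt(N) centroids.
num_partitions = int(np.sqrt(len(embeddings)))
kmeans = KMeans(n_clusters=num_partitions, n_init=10).fit(embeddings)

centroids = kmeans.cluster_centers_   # new partition centroids
assignments = kmeans.labels_          # partition id per embedding
```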
No train method is exposed in the current model setup, so either another API call is needed to expose one, or we use non-perfect insert, where new embeddings are inserted into the closest partition centroid; this works until approximately 100k items.
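A hedged sketch of what such a non-perfect insert boils down to: centroids stay frozen, and a new embedding simply joins its nearest partition:

```python
import numpy as np

def non_perfect_insert(centroids: np.ndarray, embedding: np.ndarray) -> int:
    """Return the partition id of the frozen centroid closest to `embedding`.

    Centroids are never re-trained, so partitions slowly drift away from
    optimal as more items are inserted.
    """
    distances = np.linalg.norm(centroids - embedding, axis=1)
    return int(np.argmin(distances))
```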
Older nearest-neighbor paper by Google.
Given the on-device limitation of recreating the whole index including the new partitions and centroids, an interesting research direction is Fast Distributed k-Means with a Small Number of Rounds.
Research question shift towards: "How can the efficiency and effectiveness of SCANN be enhanced through novel strategies for dynamically adding entries, specifically focusing on the adaptive generation of K-Means tree partitions to accommodate evolving datasets while maintaining optimal search performance?"
This research question addresses the challenge of adapting SCANN, a scalable nearest neighbors search algorithm, to handle dynamic datasets. The focus is on developing innovative approaches for adding new entries in a way that optimizes the generation of K-Means tree partitions, ensuring efficient search operations as the dataset evolves.
"Evolving datasets" key in a fully decentralized (On-Device) vector space, no central entity to re-calculate all the necessary partioning/indexing.
TODO for next sprint: focus on frozen centroids and imperfect inserts. Keep it simple!
Also implement the recommendation model. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. In TensorFlow Recommenders, both components can be packaged into a single exportable model, giving us a model that takes the raw user id and returns the titles of the top entries for that user.
For search, querying the vector space with a given query retrieves the top-k results. Next, we not only use this data for retrieving top items but also to train our user-song recommendation model.
We then train our loss function based on: `{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}`
```python
import tensorflow as tf
import tensorflow_recommenders as tfrs
from typing import Dict, Text

class SongRecommenderModel(tfrs.Model):
    def __init__(self, user_model, song_model, task):
        super().__init__()
        self.user_model: tf.keras.Model = user_model
        self.song_model: tf.keras.Model = song_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # We pick out the user features and pass them into the user model.
        user_embeddings = self.user_model(features["user_id"])
        # And pick out the clicked-song features and pass them into the song
        # model, getting embeddings back.
        querylog_song_embedding = self.song_model(features["Youtube-clicked-title"])
        # The task computes the loss and the metrics.
        return self.task(user_embeddings, querylog_song_embedding)
```
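Hypothetical wiring for the model above, assuming `unique_user_ids` and `unique_titles` vocabularies extracted from the gossiped ClickLog:

```python
# `unique_user_ids` / `unique_titles` are assumed to be string arrays
# built from the ClickLog records.
user_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
])
song_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=unique_titles, mask_token=None),
    tf.keras.layers.Embedding(len(unique_titles) + 1, 32),
])
model = SongRecommenderModel(user_model, song_model, task=tfrs.tasks.Retrieval())
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
```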
({guorq, sunphil, erikml, qgeng, dsimcha, fchern, sanjivk}@google.com). It's the core of a scientific paper to expand on-device machine learning with unbounded item inserts and dynamic re-clustering :rocket: :wrench: :rocket: So not a 3-week sprint. Update: idea for experimental results: show exactly how insert/lookup starts to degrade as you insert 100k or 10 million items. Do clusters become unbalanced, too big, too distorted from their centroid?
Goal:
Slowly progressing due to complexity: it is not just appending a new item to a partition array, plus C++ and a tough development environment... For now, focus the last development sprint on "indexing new embeddings"; otherwise come up with other alternatives.
YouTube: iterate through the music category. Analysis of a dataset of millions of songs (150 MB? -> device-ready!) https://developers.google.com/youtube/v3/docs
4.d Add embedding into the closest partition's LevelDB key-value entry (STUCK HERE)
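A hedged sketch of what step 4.d is attempting, using the plyvel LevelDB bindings; the `E_<partition>` key layout is an assumption based on the index file naming above:

```python
import numpy as np
import plyvel

db = plyvel.DB("/tmp/on_device_scann_index.ldb", create_if_missing=True)

def append_to_partition(partition_id: int, embedding: np.ndarray) -> None:
    # Hypothetical key layout: one LevelDB value per partition, holding the
    # concatenated float32 bytes of all embeddings in that partition.
    key = f"E_{partition_id}".encode()
    existing = db.get(key) or b""
    db.put(key, existing + embedding.astype(np.float32).tobytes())
```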
ClickLog gossip layer within IPv8 community programming. Target of past weeks (including some time off on holiday): non-perfect insert.
Only 2.4 MB for a 20K-item trained cluster config!! Seeing valuable possibilities here, such as sharing configs with peers?? Dynamic/shareable vector spaces in a distributed context. Self-learned, or also keep sharing configs.
Bazel builds including tests do succeed!
The build delivers custom libraries, currently being integrated in the Super App; it requires custom API calls and is now facing random crashes due to unsupported hardware (emulator only). Current state: debugging on older Android devices.
- [x] Searching still works under the new custom-built library.
- [x] Gossip of new items/ClickLog also possible.
Different encoder layers are possible within the on-device model; the current implementation includes embeddings based on the Universal Sentence Encoder.
This means encodings are placed based on semantics, but not typos:
"Red Red Wine" will result in UB40 - Red Red Wine
"Red Red Wyne" will not result in UB40 - Red Red Wine
But then "Blue Wine" will result in UB40 - Red Red Wine
A solid milestone to improve upon. Potential extended gossip design: JSON gossip replaced by a gossiped C++ vector/embedding??
`{Query, Youtube-clicked-URL, Youtube-clicked-title, Youtube-clicked-views, Youtube-NOT-clicked-URL, date, shadow-signature}` -> `std::vector`
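A quick sketch of the size argument for gossiping raw embedding bytes instead of JSON (the 512-float dimension is an assumption):

```python
import numpy as np

embedding = np.random.rand(512).astype(np.float32)
payload = embedding.tobytes()                 # 512 * 4 = 2048 bytes on the wire
restored = np.frombuffer(payload, dtype=np.float32)
assert np.array_equal(embedding, restored)
```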
Experiment on a large TikTok dataset -> https://developers.tiktok.com/products/research-api/
Date | Youtube new videos upload rate |
---|---|
January 2009 | 15 hours of video / min |
2019 | 500 hours / min |
```
output(j, i): -0.46781   i: 172 and w/offset 1249
output(j, i): -0.671608  i: 173 and w/offset 1250
output(j, i): -0.711928  i: 174 and w/offset 1251
output(j, i): -0.601231  i: 175 and w/offset 1252
output(j, i): -0.533054  i: 176 and w/offset 1253
output(j, i): -0.644484  i: 177 and w/offset 1254
output(j, i): -2.3841    i: 178 and w/offset 1255
```
Results of top 5 items:
```
id: 1255  with distance: -2.3841
id: 17372 with distance: -0.777172
id: 1077  with distance: -0.761045
id: 1078  with distance: -0.748582
id: 7886  with distance: -0.740518
```
update: fun fact, DeepMind also uses the library you use :smile: Improving language models by retrieving from trillions of tokens
3. Working custom index with new data inserted into the closest partition!
Working towards: setup for thesis experiments. Setting up performance tests.
TODO: WRITING! Much uncertainty about the experiments resulted in a confusing direction towards the conclusion, and thus in the overall structure of the paper in progress.
GOAL: aiming for delivery in 6 weeks (26 April), including:
A new large dataset for the pretrained model (the 8M dataset has no title/author, only labels per timecode, which is not required, and it is also way too big). Therefore: YouTube-Commons, a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.
The collection comprises 15,112,121 original and automatically translated transcripts from 2,063,066 videos (411,432 individual channels). 15M video transcripts could be indexed inside the model, only the Universal Sentence Encoder is more effective on English texts.
Trained a model which includes 2M unique videos, but we could create a collection of 15M with the same videos (translated transcripts).
It will cost around a full day to train a model that big: running inference with the universal_sentence_encoder to create 2M embeddings is not that fast on CPU only, looping over a 500 MB csv. Currently still waiting for some new pre-trained models to finish, to analyze performance and wrap up the experiments section including all graphs.
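Instead of a per-row loop over the 500 MB csv, batching the encoder helps; a hedged sketch, where the file path and `title` column name are assumptions:

```python
import csv
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed_csv(path: str, batch_size: int = 1024) -> np.ndarray:
    # Encode the text column in batches rather than one row at a time.
    chunks, batch = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            batch.append(row["title"])
            if len(batch) == batch_size:
                chunks.append(encoder(batch).numpy())
                batch = []
    if batch:
        chunks.append(encoder(batch).numpy())
    return np.vstack(chunks)
```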
Interesting result during the overflow experiment (using the same embedding/metadata to overflow a specific partition): growing the same bucket from 156 items to 300K (identical inserts) only raises CPU time from ±445 ms to ±1100 ms (about 2.5x), which is still quite fast. Model size increased from 4 MB to 8 MB; 14 MB for 700K items in a single partition, all on CPU (11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz). 2,630,718 items = 37 MB. Experiment: compare the 2M dataset against the self "inefficient insert".
Results fluctuate a lot across independent runs.
Working on the experiment of the 2M pretrained model versus the NPI (non-perfect insert) model, in terms of size and speed. Accuracy is a tough nut to crack to evaluate; move in the direction of ANN-Benchmarks metrics such as queries per second vs. recall?
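For the accuracy side, the usual ANN-Benchmarks-style metric is recall@k against brute-force ground truth; a minimal sketch:

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, true_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the ANN search also returned.

    `approx_ids` and `true_ids` are (num_queries, >=k) arrays of neighbor ids,
    with `true_ids` computed by exact brute-force search.
    """
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(approx_ids, true_ids))
    return hits / (k * len(true_ids))
```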
Next steps also include Gossip/clicklog experiment.
Crash fix for Android API 34+ solved: APK https://we.tl/t-A4rw7naT1a
Meanwhile, focus is on writing, but I am a bit confused about the results: a blown-up partition still results in decent performance, therefore the current status is a bit chaotic and raw.
Nothing more is needed. 2,063,066 dataset
Search term -> Query:
This band also has 10 YouTube videos learned. The queries include
Please turn this bullet list into a table. TODO:
- [ ] Figures still need fixed positions (floating HELL)
- [ ] Streamline terms used: Entries = New Songs; don't call the test dataset for insert/remove "fantasy" songs..
- [ ] Replace the default ScaNN picture by how song -> embedding -> quantization -> hashing -> closest neighbor.
- [ ] Conclusion -> merge all results together. Potentially ScaNN, but not with non-perfect insert.
- [ ] Abstract
- [ ] Small section about future work, not too heavy
- [ ] Network experiment needs a bit more context on what the results show for Beyond Federated.
update:
comments:
Started full-time thesis around April/May 2023.
Track DST, Q3/4 start. Still "seminar course" todo. Has superapp/MusicDAO experience. Discussed topics as diverse as the digital Euro and a Web3 search engine (unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance). Done
Machine Learning I class. (Background: Samsung solution, ONE (On-device Neural Engine): a high-performance, on-device neural network inference framework.) Recommendation or semantic search? Alternative direction. Some overlap with the G-Rank follow-up project. Essential problem to solve: learning valid Creative Commons BitTorrent swarms.
Second sprint (strictly exploratory):
Doing the Information Retrieval MSc course to prepare for this thesis.
Literature survey initial idea: "nobody is doing autonomous AI" {unsupervised learning, online learning, adversarial, byzantine, decentralised, personalised, local-first AI, edge-devices only, low-power hardware accelerated, and self-governance}.