Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Phd Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586

Open synctext opened 10 months ago

synctext commented 10 months ago

ToDo: determine PhD focus and scope

PhD funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy Duration: 1 Sep 2023 - 1 Sep 2027

First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version". Long version is 236 papers :smile: . Run Tribler from the sources.

Before doing fancy decentralised machine learning or learn-to-rank, first have stability, semantic search, and classical algorithms deployed. Current dev team focus: https://github.com/Tribler/tribler/issues/3868

update: Sprint focus? Read more Tribler articles and get this code going again: https://github.com/devos50/decentralized-rules-prototype

pneague commented 9 months ago

I have been working on understanding the work done by Martijn on ticket 42. I read through it and downloaded the attached code.

The last version of the code had a couple of functions not yet implemented, so I reverted to the 22-06-2022 version (instead of the last version, uploaded on 27-06-2022).

The 22-06-2022 version had a few outdated functions and small bugs here and there, but since they were minor I was able to fix them.

I downloaded the required dataset and then successfully ran the parser and scenario-creation functions implemented by Martijn. After that I ran the experiment itself based on the above-mentioned scenario, resulting in a couple of CSVs and graphs.

I understand the general idea of the experiments and how they work, but the code still eludes me since it is barely commented. Here's an example of a graph from an experiment run with Martijn's code so far (image attached).

synctext commented 9 months ago

Hmmm, very difficult choice. For publications we should focus on something like Web3AI: deploying decentralised artificial intelligence

pneague commented 8 months ago

Re-read papers on learn-to-rank and learned how to use IPv8. With it I created a simulation in which a number of nodes send messages to one another. From there I worked with Marcel and started implementing a system whereby one node sends a query to the swarm and receives recommendations of content back from it. The progress is detailed in ticket 7290. The idea at the moment is that we implement a version of Mixture-of-Experts (https://arxiv.org/pdf/2002.04013.pdf) whereby one node sends the query to other nearby nodes and receives recommendations. These are then aggregated into a shortened, sorted list of recommendations for the querying node.
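To make the aggregation step concrete, here is a rough sketch of how a querying node could merge the ranked lists it gets back (plain Python with placeholder names; the actual implementation lives in Marcel's p2p-ol2r code and may differ):

```python
from collections import defaultdict

def aggregate_recommendations(peer_results, top_k=10):
    """Merge ranked recommendation lists from several peers into one short list.

    peer_results: list of lists of (doc_id, score) tuples, one list per responding peer.
    Scores are assumed to be comparable across peers (e.g. softmax probabilities).
    """
    combined = defaultdict(float)
    votes = defaultdict(int)
    for results in peer_results:
        for doc_id, score in results:
            combined[doc_id] += score   # accumulated confidence across peers
            votes[doc_id] += 1          # how many peers recommended this doc

    # Rank by accumulated confidence, break ties by number of votes.
    ranking = sorted(combined, key=lambda d: (combined[d], votes[d]), reverse=True)
    return ranking[:top_k]

# Example: two peers answer the same query with partially overlapping results.
peer_a = [("doc_17", 0.6), ("doc_3", 0.3)]
peer_b = [("doc_3", 0.7), ("doc_42", 0.2)]
print(aggregate_recommendations([peer_a, peer_b]))  # ['doc_3', 'doc_17', 'doc_42']
```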

There are 2 design choices: we could send the query-doc_inferior-doc_superior triplets around as gossip, or (as we do at the moment) send the model updates around every run. We'll look deeper into these ideas.

One issue discovered regards the size of the IPv8 network packet, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment, and the maximum network packet size for IPv8 is 2.7 MB, so we have to fit in as many weight updates per packet as possible.
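As a rough illustration of the packet-size problem, splitting a serialized model into chunks that fit the 2.7 MB budget could look like this (a sketch only; the real IPv8 payload handling is Marcel's work and will differ):

```python
import io
import torch

MAX_PACKET_BYTES = int(2.7 * 1024 * 1024)  # packet budget mentioned above

def serialize_model(model: torch.nn.Module) -> bytes:
    """Serialize a PyTorch state_dict to raw bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()

def split_into_packets(blob: bytes, max_size: int = MAX_PACKET_BYTES):
    """Split serialized weights into chunks that each fit in one packet."""
    return [blob[i:i + max_size] for i in range(0, len(blob), max_size)]

def reassemble(chunks) -> dict:
    """Reassemble chunks (received in order) back into a state_dict."""
    buffer = io.BytesIO(b"".join(chunks))
    return torch.load(buffer)
```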

You can see a demonstration of the prototype below (demo attachment).

I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens when the recommendations from each node that received the query are entirely different). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch

synctext commented 8 months ago

It's beyond amazing what you accomplished in 6 weeks after starting your PhD. :unicorn: :unicorn: :unicorn: Is the lab now All-In on Distributed AI? :game_die:

Can we upgrade to transformers? That is the cardinal question for scientific output. We already had Distributed AI deployed, in unusable form, within our Tribler network in 2012. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed ClickLog code in 2015 because it was not good enough. Options:

For a YouTube-alternative smartphone app we have a single simple network primitive: query, content-item-clicked, content-item-NOT-clicked, clicked-item-popularity, signature; and in TikTok form without queries but with viewing attention time added: content-item-long-attention, long-attention-time, content-item-low-attention, low-attention-time, long-attention-item-popularity, signature. Usable for content discovery, cold starts, content recommendation, and obviously semantic search. A sketch of these records follows below.
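As a sketch of what such primitives could look like on the wire (field names here are illustrative, not a deployed format):

```python
from dataclasses import dataclass

@dataclass
class ClickLogRecord:
    """One search interaction, signed by the reporting peer (field names are illustrative)."""
    query: str                     # what the user typed
    clicked_item: str              # infohash of the content item that was clicked
    not_clicked_items: list[str]   # infohashes shown but skipped
    clicked_item_popularity: int   # swarm popularity of the clicked item
    signature: bytes               # signature by the reporting peer over the record

@dataclass
class AttentionRecord:
    """TikTok-style variant without queries, based on viewing attention time."""
    long_attention_item: str       # item the user watched for a long time
    long_attention_seconds: float
    low_attention_item: str        # item the user skipped quickly
    low_attention_seconds: float
    long_attention_item_popularity: int
    signature: bytes
```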

Next sprint goal: get a performance graph! We need to get a paper on this soon, because the field is moving at lightning speed. So up and running before X-Mas, Tribler test deployment, and usage of NanoGPT in Jan, paper in Feb :rocket:

pneague commented 8 months ago

After looking into which datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries paired with the relevant website link for each query. It is compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as only including queries that were searched a number of times and omitting user IDs and the like).

The data seems good, but the fact that we have links instead of document titles makes it impossible to use the triplet model we have right now (which needs to compute the 768-dimensional embedding of the document title: since we only have a link and no title, we cannot do that).

So I was looking for another model architecture usable in our predicament and I found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document that is the nearest neighbour of the query), we can use a differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, make the output layer of the network contain as many neurons as there are documents, and have the network essentially assign probabilities to each document given a query. Additionally, the paper does this with a Transformer architecture, raising the possibility of integrating NanoGPT into the future architecture.

I implemented an intermediary version of the network whereby the same encoder that Marcel used (the allenai/specter language model) encodes a query and the output is the probability of each document individually. The rest of the architecture is left unmodified:

```python
layers = [
    ('lin1', nn.Linear(768, 256)),  # encoded query, 768 dimensions
    ('relu1', nn.ReLU()),
    ('lin2', nn.Linear(256, 256)),
    ('relu2', nn.ReLU()),
    ('lin3', nn.Linear(256, 256)),
    ('relu3', nn.ReLU()),
    ('lin4', nn.Linear(256, number_of_documents)),  # output probabilities
]
```

In my preliminary tests so far, with 884 documents (i.e. 884 output neurons) we can perform 50 searches in 4 seconds (about one search per 0.08 seconds). With 1,066,561 documents, 50 searches complete in 200 seconds (one search per 4 seconds). Under some circumstances this may be acceptable for Tribler users, but people with older computers might experience significant difficulties. I will need to look at ways of reducing the computation time required.
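For reference, a minimal sketch of how that head sits behind the SPECTER encoder at inference time (untrained weights, illustrative only; the real training loop is omitted):

```python
from collections import OrderedDict
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

number_of_documents = 884  # size of the document collection (output layer width)

# Query encoder: allenai/specter produces a 768-dimensional [CLS] embedding.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
encoder = AutoModel.from_pretrained("allenai/specter")

# The MLP head from the layer list above, assembled into a model.
head = nn.Sequential(OrderedDict([
    ("lin1", nn.Linear(768, 256)),
    ("relu1", nn.ReLU()),
    ("lin2", nn.Linear(256, 256)),
    ("relu2", nn.ReLU()),
    ("lin3", nn.Linear(256, 256)),
    ("relu3", nn.ReLU()),
    ("lin4", nn.Linear(256, number_of_documents)),
]))

def search(query: str, top_k: int = 5):
    """Return the top_k document indices for a query (weights are untrained here)."""
    with torch.no_grad():
        tokens = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)
        embedding = encoder(**tokens).last_hidden_state[:, 0, :]   # (1, 768) CLS vector
        probabilities = torch.softmax(head(embedding), dim=-1)     # (1, number_of_documents)
        return probabilities.topk(top_k).indices.squeeze(0).tolist()

print(search("decentralized learn to rank"))
```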

Moving forward, I'm looking to finally implement a good number of peers in a network that send each other the query and answer (from ORCAS) and get the model to train.

qstokkink commented 8 months ago

Cool stuff :+1: Could you tell me more about your performance metrics? I have two questions:

  1. Are these SIMD results (i.e., one batch of 50 searches takes 200 seconds, but a batch with 1 search also takes 200 seconds)?
  2. What hardware did you use (e.g., CPU, some crappy laptop GPU, HPC node with 10 Tesla V100's, ..)?

This matters a lot for deployment in Tribler.

pneague commented 8 months ago
  1. They are not SIMD. One search actually takes 1/50th of the mentioned time
  2. I used a Mac laptop with M2 Pro Chip

But keep in mind, this is extremely preliminary; I did not implement NanoGPT with this setup, so that's bound to increase computing requirements.

synctext commented 7 months ago

Paper idea to try out for 2 weeks:

LLM-for-search related-work example on GitHub, called vimGPT:

https://github.com/ishan0102/vimGPT/assets/47067154/467be2ac-7e8d-47de-af89-5bb6f51c1c31

pneague commented 7 months ago

I got the T5 LLM to generate the IDs of ORCAS documents. Current setup:
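Roughly, it boils down to fine-tuning T5 to map query text to a docid string and generating at search time. A hedged sketch with a stock Hugging Face T5 (the actual fine-tuned checkpoint and the docid vocabulary it was trained on differ):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# In the real setup the model is fine-tuned on (query -> docid-string) pairs from ORCAS.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def predict_docids(query: str, num_candidates: int = 5):
    """Generate candidate document IDs for a query with beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        max_new_tokens=16,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

print(predict_docids("tribler p2p search"))
```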

I was looking for what to do moving forward.

I found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there is a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still wondering what to do next to make the work I've already done publishable for the conference deadline on the 5th of January.

synctext commented 7 months ago

update Please try to think a bit already about the next step/article idea for the upcoming summer :sun_with_face: :tropical_drink:? Can you think of something where users donate their GPU to Tribler and get a boost in their MeritRank as a reward :1st_place_medal: :heavy_plus_sign: the Marcel angle of "active learning" by donating perfect metadata? Obviously we need the ClickLog deployment and crawling deployed first.

pneague commented 6 months ago

In the past weeks I've managed to set up 10 peers that send each other query-doc_id pairs.

The mechanism implemented is the following:

For the future, I think it may be worthwhile to use DAS6 to run a test with 100 peers, to check the integrity of the model and how it evolves as the number of peers increases.

synctext commented 6 months ago

AI with access to all human knowledge, art, and entertainment.

AGI could help humanity by developing new drugs and treatments for diseases, and by turbocharging the global economy. Who would own this AGI? Our dream is to contribute to this goal by pioneering a new ownership model for AI and a novel model for training. AI should be public and contribute to the common good. More than just open weights: full democratic self-governance. An open problem is how to govern such a project and devise a single roadmap with conflicting expert opinions. Current transformer-based AI has significant knowledge gaps and needs thousands or even millions of people to tune it. It needs the Wikipedia paradigm! Gemini example: what is the most popular YouTube video? The state-of-the-art AI fails to understand the concept of media popularity, front-page coverage, and the modern attention economy in general.

Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)

synctext commented 5 months ago

Public AI with associative democracy

Who owns AI? Who owns the Internet, Bitcoin, and BitTorrent? We applied public infrastructure principles to AI. We are building an AI ecosystem which is owned by both nobody and everybody. The result is a democratically self-governing association for AI.

We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance. Numerous proposals have been made for making AI safe, democratic, and public. Yet these proposals are often grounded exclusively in either philosophy or technology. The technological experts who build databases, operating systems, and clouds rarely interact with the experts who deeply understand the question 'who has control?'. Democracy is still a contested concept after centuries. Self-governance is the topic of active research, both in the world of atoms and the world of bits. Complex collective infrastructure with self-governance is an emerging scientific field. Companies such as OpenAI run on selling their AI dream to ageing companies such as Microsoft. There is a great need for market competition and a fine-grained supply chain; the lack of fine-grained competition in a supply-chain ecosystem is hampering progress. Real-world performance results irrefutably show that the model architecture is not really that important: it can be classical transformers, Mamba, SSM, or RWKV. The training set dominates the AI effectiveness equation. Each iteration brings more small improvements to a whole ecosystem, all based on human intelligence. Collective engineering on collective infrastructure is the key building block towards creating intelligence superior to the human intellect.

AI improvements are a social process! The way to create long-enduring communities is to slowly grow and evolve them. The first permissionless open-source machine learning infrastructure was Internet-deployed in 2012. However, such self-ruled communities only play a minor role in the AI ecosystem today. The dominating AI architecture is fundamentally unfair. AI is expensive and requires huge investments: an exclusive game for the global tech elite. Elon Musk compared the ongoing AI race to a game of poker, with table stakes of a few billion dollars a year. Such steep training costs and limited access to GPUs cause Big Tech to dominate this field. These hurdles notably affect small firms and research bodies, constraining their progress. Our design splits the ecosystem by creating isolated, competitive markets for GPU renting and training-set storage. Our novel training model brings significant synergy, similar to the Linux and Wikipedia efforts. By splitting the architecture and having fine-grained competition between efforts, the total system efficiency is significantly boosted. It enables independent evolution of dataset gathering, data storage, GPU rental, and AI models. Our third pioneering element is democratic access to GPU hardware. One branch of distributed machine learning studies egalitarian architectures, where even a tiny smartphone can contribute to the collective. A billion smartphones, in theory, could significantly outsmart expensive hardware. Wikipedia and Linux have proven that you can't compete with free. We mastered the distributed, permissionless, and egalitarian aspects of AI. The next stage of evolution is to add democratic decision-making processes. A team of 60 master students is currently attempting to engineer this world-first innovation collectively. Another huge evolutionary leap is AI with access to all human knowledge, art, and entertainment. Currently, datasets and training hardware are expensive to gather and store. For instance, the open-access movement for scientific knowledge has not yet succeeded in creating a single repository; the training of next-generation AI requires completion of this task. All creative-commons content (text, audio, video, DNA, robotics, 3D) should be scripted into an expanding living dataset, similar to the SuperGLUE set of datasets. The cardinal problem is building trust in the data, its accuracy, and its legal status. In prior work we pioneered a collective data vault based on passport-grade digital identity.

pneague commented 5 months ago

In the last few weeks I ran experiments with ensembles of peers. Experiments with more than 10 peers make the laptop run out of RAM and start acting weirdly, so I had to change the direction of my work. The current idea is that T5-small is not able to fit that many doc_ids inside its weights (because it is so small), but we need it to be small for it to run on the computers of Tribler peers.

So, to increase the number of retrievable documents, I thought of sharding the dataset, with each shard having its own peers. In the experiments performed, each shard consists of 10 peers.

Model ensemble from different shards (diagram). It shows how a 2-shard ensemble would work with the voting and confidence mechanism (in the previous iteration the models were chosen randomly, without caring how many models we get from each shard).
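In code, the voting-and-confidence aggregation is approximately the following (a simplified sketch, not the exact experiment code):

```python
def ensemble_answer(shard_predictions):
    """Pick a final docid from per-shard model outputs.

    shard_predictions: {shard_id: list of (docid, confidence)}, where each entry is
    the prediction of one model belonging to that shard.

    Within a shard the models vote; across shards the winning candidates compete
    on average confidence. This only roughly mirrors the diagram above.
    """
    shard_winners = []
    for shard_id, predictions in shard_predictions.items():
        # Majority vote inside the shard, summed confidence as tie-breaker.
        tally = {}
        for docid, confidence in predictions:
            votes, conf_sum = tally.get(docid, (0, 0.0))
            tally[docid] = (votes + 1, conf_sum + confidence)
        docid, (votes, conf_sum) = max(tally.items(), key=lambda kv: kv[1])
        shard_winners.append((docid, conf_sum / votes))
    # The shard whose winner has the highest average confidence provides the answer.
    return max(shard_winners, key=lambda item: item[1])[0]

# Example: shard 0's models agree on doc_a, shard 1's single model prefers doc_b.
print(ensemble_answer({0: [("doc_a", 0.8), ("doc_a", 0.7)], 1: [("doc_b", 0.6)]}))  # doc_a
```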

synctext commented 4 months ago

Solid progress! Operational decentralised machine learning :rocket: :rocket: :rocket: De-DSI for the win.

Possible next steps are enabling unbounded scalability and on-device LLMs. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis, or the knowledge-graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era.

update: Nature paper :astonished: It uses an LLM for parsing 1200 sentences and 1100 abstracts of scientific papers, avoiding the hard work of PDF knowledge extraction. Structured information extraction from scientific text with large language models: this work outputs entities and their relationships as JSON documents or other hierarchical structures.

pneague commented 3 months ago

Fresh results from DAS6 for magnet link prediction:

- 1,000 docs: 90.5%
- 5,000 docs: 77%
- 10,000 docs: 65%

See comparison between predicting docids vs magnet links: image

When the dataset is relatively small, the accuracies are the same for both top-1 and top-5. As more data appears in the dataset, we can see a divergence in the accuracies for the two metrics. We hypothesize that the limited number of weights in our model efficiently captures URL patterns in scenarios with sparse data. However, as the data complexity increases, this constraint appears to hinder the model's ability to accurately recall the exact sequence of tokens in each URL. This is merely a guess, and we intend to investigate it further in future work. However, the observed discrepancy in accuracy remains marginal, amounting to merely a few percentage points across a corpus of 10K documents.

pneague commented 2 months ago

Poster for the De-DSI paper: De-DSI Poster.pdf

pneague commented 1 month ago

One of the ideas to further develop the De-DSI paper was to perform the sharding division of documents in a semantically meaningful way. This is what I've done in the past couple of weeks.

The problem was that if you shard documents randomly, you can have 2 similar documents in different shards, and when querying all shards for one of the 2 documents, you get high confidence from both shards. This leads to a 50/50 chance that the correct shard has higher confidence than the incorrect one.

The idea was to perform semantic sharding such that all documents of a type end up in one shard. This would resolve the confusion between shards, as each shard would know which document needs to be retrieved and the others would have low confidence in their result.
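Concretely, one simple way to obtain such a semantic split is to cluster document embeddings; a minimal sketch with scikit-learn KMeans (the actual sharding code may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_shards(doc_embeddings: np.ndarray, num_shards: int):
    """Assign each document to a shard by clustering its embedding.

    doc_embeddings: (num_docs, dim) array, e.g. SPECTER embeddings of titles or queries.
    Returns a list of document-index arrays, one per shard.
    """
    labels = KMeans(n_clusters=num_shards, n_init=10, random_state=0).fit_predict(doc_embeddings)
    return [np.where(labels == shard)[0] for shard in range(num_shards)]

# Example: split 1000 random 768-dimensional "embeddings" into 4 shards.
print([len(s) for s in semantic_shards(np.random.rand(1000, 768), num_shards=4)])
```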

So I:

I compared the results and it turns out it doesn't work as hoped.


I believe the issue is that if shards contain semantically similar documents, it is harder to distinguish between them, so the confidence of the correct shard is lower than before. This means that documents in other shards which are more different but still slightly similar have a higher chance than before of beating the confidence of the correct document in the correct shard.

I thought about what exactly I could do about this but I haven't come up with anything yet. Jeremie recommended I look into fully decentralized ML training that is resistant to certain kinds of attacks. I have an idea of how it may be done, but I need to read more on it first as it's a new topic to me.

pneague commented 1 month ago

In the last few days I've read papers on

I also thought about how a mixture-of-experts with multi-layered semantic sharding would work. At the moment something that I could try would be:

I also haven't found any paper on personalized models in decentralized federated learning, so it seems to be an unexplored gap and thus maybe easy to publish in.

synctext commented 1 month ago

Focus on finding a PhD problem to solve. Avoid the "technology push" that makes much science useless. We need GPUs for training. We need a dataset. We need a publishable problem.

Perhaps it is time to dive for 3 weeks into a production system? Some ideas and links

Hipster publishable idea: secure information dissemination for decentralised AI (e.g. MeritRank, ClickLog, long-lived IDs, sharing data, not an unverifiable vector of gradient descent)

pneague commented 2 weeks ago

In the last few weeks I looked into methods of estimating reputation and sybil defense in a graph network using ML models. There are quite a few methods for doing this in all kinds of areas, for example edge-computing devices, social networks, etc.

After talking with Bulat, he suggested we could try to use MeritRank and some kind of model to limit the amount of resources that a sybil attack could sap from the network. The idea is still in an incipient phase and it's not clear to me yet whether it works. Bulat suggested that instead of doing what other papers have done (for example, the papers doing reputation estimation on social networks were using social-network information to find sybils), we could try to do this solely from the graph data. I'm not sure if this is possible, but I think it's in the realm of possibility.

Additionally, we would not use a supervised-learning method where the sybils are clearly labelled, but take a dataset where we assume all members of the graph to be honest, then perform all types of sybil attacks possible on the network and see if we can somehow limit how much attackers gain from this. We could also implement the methods of previous papers and compare our results to theirs in a situation where all types of sybil attacks are simulated. Bulat mentioned he doesn't know of a paper taking this approach so far.
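The shape of such an experiment could be something like the sketch below, which uses personalized PageRank on a networkx graph as a stand-in reputation score (MeritRank itself works differently; everything here is illustrative):

```python
import networkx as nx

def sybil_gain(honest_graph: nx.DiGraph, seed, attacker, num_sybils: int = 10):
    """Attach a sybil region to a trust graph and measure the reputation it captures.

    Personalized PageRank from `seed` is used as a stand-in reputation score;
    this only illustrates the experiment shape, not the MeritRank mechanism.
    """
    attacked = honest_graph.copy()
    sybils = [f"sybil_{i}" for i in range(num_sybils)]
    for s in sybils:  # attack edges between the attacker-controlled node and its sybils
        attacked.add_edge(attacker, s)
        attacked.add_edge(s, attacker)
    personalization = {n: (1.0 if n == seed else 0.0) for n in attacked.nodes}
    scores = nx.pagerank(attacked, personalization=personalization)
    return sum(scores[s] for s in sybils)  # total reputation captured by the sybil region

# Example: a random directed "honest" graph of 50 peers, node 7 turns attacker.
honest = nx.gnp_random_graph(50, 0.1, seed=1, directed=True)
print(sybil_gain(honest, seed=0, attacker=7, num_sybils=10))
```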

I have also talked with Quinten about the dataset his code is collecting. It's interesting but not very rich, even if we may end up with lots of data. You can see a very small sample meant as an example below: image

Basically we have query, infohash, score, parent_query_forwarding_pk. The score is calculated as follows: if you search for a query, click a link, and don't search for the same query again, you're assumed to be satisfied with the link, so the score = 1.0. If you search for a query, click a link, and are not satisfied, you search for the query again; if you then click a second link and are satisfied, you stop there. In that case the first link clicked gets a score of 0.2 and the second link clicked gets a score of 0.8.
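A small sketch that encodes exactly this rule (the deployed crawler code is Quinten's and may differ; this only reproduces the two cases described above):

```python
def score_clicks(clicked_links):
    """Assign satisfaction scores to the links clicked for one query session.

    clicked_links: infohashes clicked for the same query, in order. Only the two
    cases described above are encoded; scoring for longer sessions is not specified here.
    """
    if len(clicked_links) == 1:
        return {clicked_links[0]: 1.0}   # no re-search: assumed satisfied
    if len(clicked_links) == 2:
        return {clicked_links[0]: 0.2,   # first result did not satisfy
                clicked_links[1]: 0.8}   # second result ended the session
    raise ValueError("scoring for more than two clicks is not defined in this sketch")

print(score_clicks(["infohash_a"]))                # {'infohash_a': 1.0}
print(score_clicks(["infohash_a", "infohash_b"]))  # {'infohash_a': 0.2, 'infohash_b': 0.8}
```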

This is interesting, and may provide a way to derive reputation (for the person seeding the content behind the first link and for the person gossiping the queries). But I am not sure we can do it well if we don't have many users relative to the number of links available. We'll have to see how much data we end up with in a few months.

synctext commented 2 weeks ago

btw about teaching... prepare for helping out more with MSc students + the master course Blockchain Engineering.

update: machine learning for 1) personalisation, 2) De-DSI content discovery, 3) decentralised seeder content discovery {DHT becomes :point_right: IPv4 generative AI}, 4) sybil protection, 5) spam protection, 6) learn-to-rank