Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Phd Placeholder: learn-to-rank, decentralised AI, on-device AI, something. #7586

Open synctext opened 10 months ago

synctext commented 10 months ago

ToDo: determine PhD focus and scope

PhD funding project: https://www.tudelft.nl/en/2020/tu-delft/eur33m-research-funding-to-establish-trust-in-the-internet-economy Duration: 1 Sep 2023 - 1 Sep 2027

First weeks: reading and learning. See this looong Tribler reading list of 1999-2023 papers, the "short version". Long version is 236 papers :smile: . Run Tribler from the sources.

Before doing fancy decentralised machine learning or learn-to-rank, first have stability, semantic search, and classical algorithms deployed. Current dev team focus: https://github.com/Tribler/tribler/issues/3868

update: Sprint focus? Read more Tribler articles and get this code going again: https://github.com/devos50/decentralized-rules-prototype

pneague commented 9 months ago

I have been working on understanding the work done by Martijn on ticket 42. I read through it and downloaded the attached code.

The last version of the code had a couple of functions not yet implemented, so I reverted to the 22-06-2022 version (instead of the last version, uploaded on 27-06-2022).

The 22-06-2022 version had a few outdated functions and small bugs here and there, but since they were minor I was able to fix them.

I downloaded the required dataset and then successfully ran the parser and scenario-creation functions implemented by Martijn. After that I ran the experiment itself based on the above-mentioned scenario, resulting in a couple of CSVs and graphs.

I understand the general idea of the experiments and how they work, but the code still eludes me since it is barely commented. Here's an example of a graph from an experiment run with Martijn's code so far (image attached).

synctext commented 9 months ago

Hmmm, very difficult choice. For publications we should focus on something like Web3AI: deploying decentralised artificial intelligence

pneague commented 8 months ago

Re-read papers on learn-to-rank and learned how to use IPv8. With it I created a simulation in which a number of nodes send messages to one another. From there I worked with Marcel and started implementing a system whereby one node sends a query to the swarm and receives recommendations of content back from it. The progress is detailed in ticket 7290. The idea at the moment is that we implement a version of Mixture-of-Experts (https://arxiv.org/pdf/2002.04013.pdf) whereby one node sends the query to other nearby nodes and receives recommendations. These are then aggregated into a shortened, sorted list of recommendations for the querying node.
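To make the aggregation step concrete, here is a rough sketch of how a querying node could merge the ranked lists it gets back (plain Python with placeholder names; the actual implementation lives in Marcel's p2p-ol2r code and may differ):

```python
from collections import defaultdict

def aggregate_recommendations(peer_results, top_k=10):
    """Merge ranked recommendation lists from several peers into one short list.

    peer_results: list of lists of (doc_id, score) tuples, one list per responding peer.
    Scores are assumed to be comparable across peers (e.g. softmax probabilities).
    """
    combined = defaultdict(float)
    votes = defaultdict(int)
    for results in peer_results:
        for doc_id, score in results:
            combined[doc_id] += score   # accumulated confidence across peers
            votes[doc_id] += 1          # how many peers recommended this doc

    # Rank by accumulated confidence, break ties by number of votes.
    ranking = sorted(combined, key=lambda d: (combined[d], votes[d]), reverse=True)
    return ranking[:top_k]

# Example: two peers answer the same query with partially overlapping results.
peer_a = [("doc_17", 0.6), ("doc_3", 0.3)]
peer_b = [("doc_3", 0.7), ("doc_42", 0.2)]
print(aggregate_recommendations([peer_a, peer_b]))  # ['doc_3', 'doc_17', 'doc_42']
```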

There are 2 design choices: we could send the query-doc_inferior-doc_superior triplets around as gossip, or (as we do at the moment) send the model updates around every run. We'll look deeper into these ideas.

One issue discovered regards the size of the IPv8 network packet, which is currently smaller than the entire model serialized with PyTorch; Marcel is working on that. We have 720k weights at the moment, and the maximum network packet size for IPv8 is 2.7 MB, so we have to fit in as many weight updates per packet as possible.
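As a rough illustration of the packet-size problem, splitting a serialized model into chunks that fit the 2.7 MB budget could look like this (a sketch only; the real IPv8 payload handling is Marcel's work and will differ):

```python
import io
import torch

MAX_PACKET_BYTES = int(2.7 * 1024 * 1024)  # packet budget mentioned above

def serialize_model(model: torch.nn.Module) -> bytes:
    """Serialize a PyTorch state_dict to raw bytes."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getvalue()

def split_into_packets(blob: bytes, max_size: int = MAX_PACKET_BYTES):
    """Split serialized weights into chunks that each fit in one packet."""
    return [blob[i:i + max_size] for i in range(0, len(blob), max_size)]

def reassemble(chunks) -> dict:
    """Reassemble chunks (received in order) back into a state_dict."""
    buffer = io.BytesIO(b"".join(chunks))
    return torch.load(buffer)
```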

You can see a demonstration of the prototype below (demo attachment).

I'm currently working on how to aggregate the recommendations of the swarm (for example, what happens when the recommendations from each node that received the query are entirely different). My branch on Marcel's repository: https://github.com/mg98/p2p-ol2r/tree/petrus-branch

synctext commented 8 months ago

It's beyond amazing what you accomplished in 6 weeks after starting your PhD. :unicorn: :unicorn: :unicorn: Is the lab now All-In on Distributed AI? :game_die:

Can we upgrade to transformers? That is the cardinal question for scientific output. We already had Distributed AI deployed, in unusable form, within our Tribler network in 2012. Doing model updates is too complex compared to simply starting with sending training triplets around in an IPv8 community. The key is simplicity, ease of deployment, correctness, and ease of debugging. Nobody has a self-organising live AI with lifelong learning, as you have today in embryonic form. We even removed our deployed ClickLog code in 2015 because it was not good enough. Options:

For a YouTube-alternative smartphone app we have a single simple network primitive: query, content-item-clicked, content-item-NOT-clicked, clicked-item-popularity, signature; and in TikTok form without queries but with viewing attention time added: content-item-long-attention, long-attention-time, content-item-low-attention, low-attention-time, long-attention-item-popularity, signature. Usable for content discovery, cold starts, content recommendation, and obviously semantic search. A sketch of these records follows below.
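As a sketch of what such primitives could look like on the wire (field names here are illustrative, not a deployed format):

```python
from dataclasses import dataclass

@dataclass
class ClickLogRecord:
    """One search interaction, signed by the reporting peer (field names are illustrative)."""
    query: str                     # what the user typed
    clicked_item: str              # infohash of the content item that was clicked
    not_clicked_items: list[str]   # infohashes shown but skipped
    clicked_item_popularity: int   # swarm popularity of the clicked item
    signature: bytes               # signature by the reporting peer over the record

@dataclass
class AttentionRecord:
    """TikTok-style variant without queries, based on viewing attention time."""
    long_attention_item: str       # item the user watched for a long time
    long_attention_seconds: float
    low_attention_item: str        # item the user skipped quickly
    low_attention_seconds: float
    long_attention_item_popularity: int
    signature: bytes
```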

Next sprint goal: get a performance graph! We need to get a paper on this soon, because the field is moving at lightning speed. So up and running before X-Mas, Tribler test deployment, and usage of NanoGPT in Jan, paper in Feb :rocket:

pneague commented 8 months ago

After looking into which datasets we could use for training a hypothetical model, I found ORCAS, which consists of almost 20 million queries paired with the relevant website link for each query. It is compiled by Microsoft and represents searches made on Bing over a period of a few months (with a few caveats to preserve privacy, such as only including queries that were searched a number of times and omitting user IDs and the like).

The data seems good, but the fact that we have links instead of document titles makes it impossible to use the triplet model we have right now (which needs to compute the 768-dimensional embedding of the document title: since we only have a link and no title, we cannot do that).

So I was looking for another model architecture usable in our predicament and I found Transformer Memory as a Differentiable Search Index. The paper argues that instead of using a dual-encoder method (where we encode the query and the document in the same space and then find the document that is the nearest neighbour of the query), we can use a differentiable search index (DSI), where a neural network maps the query directly to the document. The paper presents a number of methods to achieve this, but the easiest one for me to implement at this time was to simply assign each document a number, make the output layer of the network contain as many neurons as there are documents, and have the network essentially assign probabilities to each document given a query. Additionally, the paper does this with a Transformer architecture, raising the possibility of integrating NanoGPT into the future architecture.

I implemented an intermediary version of the network whereby the same encoder that Marcel used (the allenai/specter language model) encodes a query and the output is the probability of each document individually. The rest of the architecture is left unmodified:

```python
layers = [
    ('lin1', nn.Linear(768, 256)),  # encoded query, 768 dimensions
    ('relu1', nn.ReLU()),
    ('lin2', nn.Linear(256, 256)),
    ('relu2', nn.ReLU()),
    ('lin3', nn.Linear(256, 256)),
    ('relu3', nn.ReLU()),
    ('lin4', nn.Linear(256, number_of_documents)),  # output probabilities
]
```

In my preliminary tests so far, with 884 documents (i.e. 884 output neurons) we can perform 50 searches in 4 seconds (about one search per 0.08 seconds). With 1,066,561 documents, 50 searches complete in 200 seconds (one search per 4 seconds). Under some circumstances this may be acceptable for Tribler users, but people with older computers might experience significant difficulties. I will need to look at ways of reducing the computation time required.
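For reference, a minimal sketch of how that head sits behind the SPECTER encoder at inference time (untrained weights, illustrative only; the real training loop is omitted):

```python
from collections import OrderedDict
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

number_of_documents = 884  # size of the document collection (output layer width)

# Query encoder: allenai/specter produces a 768-dimensional [CLS] embedding.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
encoder = AutoModel.from_pretrained("allenai/specter")

# The MLP head from the layer list above, assembled into a model.
head = nn.Sequential(OrderedDict([
    ("lin1", nn.Linear(768, 256)),
    ("relu1", nn.ReLU()),
    ("lin2", nn.Linear(256, 256)),
    ("relu2", nn.ReLU()),
    ("lin3", nn.Linear(256, 256)),
    ("relu3", nn.ReLU()),
    ("lin4", nn.Linear(256, number_of_documents)),
]))

def search(query: str, top_k: int = 5):
    """Return the top_k document indices for a query (weights are untrained here)."""
    with torch.no_grad():
        tokens = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)
        embedding = encoder(**tokens).last_hidden_state[:, 0, :]   # (1, 768) CLS vector
        probabilities = torch.softmax(head(embedding), dim=-1)     # (1, number_of_documents)
        return probabilities.topk(top_k).indices.squeeze(0).tolist()

print(search("decentralized learn to rank"))
```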

Moving forward, I'm looking to finally implement a good number of peers in a network that send each other the query and answer (from ORCAS) and get the model to train.

qstokkink commented 8 months ago

Cool stuff :+1: Could you tell me more about your performance metrics? I have two questions:

  1. Are these SIMD results (i.e., one batch of 50 searches takes 200 seconds, but a batch with 1 search also takes 200 seconds)?
  2. What hardware did you use (e.g., CPU, some crappy laptop GPU, HPC node with 10 Tesla V100's, ..)?

This matters a lot for deployment in Tribler.

pneague commented 8 months ago
  1. They are not SIMD. One search actually takes 1/50th of the mentioned time
  2. I used a Mac laptop with M2 Pro Chip

But keep in mind, this is extremely preliminary; I did not implement NanoGPT with this setup, so that's bound to increase computing requirements.

synctext commented 7 months ago

Paper idea to try out for 2 weeks:

LLM-for-search related-work example on GitHub, called vimGPT:

https://github.com/ishan0102/vimGPT/assets/47067154/467be2ac-7e8d-47de-af89-5bb6f51c1c31

pneague commented 7 months ago

I got the T5 LLM to generate the IDs of ORCAS documents. Current setup:
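Roughly, it boils down to fine-tuning T5 to map query text to a docid string and generating at search time. A hedged sketch with a stock Hugging Face T5 (the actual fine-tuned checkpoint and the docid vocabulary it was trained on differ):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# In the real setup the model is fine-tuned on (query -> docid-string) pairs from ORCAS.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def predict_docids(query: str, num_candidates: int = 5):
    """Generate candidate document IDs for a query with beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_candidates,
        num_return_sequences=num_candidates,
        max_new_tokens=16,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]

print(predict_docids("tribler p2p search"))
```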

I was looking for what to do moving forward.

I found a survey paper on the use of LLMs in the context of information retrieval. It was very informative; there is a LOT of research in this area at the moment. I made a list of 23 papers referenced there that I'm planning to go through at an accelerated pace. At the moment I'm still wondering what to do next to make the work I've already done publishable for the conference deadline on the 5th of January.

synctext commented 7 months ago

update Please try to think a bit already about the next step/article idea for the upcoming summer :sun_with_face: :tropical_drink:? Can you think of something where users donate their GPU to Tribler and get a boost in their MeritRank as a reward :1st_place_medal: :heavy_plus_sign: the Marcel angle of "active learning" by donating perfect metadata? Obviously we need the ClickLog deployment and crawling deployed first.

pneague commented 6 months ago

In the past weeks I've managed to set up 10 peers that send each other query-doc_id pairs.

The mechanism implemented is the following:

For the future, I think it may be worthwhile to use DAS6 to run a test with 100 peers, to check the integrity of the model and how it evolves as the number of peers increases.

synctext commented 6 months ago

AI with access to all human knowledge, art, and entertainment.

AGI could help humanity by developing new drugs and treatments for diseases, and by turbocharging the global economy. Who would own this AGI? Our dream is to contribute to this goal by pioneering a new ownership model for AI and a novel model for training. AI should be public and contribute to the common good. More than just open weights: full democratic self-governance. An open problem is how to govern such a project and devise a single roadmap with conflicting expert opinions. Current transformer-based AI has significant knowledge gaps and needs thousands or even millions of people to tune it. It needs the Wikipedia paradigm! Gemini example: what is the most popular YouTube video? The state-of-the-art AI fails to understand the concept of media popularity, front-page coverage, and the modern attention economy in general.

Related: How is AI impacting science? (Metascience 2023 Conference in Washington, D.C., May 2023.)

synctext commented 5 months ago

Public AI with associative democracy

Who owns AI? Who owns the Internet, Bitcoin, and BitTorrent? We applied public infrastructure principles to AI. We are building an AI ecosystem which is owned by both nobody and everybody. The result is a democratically self-governing association for AI.

We pioneered 1) a new ownership model for AI, 2) a novel model for training, and 3) competitive access to GPU hardware. AI should be public and contribute to the common good. More than just open weights, we envision full democratic self-governance. Numerous proposals have been made for making AI safe, democratic, and public. Yet these proposals are often grounded exclusively in either philosophy or technology. The technological experts who build databases, operating systems, and clouds rarely interact with the experts who deeply understand the question 'who has control?'. Democracy is still a contested concept after centuries. Self-governance is the topic of active research, both in the world of atoms and the world of bits. Complex collective infrastructure with self-governance is an emerging scientific field. Companies such as OpenAI run on selling their AI dream to ageing companies such as Microsoft. There is a great need for market competition and a fine-grained supply chain; the lack of fine-grained competition in a supply-chain ecosystem is hampering progress. Real-world performance results irrefutably show that the model architecture is not really that important: it can be classical transformers, Mamba, SSM, or RWKV. The training set dominates the AI effectiveness equation. Each iteration brings more small improvements to a whole ecosystem, all based on human intelligence. Collective engineering on collective infrastructure is the key building block towards creating intelligence superior to the human intellect.

AI improvements are a social process! The way to create long-enduring communities is to slowly grow and evolve them. The first permissionless open-source machine learning infrastructure was Internet-deployed in 2012. However, such self-ruled communities only play a minor role in the AI ecosystem today. The dominating AI architecture is fundamentally unfair. AI is expensive and requires huge investments: an exclusive game for the global tech elite. Elon Musk compared the ongoing AI race to a game of poker, with table stakes of a few billion dollars a year. Such steep training costs and limited access to GPUs cause Big Tech to dominate this field. These hurdles notably affect small firms and research bodies, constraining their progress. Our design splits the ecosystem by creating isolated, competitive markets for GPU renting and training-set storage. Our novel training model brings significant synergy, similar to the Linux and Wikipedia efforts. By splitting the architecture and having fine-grained competition between efforts, the total system efficiency is significantly boosted. It enables independent evolution of dataset gathering, data storage, GPU rental, and AI models. Our third pioneering element is democratic access to GPU hardware. One branch of distributed machine learning studies egalitarian architectures, where even a tiny smartphone can contribute to the collective. A billion smartphones, in theory, could significantly outsmart expensive hardware. Wikipedia and Linux have proven that you can't compete with free. We mastered the distributed, permissionless, and egalitarian aspects of AI. The next stage of evolution is to add democratic decision-making processes. A team of 60 master students is currently attempting to engineer this world-first innovation collectively. Another huge evolutionary leap is AI with access to all human knowledge, art, and entertainment. Currently, datasets and training hardware are expensive to gather and store. For instance, the open-access movement for scientific knowledge has not yet succeeded in creating a single repository; the training of next-generation AI requires completion of this task. All creative-commons content (text, audio, video, DNA, robotics, 3D) should be scripted into an expanding living dataset, similar to the SuperGLUE set of datasets. The cardinal problem is building trust in the data, its accuracy, and its legal status. In prior work we pioneered a collective data vault based on passport-grade digital identity.

pneague commented 5 months ago

In the last few weeks I ran experiments with ensembles of peers. Experiments with more than 10 peers make the laptop run out of RAM and start acting weirdly, so I had to change the direction of my work. The current idea is that T5-small is not able to fit that many doc_ids inside its weights (because it is so small), but we need it to be small for it to run on the computers of Tribler peers.

So, to increase the number of retrievable documents, I thought of sharding the dataset, with each shard having its own peers. In the experiments performed, each shard consists of 10 peers.

Model ensemble from different shards (diagram). It shows how a 2-shard ensemble would work with the voting and confidence mechanism (in the previous iteration the models were chosen randomly, without caring how many models we get from each shard).
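In code, the voting-and-confidence aggregation is approximately the following (a simplified sketch, not the exact experiment code):

```python
def ensemble_answer(shard_predictions):
    """Pick a final docid from per-shard model outputs.

    shard_predictions: {shard_id: list of (docid, confidence)}, where each entry is
    the prediction of one model belonging to that shard.

    Within a shard the models vote; across shards the winning candidates compete
    on average confidence. This only roughly mirrors the diagram above.
    """
    shard_winners = []
    for shard_id, predictions in shard_predictions.items():
        # Majority vote inside the shard, summed confidence as tie-breaker.
        tally = {}
        for docid, confidence in predictions:
            votes, conf_sum = tally.get(docid, (0, 0.0))
            tally[docid] = (votes + 1, conf_sum + confidence)
        docid, (votes, conf_sum) = max(tally.items(), key=lambda kv: kv[1])
        shard_winners.append((docid, conf_sum / votes))
    # The shard whose winner has the highest average confidence provides the answer.
    return max(shard_winners, key=lambda item: item[1])[0]

# Example: shard 0's models agree on doc_a, shard 1's single model prefers doc_b.
print(ensemble_answer({0: [("doc_a", 0.8), ("doc_a", 0.7)], 1: [("doc_b", 0.6)]}))  # doc_a
```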

synctext commented 4 months ago

Solid progress! Operational decentralised machine learning :rocket: :rocket: :rocket: De-DSI for the win.

Possible next steps are enabling unbounded scalability and on-device LLMs. See Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis, or the knowledge-graph direction. We might want to schedule both! New hardware will come for the on-device 1-bit LLM era.

update: Nature paper :astonished: It uses an LLM for parsing 1200 sentences and 1100 abstracts of scientific papers, avoiding the hard work of PDF knowledge extraction. Structured information extraction from scientific text with large language models: this work outputs entities and their relationships as JSON documents or other hierarchical structures.

pneague commented 3 months ago

Fresh results from DAS6 for magnet link prediction:

- 1,000 docs: 90.5%
- 5,000 docs: 77%
- 10,000 docs: 65%

See comparison between predicting docids vs magnet links: image

When the dataset is relatively small, the accuracies are the same for both top-1 and top-5. As more data appears in the dataset, we can see a divergence in the accuracies for the two metrics. We hypothesize that the limited number of weights in our model efficiently captures URL patterns in scenarios with sparse data. However, as the data complexity increases, this constraint appears to hinder the model's ability to accurately recall the exact sequence of tokens in each URL. This is merely a guess, and we intend to investigate it further in future work. However, the observed discrepancy in accuracy remains marginal, amounting to merely a few percentage points across a corpus of 10K documents.

pneague commented 2 months ago

Poster for the De-DSI paper: De-DSI Poster.pdf

pneague commented 1 month ago

One of the ideas to further develop the De-DSI paper was to perform the sharding division of documents in a semantically meaningful way. This is what I've done in the past couple of weeks.

The problem was that if you shard documents randomly, you can have 2 similar documents in different shards, and when querying all shards for one of the 2 documents, you get high confidence from both shards. This leads to a 50/50 chance that the correct shard has higher confidence than the incorrect one.

The idea was to perform semantic sharding such that all documents of a type end up in one shard. This would resolve the confusion between shards, as each shard would know which document needs to be retrieved and the others would have low confidence in their result.
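Concretely, one simple way to obtain such a semantic split is to cluster document embeddings; a minimal sketch with scikit-learn KMeans (the actual sharding code may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_shards(doc_embeddings: np.ndarray, num_shards: int):
    """Assign each document to a shard by clustering its embedding.

    doc_embeddings: (num_docs, dim) array, e.g. SPECTER embeddings of titles or queries.
    Returns a list of document-index arrays, one per shard.
    """
    labels = KMeans(n_clusters=num_shards, n_init=10, random_state=0).fit_predict(doc_embeddings)
    return [np.where(labels == shard)[0] for shard in range(num_shards)]

# Example: split 1000 random 768-dimensional "embeddings" into 4 shards.
print([len(s) for s in semantic_shards(np.random.rand(1000, 768), num_shards=4)])
```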

So I:

I compared the results and it turns out it doesn't work as hoped.


I believe the issue is that if shards contain semantically similar documents, it is harder to distinguish between them, so the confidence of the correct shard is lower than before. This means that documents in other shards which are more different but still slightly similar have a higher chance than before of beating the confidence of the correct document in the correct shard.

I thought about what exactly I could do about this but I haven't come up with anything yet. Jeremie recommended I look into fully decentralized ML training that is resistant to certain kinds of attacks. I have an idea of how it may be done, but I need to read more on it first as it's a new topic to me.

pneague commented 1 month ago

In the last few days I've read papers on

I also thought about how a mixture-of-experts with multi-layered semantic sharding would work. At the moment something that I could try would be:

I also haven't found any paper on personalized models in decentralized federated learning, so it seems to be an unexplored gap and thus maybe easy to publish in.

synctext commented 1 month ago

Focus on finding a PhD problem to solve. Avoid the "technology push" that makes much science useless. We need GPUs for training. We need a dataset. We need a publishable problem.

Perhaps it is time to dive for 3 weeks into a production system? Some ideas and links

Hipster publishable idea: secure information dissemination for decentralised AI (e.g. MeritRank, ClickLog, long-lived IDs, sharing data, not an unverifiable vector of gradient descent)

pneague commented 2 weeks ago

In the last few weeks I looked into methods of estimating reputation and sybil defense in a graph network using ML models. There are quite a few methods for doing this in all kinds of areas, for example edge-computing devices, social networks, etc.

After talking with Bulat, he suggested we could try to use MeritRank and some kind of model to limit the amount of resources that a sybil attack could sap from the network. The idea is still in an incipient phase and it's not clear to me yet whether it works. Bulat suggested that instead of doing what other papers have done (for example, the papers doing reputation estimation on social networks were using social-network information to find sybils), we could try to do this solely from the graph data. I'm not sure if this is possible, but I think it's in the realm of possibility.

Additionally, we would not use a supervised-learning method where the sybils are clearly labelled, but take a dataset where we assume all members of the graph to be honest, then perform all types of sybil attacks possible on the network and see if we can somehow limit how much attackers gain from this. We could also implement the methods of previous papers and compare our results to theirs in a situation where all types of sybil attacks are simulated. Bulat mentioned he doesn't know of a paper taking this approach so far.
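The shape of such an experiment could be something like the sketch below, which uses personalized PageRank on a networkx graph as a stand-in reputation score (MeritRank itself works differently; everything here is illustrative):

```python
import networkx as nx

def sybil_gain(honest_graph: nx.DiGraph, seed, attacker, num_sybils: int = 10):
    """Attach a sybil region to a trust graph and measure the reputation it captures.

    Personalized PageRank from `seed` is used as a stand-in reputation score;
    this only illustrates the experiment shape, not the MeritRank mechanism.
    """
    attacked = honest_graph.copy()
    sybils = [f"sybil_{i}" for i in range(num_sybils)]
    for s in sybils:  # attack edges between the attacker-controlled node and its sybils
        attacked.add_edge(attacker, s)
        attacked.add_edge(s, attacker)
    personalization = {n: (1.0 if n == seed else 0.0) for n in attacked.nodes}
    scores = nx.pagerank(attacked, personalization=personalization)
    return sum(scores[s] for s in sybils)  # total reputation captured by the sybil region

# Example: a random directed "honest" graph of 50 peers, node 7 turns attacker.
honest = nx.gnp_random_graph(50, 0.1, seed=1, directed=True)
print(sybil_gain(honest, seed=0, attacker=7, num_sybils=10))
```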

I have also talked with Quinten about the dataset his code is collecting. It's interesting but not very rich, even if we may end up with lots of data. You can see a very small sample meant as an example below: image

Basically we have query, infohash, score, parent_query_forwarding_pk. The score is calculated as follows: if you search for a query, click a link, and don't search for the same query again, you're assumed to be satisfied with the link, so the score = 1.0. If you search for a query, click a link, and are not satisfied, you search for the query again; if you then click a second link and are satisfied, you stop there. In that case the first link clicked gets a score of 0.2 and the second link clicked gets a score of 0.8.
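A small sketch that encodes exactly this rule (the deployed crawler code is Quinten's and may differ; this only reproduces the two cases described above):

```python
def score_clicks(clicked_links):
    """Assign satisfaction scores to the links clicked for one query session.

    clicked_links: infohashes clicked for the same query, in order. Only the two
    cases described above are encoded; scoring for longer sessions is not specified here.
    """
    if len(clicked_links) == 1:
        return {clicked_links[0]: 1.0}   # no re-search: assumed satisfied
    if len(clicked_links) == 2:
        return {clicked_links[0]: 0.2,   # first result did not satisfy
                clicked_links[1]: 0.8}   # second result ended the session
    raise ValueError("scoring for more than two clicks is not defined in this sketch")

print(score_clicks(["infohash_a"]))                # {'infohash_a': 1.0}
print(score_clicks(["infohash_a", "infohash_b"]))  # {'infohash_a': 0.2, 'infohash_b': 0.8}
```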

This is interesting, and may provide a way to derive reputation (for the person seeding the content behind the first link and for the person gossiping the queries). But I am not sure we can do it well if we don't have many users relative to the number of links available. We'll have to see how much data we end up with in a few months.

synctext commented 2 weeks ago

btw about teaching... prepare for helping out more with MSc students + the master course Blockchain Engineering.

update: machine learning for 1) personalisation, 2) De-DSI content discovery, 3) decentralised seeder content discovery {DHT becomes :point_right: IPv4 generative AI}, 4) sybil protection, 5) spam protection, 6) learn-to-rank