Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

literature survey + master thesis: G-Rank learn-to-rank #5313

Closed synctext closed 1 year ago

synctext commented 4 years ago

Direction changed, txt will be updated soon.

Old stuff:

awrgold commented 1 year ago

Here's a clarification I feel is necessary:

Item-based collaborative filtering (as opposed to user-based CF) is a method to estimate the rating a user would give a previously unseen item, based on the ratings of items that user has already seen. However, as most of us know, we don't rate songs in our library (at least not in Spotify/Apple Music); we just add them to our playlists and sometimes "like" them. This isn't even a clean binary rating system (liked vs. not liked), as it's entirely reasonable to assume that the average person has songs in their library that they very much enjoy but have never "liked." I still have thousands of songs I listen to that I haven't given a "like" to.

How then does one determine a rating system for songs? Ratings might be more feasible for services such as Amazon, Uber, Thuisbezorgd, etc. because it's a more natural response to rate something tangible after consuming it, especially in the context that each item/service is consumed infrequently.

As such, my thinking is this:

It adds no complexity, but to me it seems like a much more logical way to determine popularity. However, the frequency with which songs are listened to varies dramatically, and this frequency does not always indicate how much a user actually ENJOYS the song. Spotify assumes I absolutely love heavy metal and hard rock because I listen to it when I exercise, and I exercise very often. However, I almost never listen to it outside the context of exercise, and I tend to ignore most of Spotify's suggestions for new songs because I like this specific playlist as it is.

The data scientist in me is starting to think of various latent features that can be extracted from things like "is this song in a playlist? How many playlists is it in? How similar are these playlists?" but this is not the point of the thesis.

So to answer your comment about adding a "not clicked log" I think this is a reasonable response.

Edit1: I said "replace the clicklog with a playlog" and that was incorrect. Instead, I wish to create a rating system based on an estimate of the number of times a user will play a song; over time, a clicklog will augment and rank the set of items shown to a user during a query.

synctext commented 1 year ago

Item-item collaborative filtering requires some form of scoring/rating function per user;

Could you please explain your thinking in learn-to-rank terminology? What and how are you cherry-picking from collaborative filtering for learn-to-rank? It seems to me that you're doing some sort of "personalised recommendation, given the user search query"? That part is unclear to me.

How then does one determine a rating system for songs?

Can we drop ratings for your learn-to-rank?

I think we're getting somewhere, after re-reading your comments a few times. You've zoomed in on the item-to-item approach. Are you building an item-based, user-based, or "keyword-based" learn-to-rank algorithm, inspired by collaborative filtering? See this blog for how I'm thinking about this: https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f With your Spotify examples you seem to desire meaningful results. Please make a non-meaningful approach for your "click modelling" (the correct scientific term, it seems). For instance, the user always selects the song which matches the number of characters of the user query. So: a toy example .py notebook, a tiny artificial dataset, deliberately non-meaningful results, and unrealistic decentralisation middleware. This as-simple-as-possible approach gets to the core science without any noise (yes, that is totally weird).

awrgold commented 1 year ago

Could you please explain your thinking based on learn-to-rank terminology?

This is where I'm stuck the most, as I do recognize I'm focusing more on "recommendation" than "ranking," at least at face value. At its heart, I'd say the ranking problem is a form of recommendation, just for a series of items based on a score. The additional variable is that we're recommending a subset of items, in some particular order, based on a specific search term.

When it comes to LTR, let's say User 0 is the very first person to search for anything after the model goes live. Because the model is unsupervised, it is completely untrained to RANK at this stage (but it can recommend), meaning the results it obtains will be purely based on the songs in this user's library (if there are any) compared with the songs other users have in their libraries.

For either option, we can then use CF to estimate what User 0 would "rate" every song in the global dataset, and rank it in order from highest to lowest. Pruning based on text filters, string matching, whatever - I won't focus on that for now.

The next part is where the real Learning to Rank comes into play.

As more and more users start searching for songs, these "ratings" (which in this simple case are just the estimated number of times a user would have played each song, thereby estimating its popularity with the user) will be associated with a clicklog. Based on the list of songs we have recommended to a user conducting a search, the song that they click on will have its clicklog updated, which will act as a weight coefficient for the chosen song.

synctext commented 1 year ago

When it comes to LTR, let's say User 0 is the very first person to search for anything after the model goes live.

Please ignore such bootstrap effects completely for your master thesis! Assume we have a million users (e.g. Google-scale emergence). Just "recommend" something purely random and learn from a ClickLog.

awrgold commented 1 year ago

Please ignore this completely for your master thesis! Assume we have a million users (e.g. Google scale emergence) Just "recommend" something purely random and learn from a ClickLog.

When it comes to an unsupervised model (at least the one I'm chewing on), the history or sparsity of the dataset does not matter at all; the algorithm remains unchanged whether there is 1 example or 1 billion examples. I'm just trying to demonstrate the transformation the search results will undergo as more and more searches are conducted.

We can of course just use a synthetic clicklog (such as the 250-song clicklog you gave me a while ago), but the algorithm will not change. The User 0 example is more of an extreme case in which the rating system is completely uninfluenced by a clicklog (e.g. the coefficient for all variables is 1), such that the ranking is based purely on the CF score, and not the CF score * clicklog.

synctext commented 1 year ago

Based on the list of songs we have recommended to a user conducting a search, the song that they click on will have its clicklog updated, which will act as a weight coefficient for the chosen song.

Hopefully that was just a quick writing style. Songs don't have clicklogs. Clicklogs are never 'updated'; they continuously grow. The list of {user_ID, search_query, item_ID_clicked} entries just grows in time with new entries. Or did you use a different (e.g. matrix) data structure?
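For reference, a minimal sketch of such an append-only ClickLog structure (illustrative Python only; the class and field names are assumptions, not the actual thesis code):

```python
# Minimal sketch of an append-only ClickLog, assuming the
# {user_ID, search_query, item_ID_clicked} entry format described above.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ClickEntry:
    user_id: int
    query: str
    item_id: int

@dataclass
class ClickLog:
    entries: list = field(default_factory=list)

    def record(self, user_id: int, query: str, item_id: int) -> None:
        # Entries are only ever appended; nothing is updated or removed.
        self.entries.append(ClickEntry(user_id, query, item_id))

log = ClickLog()
log.record(user_id=0, query="post.metal", item_id=42)
log.record(user_id=1, query="instrumental", item_id=42)
print(len(log.entries))  # 2 -- the log only grows over time
```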

I'm just trying to demonstrate the transformation the search results will undergo as more and more searches are conducted.

Sounds like very useful work! Sorry, I'm still unable to grasp exactly what you are doing. But that's fine for now.

as noted 18 days ago

Please do a 1-page problem description + design write-up. Should be in master thesis writing level.

The aim is that this clarifies matters. Guess you want to focus on getting the first skeleton code to work and obtain the milestone of a first graph.

awrgold commented 1 year ago

Or did you use a different (e.g. matrix) data structure?

I was going to utilize a global clicklog for all queries and clicks

awrgold commented 1 year ago

Alright so after several weeks away I've been thinking about my approach so far:

Item-based collaborative filtering is not the way I should be approaching this topic. Neither is the clicklog (on its own). Too much of this problem comes down to natural language processing and direct search methodology, where most of my time will be spent matching queries with attribute values.

If this is going to be approached from an unsupervised angle, we need to be able to determine rankings based on the individual doing the searching.

First things first, I was reading a (10+ year old) article about Tribler stating that the BitTorrent client had some kind of search mechanism, and that users could be recommended new content from the network once there were enough clients running the software, but beyond that I was unable to find any more details. How exactly does/did this work?

Secondly, we need to decide if there is going to be anything beyond basic string matching to pare down the query results. If I stick to pure string matches, I can focus entirely on ranking the results based on user attributes such as library contents, search history, etc.

Third, I'm going to either augment a current dataset or find/create a new one so that I can have slightly more variation in my search results.

Nobody seems to be doing exactly what we're doing, i.e. they may be doing unsupervised ranking or decentralized search and recommendation, but they're not doing unsupervised decentralized ranking.

synctext commented 1 year ago
awrgold commented 1 year ago

Quick update:

I now have a clicklog model and simulation that will (mostly) do the following:

I say "mostly" because I'm dealing with a few bugs that are eluding me, in particular if a node performs the same search twice in my simulation, it does not update the clicklog. Trying to figure that one out right now...

synctext commented 1 year ago

Related work from ancient history: 2012, Distributed Machine Learning Using the Tribler Platform, by an MSc student of Mark Jelasity.

awrgold commented 1 year ago

Alright, so now everything mentioned above is working.

However, there's one primary issue: My simulation is uniform randomly choosing a result from a top-10 list of results, which essentially means the simulation in its current form can never "converge" on the "correct" item(s).

Reminder that the entire corpus from which I'm randomly choosing search terms is built from words and terms contained in the list of songs, so there is always at least one "match" for each query.

The clicklog, in its current form, starts in an empty state with no results whatsoever, in which case the simulation returns 10 random songs with no string matching at all, so the initial state of the very first query-results pair seen by the clicklog strongly influences which items each subsequent search has to choose from.

YES, I remember that you said "leave the cold start problem to others" but building this algorithm was the first step towards building a synthetic clicklog with past results.

The next step of creating a synthetic clicklog with past results will either be tedious manual work, or I will have to use some kind of semi-stochastic process to generate "realistic" clicks, e.g. if the search query term is "Bohemian Rhapsody" and the top-N results don't contain any Queen songs or really any 70's/80's rock music, the user is unlikely to click anything.
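A minimal sketch of what such a semi-stochastic click generator could look like (the tag fields and the click probability are assumptions for illustration, not the actual simulation code):

```python
import random

def simulate_click(query_terms, results, p_click_relevant=0.8):
    """Toy 'semi-stochastic' click model: the simulated user clicks only if
    at least one result shares a term with the query, and even then skips
    the click with some probability to mimic human indifference."""
    relevant = [r for r in results if query_terms & set(r["terms"])]
    if not relevant:
        # e.g. a "Bohemian Rhapsody" query with no Queen or 70s/80s rock results
        return None
    if random.random() > p_click_relevant:
        return None  # the user was not in the mood to click anything
    return random.choice(relevant)

results = [
    {"song_id": 1, "terms": {"queen", "rock", "70s"}},
    {"song_id": 2, "terms": {"jazz", "piano"}},
]
print(simulate_click({"rock", "bohemian"}, results))
```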

Simulating this human element is going to be the next major challenge. We can discuss next week, but I'll keep working for now on a clicklog that has past results, so that we can see some kind of convergence towards "correct" results.

https://github.com/awrgold/distributed-learn-to-rank-v1/blob/main/Building%20v2.ipynb

synctext commented 1 year ago

Expand operational scrape code to build tag corpus.

{repeating} - Let's focus on click modelling and zero distractions until we have running code. It can also be another concrete part of your thesis. That click modelling is sort of the ground truth, but is lacking in the real world. Deterministic random!?! https://scholar.google.com/scholar?q=click+modelling&pagesize=100

This thesis presents two experiments on learn-to-rank. First is a validation experiment. The search corpus uses tags as the search terms. Our starting point is a music dataset with tags (256 albums x 3 tags/item = 768 total tags). We crafted 32 synthetic users which each have a unique ClickLog. We use the tags as search terms and round-robin rotate through them (e.g. recycle). The Air Your Grievance album has tags post.metal, instrumental, and post.rock. Users 1, 2, and 3 issue the search terms post.metal, instrumental, and post.rock and all click on the same item: the Air Your Grievance album.

Users also search for multiple items. User 1 only has 1 item in their ClickLog, item number 1. User 2 has all items of User 1 plus 1 additional item from the 256 albums. User 3 has all items of User 2 plus 1 additional item from the 256 albums, etc. User 32 thus has 32 items in their ClickLog and the associated tags. User 1 and User 2 behave very similarly; they have 50% overlap in their ClickLogs. User 31 and User 32 are extremely similar; they overlap for 31 of 32 items. Each of the 32 users differs in the tags they searched for. We also generate for each of the ClickLog entries of our 32 users the tag they used for searching. We use the unique 768 tag-item pairs to simulate their search terms and ClickLog entries. The following graphs present the outcome of using this synthetic clicklog within our experiments. Each user searches for one tag associated with their item (out of 768 item-tag pairs).
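A rough sketch of how such a nested synthetic ClickLog could be generated (tag and field names are illustrative placeholders; the real corpus uses scraped music tags):

```python
import random

random.seed(42)  # deterministic, in the spirit of "deterministic random"

N_ALBUMS, TAGS_PER_ALBUM, N_USERS = 256, 3, 32
# Hypothetical tag vocabulary standing in for the scraped tag corpus.
albums = {a: [f"tag_{a}_{t}" for t in range(TAGS_PER_ALBUM)] for a in range(N_ALBUMS)}

clicklogs = {}
previous_items = []
for user in range(1, N_USERS + 1):
    # User k keeps everything user k-1 clicked, plus one new album.
    new_item = random.choice([a for a in albums if a not in previous_items])
    previous_items = previous_items + [new_item]
    clicklogs[user] = [
        {"user": user, "query": random.choice(albums[item]), "clicked": item}
        for item in previous_items
    ]

print(len(clicklogs[1]), len(clicklogs[32]))  # 1 and 32 items respectively
overlap = len(set(e["clicked"] for e in clicklogs[31]) &
              set(e["clicked"] for e in clicklogs[32]))
print(overlap)  # 31 of 32 items overlap: users 31 and 32 are extremely similar
```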

We meticulously designed this validation experiment to converge. If a user merely clicks randomly we also expect a random outcome and are unable to see bugs or flaws. Our artificial users are highly similar and we can utilise this strong signal to converge. This provides some evidence for the correctness of our implementation.

Our above synthetic experiment enables us to compare the performance of three ML algorithms.

awrgold commented 1 year ago

Something I noticed immediately:

With the 39 users (I upped it to 39 because there are 39 tags, so each user searches explicitly for one unique tag and gets a set of unique results associated with that tag), the earlier nodes only have a few items in their clicklogs. If we're basing search results on the clicklog, these lower-ID users have fewer options to search from and cannot be exposed to new items unless we start introducing them to new items with the same tags; over a large number of searches this becomes virtually indistinguishable from just giving N users a clicklog of N items.

Otherwise, the other direction is if we were to allow nodes to search the clicklogs of other nodes for "inspiration" when gathering potential search results.

awrgold commented 1 year ago

Right now I have a clicklog with 39 unique entries - each unique node searches for one tag, and the next node appends the previous node's clicklog + 1 new item, for 39 entries. This is the baseline for the simulation. Thereafter, random nodes search for random tags, and we check for previous searches and clicks for that tag. Previous clicks are aggregated into a top-10 list presented in descending order by click count, so each [tag, song] pair is presented to the user ranked by popularity.
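A toy sketch of this per-tag top-10 step (field names are illustrative; the lowest-songID tiebreaker mentioned further down is included for determinism):

```python
from collections import Counter

def top10_for_tag(clicklog, tag):
    """Count previous clicks for this tag and return up to 10 songs,
    most-clicked first, ties broken by lowest song ID."""
    counts = Counter(entry["song_id"] for entry in clicklog if entry["tag"] == tag)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song_id for song_id, _ in ranked[:10]]

clicklog = [
    {"tag": "jazz", "song_id": 7}, {"tag": "jazz", "song_id": 7},
    {"tag": "jazz", "song_id": 3}, {"tag": "rock", "song_id": 3},
]
print(top10_for_tag(clicklog, "jazz"))  # [7, 3]
```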

However, right now if there are fewer than 10 clicked items for that tag, I have made a few separate functions:

One of the questions I have for the future has to do with the node's local library vs. the libraries of other nodes. Do we want to start investigating the libraries of other nodes, what is popular to them, and integrate that in the ranking algorithm? Or do we stick purely to local clicklogs?

Another question is whether we stick to local clicklogs, or integrate a gossip mechanism for passing around updated clicklogs periodically, since eventually the local clicklog becomes saturated with just the results for a specific node, meaning it doesn't learn from the behavior of other users. I certainly find this interesting, but it would involve "assuming" some sort of gossip mechanism, or it would require me to implement one.

https://github.com/awrgold/distributed-learn-to-rank/blob/main/Building%20v3.ipynb

synctext commented 1 year ago
awrgold commented 1 year ago

Good news: I have graphs. Bad news: they don't look pretty.

I spent the last week fixing a few critical errors in logic I made a while ago, and then finally early this week shifted towards making the graphs. I can show the distance between the globally optimal ranking and the local results a node sees when it makes a specific query, but the distance is wildly volatile.

The reason why it's volatile is because I'm only considering the distance between items that SHOULD be on the list (those that appear in the globally optimal ranking) and those that ARE on the list (those that a node sees when he performs a search). Any item that a node sees that shouldn't be there is ignored (because the distance is technically infinity).

For example: node 1 searches for jazz at time T and gets the following songIDs as a result: [1, 2, 3, 4, 5]. Globally speaking, at time T the optimal ranking for jazz is: [1, 7, 93, 95, 5]. Because the intersection of these two rankings is [1, 5], I can only see the distance of indices for these two results. Therefore, the distance is [0, 0] = 0, which indicates a "perfect ranking" when in fact there were 3 erroneous results given to the node ([2, 3, 4]).
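A small sketch of this intersection-only distance metric, reproducing the example above (illustrative code, not the notebook implementation):

```python
def rank_distance(local_ranking, global_ranking):
    """Distance metric as described above: only items that appear in BOTH
    rankings contribute, via the absolute difference of their positions.
    Items shown to the node that are absent from the global optimum are ignored."""
    shared = set(local_ranking) & set(global_ranking)
    if not shared:
        return None  # nothing to compare
    return sum(abs(local_ranking.index(i) - global_ranking.index(i)) for i in shared)

local_results  = [1, 2, 3, 4, 5]    # what node 1 sees for "jazz" at time T
global_optimum = [1, 7, 93, 95, 5]  # the globally optimal ranking at time T
print(rank_distance(local_results, global_optimum))  # 0, despite 3 erroneous results
```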

So what my graphs show is a pretty volatile result. When a user performs a search for a term and chooses a completely random result: image

When a user performs a search for a term and has a 50% chance of choosing the top result, and a 50% chance of choosing another random result: image

So it's clear that I need to rethink how this metric is being calculated, and what it is exactly that I'm trying to investigate.

Edit: mind you, the reason why it is so volatile is likely because (1) I am not gossiping local clicklogs to other nodes yet, and (2) the search algorithm is gathering results from other nodes without exception, meaning that I have a local clicklog with global access to all songs in the network.

Edit2: This is what the global clicklog looks like when a specific term has been searched for.

image

synctext commented 1 year ago
awrgold commented 1 year ago

image

When we start with 32 nodes and perform 100 random searches, the number of nodes whose library contains the most popular song for each tag (assuming that when they click on a song, they add it to their library) is shown above.

When we allow gossiping, the number of most popular songs starts to grow, but of course the number of nodes with whom we gossip directly influences this.

Where each node regularly gossips their clicklog with exactly 1 other node: image

Where each node regularly gossips their clicklog with 5 other nodes: image

When we increase the number of searches from 100 to 500, we see a much larger growth in the number of nodes containing the most popular song for each tag: image

What we see is that after a large enough number of searches, the graph begins to even out and creep towards all 32 nodes having seen the most popular song for each tag. 5000 searches with 5 gossip targets per node:

image


However, when we start growing the network, so long as we don't increase the # of gossip targets proportionally with the network growth, we see a slower convergence. 100 nodes with 1000 searches, each with 5 random gossip targets per gossip round: image

This is where a few obvious things start coming into play:

This is my next TODO, besides for writing.

synctext commented 1 year ago

Related work: {feel free to ignore these network protocol experiments, for ML experts} The prior cycle of decentralised research produced many ideas and publications. In 2008, Cornell proposed a decentralised approximate keyword search algorithm with some naive measures against spam, poisoning, and Sybils. Relevance ranking was out of scope in those years (2008). image The above paper introduced the concept of progress ratio. The next experiment explores a similar concept. We vary on the X-axis the spread of the ClickLog in a stationary environment. We then determine the effectiveness of searches for various system loads of ClickLog spreading. In a network with 1000 participants all conducting active searches concurrently, we can estimate that a single ClickLog update consists of a 1 KByte cryptographically signed message. 1000 people x (1000-1 receivers) x 10 queries x 1 KByte = 10 GByte load for full real-time spreading without duplicate messaging.
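The back-of-the-envelope load estimate, written out as a quick sketch using the numbers above:

```python
# Load for full real-time ClickLog spreading without duplicate suppression,
# using the figures from the comment above.
participants = 1000
receivers = participants - 1     # everyone except the sender
queries_per_person = 10
message_kbyte = 1                # one cryptographically signed ClickLog update

total_kbyte = participants * receivers * queries_per_person * message_kbyte
print(f"{total_kbyte / 1e6:.1f} GByte")  # ~10.0 GByte
```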

awrgold commented 1 year ago

I just remembered, I still need to do a literature survey for the 10 ECTS... How will we handle this?

awrgold commented 1 year ago

We vary on the X-axis the spread of the ClickLog in a stationary environment.

In this case, do you mean that our X-axis is the size of the set of gossip targets and/or the rate of gossip/broadcast?

We then determine the effectiveness of searches for various systems loads of ClickLog spreadings.

"Effectiveness of searches" is a bit vague to me here. Based on the following example, I almost get the feeling that you're interested in seeing the rate of growth in file size of the clicklog as it propagates throughout the network. However, you also mention "system loads" which could mean a wide variety of things. Could you clarify this second point?

Also: https://www.overleaf.com/read/wjydsnzwtytb

awrgold commented 1 year ago

Please finish the initial thesis experiment with a trivial-by-design graph

At this point I just need to determine what kind of click modeling I'm going to do, if any. We have the "lowest songID tiebreaker" but if there's no tie, it will "flip a coin" where it chooses the top result 50% of the time, and a random result in the list 50% of the time, to add some variance.

We need to decide if this is sufficient, or how we want to model this further. If we wanted to go further into the statistics, we could use a skewed sampling method for weighting songs with lower songID's more heavily, so that we can demonstrate that by gossiping the clicklog the most popular songs across all devices start to converge towards a global truth.

I could contrast this with a uniform random sampling that shows that if users choose returned results randomly, over a sufficiently large number of queries we will still see a convergence towards the most popular songs. Mind you, the rate of convergence depends entirely upon how many results per query are shown to a user. The higher the number of results shown, the longer it will take for results to converge across nodes.

The question, then, is what it is exactly we want to demonstrate at the end of the day.

awrgold commented 1 year ago

image

The median % of nodes containing the most popular song per tag, over time. Right now, gossip rounds occur every 100 queries (globally) and each node gossips with 10 nodes per gossip round.

As we can see this grows logarithmically. (This is for a 100-node network)

awrgold commented 1 year ago

Updated notebook: https://github.com/awrgold/distributed-learn-to-rank/blob/main/Testing%20v1.1%20-%20Expanded%20Clicklog.ipynb

synctext commented 1 year ago
awrgold commented 1 year ago

image

Result for 1000 nodes. As the network grows large, the growth rate slows dramatically.

awrgold commented 1 year ago

Thesis Timeline

I was told to toot my own horn in public here

awrgold commented 1 year ago

Another thesis milestone: you have a 1000-people simulated graph. That is what you need for graduation level. 10k in-memory is obviously cooler :-)

  • wrap up with: new/enhanced/tweaked algorithm (compare with self-build basic algorithm)

I'm working on an updated/enhanced version of the algorithm, particularly focusing on utilizing a ranking scheme based on a distance metric between users (based purely on known clicklog entries).

I'm currently in the final stages of building a Jaccard distance-ranking scheme where a node looks at its clicklog to determine which nodes have the highest similarity of searches and clicks to itself (utilizing the Jaccard distance).

I am standardizing the distance metric by dividing each node's relative score by the maximum score in the list, so the "most-similar" user to another has a score of 1.0, and the least similar 0.0.

This allows nodes to increase/decrease in relative similarity score as clicklogs grow larger, without having an ever-growing distance metric.
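A minimal sketch of this normalised Jaccard scoring (illustrative names, operating on bare item sets for simplicity):

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity between two sets of clicked items."""
    if not items_a and not items_b:
        return 0.0
    return len(items_a & items_b) / len(items_a | items_b)

def normalised_similarity_scores(own_items, items_per_node):
    """Score every known node against our own clicked items, then divide by
    the maximum so the most similar node gets exactly 1.0."""
    raw = {node: jaccard_similarity(own_items, items)
           for node, items in items_per_node.items()}
    top = max(raw.values()) or 1.0  # avoid division by zero when all scores are 0
    return {node: score / top for node, score in raw.items()}

own = {1, 2, 3, 4}
others = {10: {1, 2, 3}, 11: {3, 4, 5, 6}, 12: {7, 8}}
print(normalised_similarity_scores(own, others))
# {10: 1.0, 11: 0.44..., 12: 0.0} -- scores stay bounded as clicklogs grow
```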


We have a meeting on Monday, but the last week has been pretty uneventful for me. I did a solid programming sprint on that uber-long flight from London to Phoenix, but besides for that the last week has been relatively low productivity. Next week I have a few days before Thanksgiving, and then it will be another long 4-day weekend.

After Thanksgiving (28th November) I'll be back to working full-time on this. I won't be working at the startup much anymore so it'll be a true full-time sprint from here on out.

synctext commented 1 year ago

Reviewing your draft Master thesis on Overleaf

In many such settings, the parameters
of the model (e.g. the weights and biases of a trained neural
network) are often passed via messages between nodes in the
network, training on the local data before passing the model
parameters to the next node for further training.

An example of this related work: Backdoor Defense in Federated Learning Using Differential Testing and Outlier Detection and cosine similarity + reputation: CONTRA: Defending against Poisoning Attacks in Federated Learning

@devos50 insight: this is not one-shot learning, it's continuous learning. Related work: Lifelong machine learning: a paradigm for continuous learning. Emphasise that your work is light-weight, or cite smartphone-related work: Cost-Efficient Continuous Edge Learning for Artificial Intelligence of Things. This suggests the following master thesis title: P2P search engine using continuous learning for edge devices. {with extreme decentralisation: P2P Search Engine based on Extreme Decentralisation and Continuous Learning.}

Two week sprint focus: finish writing of first thesis sections (FINAL thesis quality writing please)

awrgold commented 1 year ago

Two week sprint focus: finish writing of first thesis sections (FINAL thesis quality writing please)

Did one of your recent edits remove the changes that we wanted to see in the first two sections? I remember seeing some bullet points regarding changes (I see the one regarding the problem description), but I remember during our meeting that we had decided to focus more on the decentralized unsupervised aspect of everything.

Were there any other points besides the problem description re-write?

awrgold commented 1 year ago

image

You might like this graphic: across all nodes, for every gossip round, I looked at the mean and median distances between the global best ranking and the rankings each node would get for each query term.

Shows that after just a few gossip rounds most nodes are very close to the best possible ranking for all query terms.

I did a scatter plot of the same metrics, but instead of just the mean and median, I plotted the average distance of each node's searches from the global optimum:

image

EDIT: Here's the same data, but instead a random subset of 10% of the nodes gossip to a random subset of 50% of the network (more of a multicast/broadcast than a gossip, I know...) whereas the above graph is where 10% of the nodes gossip to 10% of the network.

image

Analysis also showed that the outlier in the first graph was a node that only received one round of gossip at the very beginning (bad luck I guess) and therefore was not able to improve its ranking scores.

Also, what about this format? I need to find a better title because the current one isn't exactly descriptive of what the data says, but I like how we can show the progression over time:

image

awrgold commented 1 year ago

Besides for writing, the only thing I'm working on code-wise (besides visualizations) is trying to determine why the outliers are not converging over time.

My theory is this: my simulation environment is 100 nodes and 10,000 queries, and each gossip round selects 10% of the nodes in the network uniformly at random. Because a few terms have fewer queries (purely by statistical chance) and a few nodes receive less gossip than others, the clicklogs containing these queries are not shared as frequently.

One of the questions I have for our upcoming meeting involves how we want to discuss the influence of gossip on this model. Do we want to have various simulations with varying gossip parameters? The size and completeness of local vs. global clicklogs has a very strong influence on the performance of the model.

All of my simulations so far have been gossiping to a static number of nodes. I have a good feeling that I can improve results quite a bit by one or both of the following:

  1. (trivial) Increasing the number of nodes per gossip round
  2. (relatively easy but not trivial) All nodes currently have a list of nodes that they're "aware of," which plays a part in how they're being recommended results. I can leverage this list of nodes that they're aware of to ensure that they gossip with some nodes they are familiar with, as well as passing along messages to other nodes they are unfamiliar with.

Option (2) will lead to a more rapid dissemination of clicklog data in early gossip rounds, at the expense of early-round clustering of users, meaning in early stages users will have their search results more heavily influenced by the overall number of clicks an item receives, with the drawback being that user clusters will take more gossip rounds to coalesce. The question is, how deep into gossip do we want to go?
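A sketch of what option (2) could look like (the fanout split and the simulator-level knowledge of the full node list are assumptions for illustration):

```python
import random

def pick_gossip_targets(known_nodes, all_nodes, fanout=5, familiar_fraction=0.6):
    """Spend part of the gossip fanout on nodes we already know from our
    clicklog, and the rest on nodes we have never interacted with, so
    clicklog data disseminates faster in early rounds."""
    known = set(known_nodes)
    unknown = [n for n in all_nodes if n not in known]
    n_familiar = min(len(known), int(fanout * familiar_fraction))
    targets = random.sample(sorted(known), n_familiar)
    targets += random.sample(unknown, min(len(unknown), fanout - n_familiar))
    return targets

# all_nodes here stands in for the simulator's global node list.
print(pick_gossip_targets(known_nodes={1, 2, 3}, all_nodes=range(20)))
```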

synctext commented 1 year ago
awrgold commented 1 year ago

This idea is a huge attack vector.

If we decide to drop this, then we're back to where we were about a month ago. If I'm to improve upon what I've had this whole time, I need to be able to derive some sort of insight from the clicklog beyond just clicks.

Is the clicklog itself not just the same exact vector? I understand that sybil attacks are a thing but even if I weren't doing this, we'd still have the same exact vector in place: i.e. sybil attack via clicklog saturation, replay attacks, altering gossip messages, etc?

I feel like my model has quite a bit of vulnerability to sybil attack in the first place, so I had kind of assumed we would be abstracting that problem away and then utilizing other mechanisms outside the model itself to mitigate such threats...

Title suggestions: P2P Search engine without any training dataset

The first one is just a longer way of saying unsupervised learning; I would keep it as "unsupervised learning".

P2P search engine using continuous learning for edge devices

Continuous learning, sure. Edge device seems like a non-personal device, though, no? Though I guess by definition it exists at the boundary of a network, so yeah.

Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network

How does that sound?

synctext commented 1 year ago

the winner is! Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network.

Just show in 1 or 2 graphs the impact of Sybils and fake ClickLogs. Conclusion: the trust problem needs solving. Measure CPU somehow to show that you have a light algorithm, as simple as possible. Just laptop graphs are fine too.

synctext commented 1 year ago

Comments on this latest thesis: LTR_Thesis_v1 (1).pdf. Plus, 3 weeks later, an update: LTR_Thesis_v1 (3).pdf. Below is a hopefully strictly decreasing ToDo list until master thesis completion.

awrgold commented 1 year ago

This is the performance if every single gossip round includes 5 sybil nodes performing a large number of queries and then broadcasting their clicklog to the entire network. Every gossip round introduces 5 new sybil nodes, so the number of attackers increases over time. I'm working on a parallel sim of a fixed number of sybil attackers, as well as a sim that has gaps between sybil attacks, to see if performance can improve in between sybil attacks.

If we assume an attacker has the capability of finding every node in the network, this is the worst case performance. If we assume that broadcasting to the entire network is impossible (since nodes do not forward gossip, which I guess means we should call it a multicast and not a gossip message, right?) and that nodes can only send messages to a select subset via some messaging protocol, then we should see less shitty performance.

If we increase the number of rounds between spam attacks, such that they're more sporadic, I have a feeling that we'll see the model rebound as the distance between spam attacker behavior and normal user behavior starts to diverge.

image

image

awrgold commented 1 year ago

Alright, so I fixed a few errors, a key one being that I was computing the "global optimal ranking" while including the sybil nodes. If I remove sybil nodes from consideration of what "global optimal ranking" means, the results dramatically improve.

What I also noticed is that in my previous graphics, those were a demonstration of what would happen if attackers were capable of broadcasting their gossip to the network.

However, if gossip were performed more "securely" e.g. nodes cannot pre-select themselves to be included in gossip rounds, and nodes cannot broadcast but must multi-cast to a subset of the network, then the threat of sybil and spam attacks becomes significantly reduced.

That being said, this merits at least a brief discussion on the topic I believe. I know that there are various ways of conducting "secure gossip" without a centralized coordinator, but how much I want to mention this in the thesis is something we should discuss.

Anyways, in the scenario where sybil or spam attackers can flood their local clicklogs or purposefully click on low-ranked results to attempt to undermine the algorithm, and then are forced to wait until a proper gossip round occurs in order to disseminate their infected clicklogs, the performance naturally is much more stable and predictable.

image

Note: I'll include a vertical line on the graph to indicate when infected gossip gets transmitted.

synctext commented 1 year ago

However, if gossip were performed more "securely" e.g. nodes cannot pre-select themselves to be included in gossip rounds

{brainstorm no need to deep dive into gossip, just a suggestion if this matches your results}

Push versus pull experiment

After a validation and a deliberately simple experiment, we now introduce attackers. Within this third experiment we introduce a number of malicious peers which conduct a Sybil attack. This experiment shows the dramatic impact of push versus pull gossip. It proves that it is vital for security that malicious peers cannot easily insert their polluting content at honest peers, as a push architecture allows. With a pull architecture, peers are more autonomous and decide individually on the speed of incoming information, whether they trust another peer, or may randomly sample from discovered peers. Malicious peers in this experiment try to push two messages per gossip round. With the pull architecture, only 1 message per gossip round is accepted. As Internet bandwidth is cheap, this simple experiment shows that a first line of defence against ClickLog spam is a pull architecture.
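A minimal sketch contrasting the two architectures in terms of accepted messages per round (illustrative only, not the thesis simulation code):

```python
import random

def push_round(peers, malicious, messages_per_peer=2):
    """Push: the sender (possibly a Sybil) decides who receives its ClickLog,
    so each malicious peer injects two polluting messages per round unchecked."""
    deliveries = []
    for sender in malicious:
        targets = random.sample([p for p in peers if p != sender], messages_per_peer)
        deliveries += [(sender, t) for t in targets]
    return deliveries

def pull_round(peers):
    """Pull: each peer decides for itself whom to ask and accepts at most
    one incoming ClickLog per round, capping the spam rate."""
    requests = []
    for peer in peers:
        source = random.choice([p for p in peers if p != peer])
        requests.append((peer, source))  # exactly one accepted message per peer
    return requests

peers = list(range(10))
print(len(push_round(peers, malicious=[8, 9])))  # 4 pushed (potentially polluted) messages
print(len(pull_round(peers)))                    # 10 pulls, 1 accepted message per peer
```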

G-Rank related theory work: Decentralized Stochastic Optimization with Client Sampling, EPFL

In this work, we study Decentralized Stochastic Gradient Descent (D-SGD) with node subsampling, i.e.
when only s (s ≤ n) nodes are randomly sampled out of n nodes per iteration. We provide the
theoretical convergence rates in smooth (convex and non-convex) problems with heterogeneous
(non-identically distributed data) functions.

Sprint till 29 December: no new experiments. Focus on polishing, polishing, and polishing. Not a tutorial, a specification. Reproducibility is still 0; a master student in your field should be able to make roughly the same graphs.

awrgold commented 1 year ago

So I have been thinking about how the simulation is "bootstrapped." A while ago we decided that Node N would perform a search before the simulation and append it to its clicklog, followed by appending the clicklog of Node N-1, such that Node 99 has the clicklogs of every node and Node 0 has only its own clicklog.

I think that at this stage, this is a bad way to bootstrap because what is essentially happening now (since I've removed all central coordination) is that nodes are gossiping to nodes they've received gossip from in the past, since the only knowledge of other nodes that they have is the clicklog itself. This leads to the situation where Node 99 is able to gossip to anybody, but lower-ID nodes can barely gossip at all until they (hopefully) receive some gossip from a more knowledgeable node.

Is this a good idea?

I was thinking that I could bootstrap the network where all nodes start off with a roughly even-sized (randomly initialized) clicklog where the nodeID is randomly initialized, so that all nodes have a roughly even-sized knowledge base of existing nodes, which means that gossip will be much more uniformly distributed.

What do you think @synctext?

synctext commented 1 year ago

Good find, interesting content to discuss within your thesis! Within the Section dataset and experimental setup

Discovery bias experiment

Finally, we conduct a small experiment to demonstrate the sensitivity of the node discovery process within distributed machine learning. Realistic simulations lack any centrality and thus have no central coordinator to discover other nodes. Our design integrates node discovery using the ClickLog itself! Thus a single ClickLog message provides both overlay network information for gossip and machine learning training data. The following measurement results demonstrate how our simulation is bootstrapped. Node 0 performs the first search, the start of the simulation. The ClickLog is generated and sent to Node 1. The incoming ClickLog at Node 1 is processed and leads to the discovery of Node 0. The next query is.... The following graph shows for each of the 100 Nodes (listed on X-Axis) the amount of nodes (Y-Axis) they are aware of after the first 100 rounds of bootstrap gossip. This result shows that this deliberately simple bootstrap method leads to a highly skewed distribution in the number of gossip neighbors. Several more rounds of random gossip are required to obtain a healthy and balanced gossip overlay.

awrgold commented 1 year ago

The following graph shows for each of the 100 Nodes (listed on X-Axis) the amount of nodes (Y-Axis) they are aware of after the first 100 rounds of bootstrap gossip.

This will just be a 45-degree line from bottom left to top right; is that worth including, do you think?

Several more rounds of random gossip are required to obtain a healthy and balanced gossip overlay.

Do we perform this gossip or just mention it?

synctext commented 1 year ago

Yes, a clarifying 45-degree line. It proves a point and validates the environment. If it looks nice, then also show how long the required warm-up period is for a balanced overlay.

awrgold commented 1 year ago

So I just realized another major issue: if we instantiate malicious nodes after the network has been bootstrapped, they will never discover other nodes as they will never receive gossip.

Right now, I have no "node discovery" mechanism, so the gossip really isn't "gossip," it's just a multicast. When a node receives gossip, it does not forward it along to other nodes.

As such, both benign and malicious nodes can join the network but will never receive any gossip because they do not appear in the clicklogs of other nodes. Therefore I wonder if it's a good idea to have a "bootstrap" mechanism for new nodes in this network, because right now we're unable to simulate any attacks. How does Tribler peer discovery work? Should we integrate that specific mechanism, or should I use the basic Gnutella peer discovery bootstrap mechanism (a set list of "home" nodes given to each new node that joins the network)?

synctext commented 1 year ago

Let's see.... yeah. Do a simple decentralised "ring discovery". Mention this mechanism and its name explicitly in your thesis. Every node obtains a unique node number upon bootstrapping (e.g. an identity). It only knows about 1 other node. Node 0 automagically is introduced to node 1. -plus- Same construct for new spam nodes arriving later.
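A minimal sketch of such a ring bootstrap (illustrative; the direction of the introduction, node k knowing node k+1, is an assumption):

```python
def ring_bootstrap(n_honest, n_spam=0):
    """Every node gets a sequential ID and initially knows exactly one other
    node, its successor on the ring; spam nodes arriving later follow the
    same rule and so also start with a single neighbour."""
    total = n_honest + n_spam
    return {node: [(node + 1) % total] for node in range(total)}

print(ring_bootstrap(5))
# {0: [1], 1: [2], 2: [3], 3: [4], 4: [0]}

overlay = ring_bootstrap(5, n_spam=2)
print({node: overlay[node] for node in (5, 6)})  # spam nodes also know exactly one node
```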

awrgold commented 1 year ago

Same construct for new spam nodes arriving later.

We need to be careful here, though - I don't think it's realistic to say that spamNode1 gets bootstrapped by spamNode0 because a realistic network would need its own mechanism for keeping track of the chronological order in which nodes join the network, so saying that Node N bootstraps Node N+1 would be unrealistic.

Instead, I propose using something similar to Gnutella's GWebCache system where it tries to bootstrap a node with some semi-randomly selected neighbors that the network is already aware of. (afaik there was considerable debate about GWebCache's security and effectiveness, but I digress...)

I propose that we bootstrap the initial network with the Ring Discovery method you described, but for the attackers a random node is selected to bootstrap a new incoming node with either a subset or entirety of its clicklog. If we use the sequential node.nodeID as the mechanism for bootstrapping, nodes 0-N will be bootstrapped by 1 other peer, but then all spam nodes will be bootstrapped by each other, with the exception of spamNode0 who gets bootstrapped by node N.

This essentially means that all spam nodes will be bootstrapped with the exact same clicklog. The exception is when a spamNode joins the network, begins spamming, and then another spamNode joins: that newcomer gets bootstrapped with all of spamNode K-1's clicklog, which is almost entirely spam, and it therefore has only one potential non-spam target to spam (node N).

Does that make sense?

awrgold commented 1 year ago

I need some clarification as well: last meeting we discussed changes to the "Experimentation" section, where we said we'd have a separate section describing the dataset.

However, my notes from the meeting are ambiguous: I have written down "Dataset and Experimental Setup: Completely different section" as an item. Does this mean that "Dataset" is a section entirely on its own, and then "Experimental Setup" is also its own section? Or do we have a "Dataset and Experimental Setup" section?

If it's the former, the Dataset section I have is very very short - one single paragraph, and I can't make it longer without just writing filler bullshit.

If it's the latter, will the progression of the thesis paper go "Dataset and Experimental Setup" -> "Results" -> "Conclusion" or will it go "Dataset and Experimental Setup" -> (something else) -> "Results" -> "Conclusion"?

Right now I feel like it should be "Dataset and Experimental Setup" -> "Results" -> "Conclusion" but then that means that nothing has changed, except the name of the section.

synctext commented 1 year ago

See all my proposed sections: https://github.com/Tribler/tribler/issues/5313#issuecomment-1342886464

Good point on the ring bootstrap issue. Feel free to describe something smarter and more realistic. But I would do that as a bonus, after all sections are at final thesis polish.

awrgold commented 1 year ago

So big news - I discovered a majorly critical error in judgment that I had been making:

I had been telling each node to keep track of its gossip progress so that whenever it came time to gossip again, it wouldn't gossip its whole clicklog - just the parts since the last time it gossiped.

However, the way I had implemented it meant that if a node was gossiping to 5 other nodes, the first of the 5 would get a full gossip message and the other 4 would get nothing (since the progress marker was already updated after the first send). I changed this such that all nodes gossip their entire clicklog in each gossip message (messages are still small at later stages of the simulation; huge clicklogs are still under 500KB).
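A minimal sketch of the described bug and the adopted fix (illustrative code, not the notebook implementation):

```python
def gossip_delta_buggy(clicklog, targets, last_sent_index):
    """The buggy behaviour described above: the 'already gossiped' pointer is
    advanced after the FIRST target, so targets 2..N receive an empty delta."""
    sent = {}
    for target in targets:
        sent[target] = clicklog[last_sent_index:]
        last_sent_index = len(clicklog)  # bug: updated inside the loop
    return sent

def gossip_full(clicklog, targets):
    """The fix actually adopted: every target receives the entire clicklog."""
    return {target: list(clicklog) for target in targets}

clicklog = ["entry1", "entry2", "entry3"]
print(gossip_delta_buggy(clicklog, targets=["A", "B", "C"], last_sent_index=0))
# {'A': ['entry1', 'entry2', 'entry3'], 'B': [], 'C': []}
print(gossip_full(clicklog, targets=["A", "B", "C"]))
```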

By changing this, the accuracy and tenacity of the model becomes significantly better, even after an attack. Now that nodes are properly sharing their clicklogs, the model is much more robust at delivering accurate results.

However, once the spam nodes arrive, the simulation time becomes intractable. I went from a 10h full simulation to (estimated) 200h simulation. I canceled that simulation after about 11 hours, because once the spam attack happened, it would take roughly 10 minutes per query. This tells me that something regarding mass gossip with full clicklogs is extremely inefficient, even when I cast to lower-level numpy arrays to speed things up.

As such, I'm thinking of a few compromises:

Thoughts?

(And yes, I know you want to focus on writing, but I was running simulations in the background while writing and noticed some major errors (the model getting progressively worse over time), which made me do a bit of debugging, where I realized my mistake - now we're back to running sims while writing.)