Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

literature survey + master thesis: G-Rank learn-to-rank #5313

Closed synctext closed 1 year ago

synctext commented 4 years ago

Direction changed, txt will be updated soon.

Old stuff:

awrgold commented 1 year ago

Here's a clarification I feel is necessary:

Item-based collaborative filtering (as opposed to user-based CF) is a method to estimate the rating a user would give a previously unseen item, based on the ratings of items that user has already seen. However, as most of us know, we don't rate songs in our library (at least not in Spotify/Apple Music); we just add them to our playlists and sometimes "like" them. This isn't even a clean binary rating system (liked vs. not liked), as it's entirely reasonable to assume that the average person has songs in their library that they very much enjoy but have never "liked." I still have thousands of songs I listen to that I haven't given a "like" to.

How then does one determine a rating system for songs? Ratings might be more feasible for services such as Amazon, Uber, Thuisbezorgd, etc. because it's a more natural response to rate something tangible after consuming it, especially in the context that each item/service is consumed infrequently.

As such, my thinking is this:

It adds no complexity, but to me it seems like a much more logical way to determine popularity. However, the frequency with which songs are listened to varies dramatically, and this frequency does not always indicate how much a user actually ENJOYS the song. Spotify assumes I absolutely love heavy metal and hard rock because I listen to it when I exercise, and I exercise very often. However, I almost never listen to it outside the context of exercise, and I tend to ignore most of Spotify's suggestions for new songs because I like this specific playlist as it is.

The data scientist in me is starting to think of various latent features that can be extracted from things like "is this song in a playlist? How many playlists is it in? How similar are these playlists?" but this is not the point of the thesis.

So to answer your comment about adding a "not clicked log" I think this is a reasonable response.

Edit1: I said "replace the clicklog with a playlog" and that was incorrect. Instead, I wish to create a rating system based on an estimate of the number of times a user will play a song; over time, a clicklog will augment and rank the set of items shown to a user during a query.

synctext commented 1 year ago

Item-item collaborative filtering requires some form of scoring/rating function per user;

Could you please explain your thinking in learn-to-rank terminology? What and how are you cherry-picking from collaborative filtering for learn-to-rank? It seems to me that you're doing some sort of "personalised recommendation, given the user search query"? That part is unclear to me.

How then does one determine a rating system for songs?

Can we drop ratings for your learn-to-rank?

I think we're getting somewhere, after re-reading your comments a few times. You've zoomed in on the item-to-item approach. Are you building an item-based, user-based, or "keyword-based" learn-to-rank algorithm, inspired by collaborative filtering? See this blog for how I'm thinking about this: https://medium.com/@cfpinela/recommender-systems-user-based-and-item-based-collaborative-filtering-5d5f375a127f With your Spotify examples you seem to desire meaningful results. Please make a non-meaningful approach for your "click modelling" (the correct scientific term, it seems). For instance, the user always selects the song which matches the number of characters of the user query. So: a toy example .py notebook, a tiny artificial dataset, deliberately non-meaningful results, and unrealistic decentralisation middleware. This as-simple-as-possible approach gets to the core science without any noise (yes, that is totally weird).

awrgold commented 1 year ago

Could you please explain your thinking based on learn-to-rank terminology?

This is where I'm stuck the most, as I do recognize I'm focusing more on "recommendation" than "ranking," at least at face value. At its heart, I'd say the ranking problem is a form of recommendation, just for a series of items based on a score. The additional variable is that we're recommending a subset of items, in some particular order, based on a specific search term.

When it comes to LTR, let's say User 0 is the very first person to search for anything after the model goes live. Because the model is unsupervised, it is completely untrained to RANK at this stage (but it can recommend), meaning the results it obtains will be purely based on the songs in this user's library (if there are any) compared with the songs other users have in their libraries.

For either option, we can then use CF to estimate what User 0 would "rate" every song in the global dataset, and rank it in order from highest to lowest. Pruning based on text filters, string matching, whatever - I won't focus on that for now.

The next part is where the real Learning to Rank comes into play.

As more and more users start searching for songs, these "ratings" (which in this simple case are just the estimated number of times a user would have played each song, thereby estimating its popularity with the user) will be associated with a clicklog. Based on the list of songs we have recommended to a user conducting a search, the song that they click on will have its clicklog updated, which will act as a weight coefficient for the chosen song.

synctext commented 1 year ago

When it comes to LTR, let's say User 0 is the very first person to search for anything after the model goes live.

Please ignore such bootstrap effects completely for your master thesis! Assume we have a million users (e.g. Google-scale emergence). Just "recommend" something purely random and learn from a ClickLog.

awrgold commented 1 year ago

Please ignore this completely for your master thesis! Assume we have a million users (e.g. Google scale emergence) Just "recommend" something purely random and learn from a ClickLog.

When it comes to an unsupervised model (at least the one I'm chewing on), the history or sparsity of the dataset does not matter at all; the algorithm remains unchanged whether there is 1 example or 1 billion examples. I'm just trying to demonstrate the transformation the search results will undergo as more and more searches are conducted.

We can of course just use a synthetic clicklog (such as the 250-song clicklog you gave me a while ago), but the algorithm will not change. The User 0 example is more of an extreme case in which the rating system is completely uninfluenced by a clicklog (e.g. the coefficient for all variables is 1), such that the ranking is based purely on the CF score, and not the CF score * clicklog.

synctext commented 1 year ago

Based on the list of songs we have recommended to a user conducting a search, the song that they click on will have its clicklog updated, which will act as a weight coefficient for the chosen song.

Hopefully that was just a quick writing style. Songs don't have clicklogs. Clicklogs are never 'updated'; they continuously grow. The list of {user_ID, search_query, item_ID_clicked} entries just grows in time with new entries. Or did you use a different (e.g. matrix) data structure?
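For reference, a minimal sketch of such an append-only ClickLog structure (illustrative Python only; the class and field names are assumptions, not the actual thesis code):

```python
# Minimal sketch of an append-only ClickLog, assuming the
# {user_ID, search_query, item_ID_clicked} entry format described above.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ClickEntry:
    user_id: int
    query: str
    item_id: int

@dataclass
class ClickLog:
    entries: list = field(default_factory=list)

    def record(self, user_id: int, query: str, item_id: int) -> None:
        # Entries are only ever appended; nothing is updated or removed.
        self.entries.append(ClickEntry(user_id, query, item_id))

log = ClickLog()
log.record(user_id=0, query="post.metal", item_id=42)
log.record(user_id=1, query="instrumental", item_id=42)
print(len(log.entries))  # 2 -- the log only grows over time
```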

I'm just trying to demonstrate the transformation the search results will undergo as more and more searches are conducted.

Sounds like very useful work! Sorry, I'm still unable to grasp exactly what you are doing. But that's fine for now.

as noted 18 days ago

Please do a 1-page problem description + design write-up. Should be in master thesis writing level.

The aim is that this clarifies matters. Guess you want to focus on getting the first skeleton code to work and obtain the milestone of a first graph.

awrgold commented 1 year ago

Or did you use a different (e.g. matrix) data structure?

I was going to utilize a global clicklog for all queries and clicks

awrgold commented 1 year ago

Alright so after several weeks away I've been thinking about my approach so far:

Item-based collaborative filtering is not the way I should be approaching this topic. Neither is the clicklog (on its own). Too much of this problem comes down to natural language processing and direct search methodology, where most of my time will be spent matching queries with attribute values.

If this is going to be approached from an unsupervised angle, we need to be able to determine rankings based on the individual doing the searching.

First things first, I was reading a (10+ year old) article about Tribler stating that the BitTorrent client had some kind of search mechanism, and that users could be recommended new content from the network once there were enough clients running the software, but beyond that I was unable to find any more details. How exactly does/did this work?

Secondly, we need to decide if there is going to be anything beyond basic string matching to pare down the query results. If I stick to pure string matches, I can focus entirely on ranking the results based on user attributes such as library contents, search history, etc.

Third, I'm going to either augment a current dataset or find/create a new one so that I can have slightly more variation in my search results.

Nobody seems to be doing exactly what we're doing, i.e. they may be doing unsupervised ranking or decentralized search and recommendation, but they're not doing unsupervised decentralized ranking.

synctext commented 1 year ago
awrgold commented 1 year ago

Quick update:

I now have a clicklog model and simulation that will (mostly) do the following:

I say "mostly" because I'm dealing with a few bugs that are eluding me, in particular if a node performs the same search twice in my simulation, it does not update the clicklog. Trying to figure that one out right now...

synctext commented 1 year ago

Related work from ancient history: 2012, Distributed Machine Learning Using the Tribler Platform, by an MSc student of Mark Jelasity.

awrgold commented 1 year ago

Alright, so now everything mentioned above is working.

However, there's one primary issue: My simulation is uniform randomly choosing a result from a top-10 list of results, which essentially means the simulation in its current form can never "converge" on the "correct" item(s).

Reminder that the entire corpus from which I'm randomly choosing search terms is built from words and terms contained in the list of songs, so there is always at least one "match" for each query.

The clicklog, in its current form, starts in an empty state with no results whatsoever, in which case the simulation returns 10 random songs with no string matching at all, so the initial state of the very first query-results pair seen by the clicklog strongly influences which items each subsequent search has to choose from.

YES, I remember that you said "leave the cold start problem to others" but building this algorithm was the first step towards building a synthetic clicklog with past results.

The next step of creating a synthetic clicklog with past results will either be tedious manual work, or I will have to use some kind of semi-stochastic process to generate "realistic" clicks, e.g. if the search query term is "Bohemian Rhapsody" and the top-N results don't contain any Queen songs or really any 70's/80's rock music, the user is unlikely to click anything.
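A minimal sketch of what such a semi-stochastic click generator could look like (the tag fields and the click probability are assumptions for illustration, not the actual simulation code):

```python
import random

def simulate_click(query_terms, results, p_click_relevant=0.8):
    """Toy 'semi-stochastic' click model: the simulated user clicks only if
    at least one result shares a term with the query, and even then skips
    the click with some probability to mimic human indifference."""
    relevant = [r for r in results if query_terms & set(r["terms"])]
    if not relevant:
        # e.g. a "Bohemian Rhapsody" query with no Queen or 70s/80s rock results
        return None
    if random.random() > p_click_relevant:
        return None  # the user was not in the mood to click anything
    return random.choice(relevant)

results = [
    {"song_id": 1, "terms": {"queen", "rock", "70s"}},
    {"song_id": 2, "terms": {"jazz", "piano"}},
]
print(simulate_click({"rock", "bohemian"}, results))
```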

Simulating this human element is going to be the next major challenge. We can discuss next week, but I'll keep working for now on a clicklog that has past results, so that we can see some kind of convergence towards "correct" results.

https://github.com/awrgold/distributed-learn-to-rank-v1/blob/main/Building%20v2.ipynb

synctext commented 1 year ago

Expand operational scrape code to build tag corpus.

{repeating} - Let's focus on click modelling and zero distractions until we have running code. It can also be another concrete part of your thesis. That click modelling is sort of the ground truth, but is lacking in the real world. Deterministic random!?! https://scholar.google.com/scholar?q=click+modelling&pagesize=100

This thesis presents two experiments on learn-to-rank. First is a validation experiment. The search corpus uses tags as the search terms. Our starting point is a music dataset with tags (256 albums x 3 tags/item = 768 total tags). We crafted 32 synthetic users which each have a unique ClickLog. We use the tags as search terms and round-robin rotate through them (e.g. recycle). The Air Your Grievance album has tags post.metal, instrumental, and post.rock. Users 1, 2, and 3 issue the search terms post.metal, instrumental, and post.rock and all click on the same item: the Air Your Grievance album.

Users also search for multiple items. User 1 only has 1 item in their ClickLog, item number 1. User 2 has all items of User 1 plus 1 additional item from the 256 albums. User 3 has all items of User 2 plus 1 additional item from the 256 albums, etc. User 32 thus has 32 items in their ClickLog and the associated tags. User 1 and User 2 behave very similarly; they have 50% overlap in their ClickLogs. User 31 and User 32 are extremely similar; they overlap for 31 of 32 items. Each of the 32 users differs in the tags they searched for. We also generate for each of the ClickLog entries of our 32 users the tag they used for searching. We use the unique 768 tag-item pairs to simulate their search terms and ClickLog entries. The following graphs present the outcome of using this synthetic clicklog within our experiments. Each user searches for one tag associated with their item (out of 768 item-tag pairs).
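A rough sketch of how such a nested synthetic ClickLog could be generated (tag and field names are illustrative placeholders; the real corpus uses scraped music tags):

```python
import random

random.seed(42)  # deterministic, in the spirit of "deterministic random"

N_ALBUMS, TAGS_PER_ALBUM, N_USERS = 256, 3, 32
# Hypothetical tag vocabulary standing in for the scraped tag corpus.
albums = {a: [f"tag_{a}_{t}" for t in range(TAGS_PER_ALBUM)] for a in range(N_ALBUMS)}

clicklogs = {}
previous_items = []
for user in range(1, N_USERS + 1):
    # User k keeps everything user k-1 clicked, plus one new album.
    new_item = random.choice([a for a in albums if a not in previous_items])
    previous_items = previous_items + [new_item]
    clicklogs[user] = [
        {"user": user, "query": random.choice(albums[item]), "clicked": item}
        for item in previous_items
    ]

print(len(clicklogs[1]), len(clicklogs[32]))  # 1 and 32 items respectively
overlap = len(set(e["clicked"] for e in clicklogs[31]) &
              set(e["clicked"] for e in clicklogs[32]))
print(overlap)  # 31 of 32 items overlap: users 31 and 32 are extremely similar
```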

We meticulously designed this validation experiment to converge. If a user merely clicks randomly we also expect a random outcome and are unable to see bugs or flaws. Our artificial users are highly similar and we can utilise this strong signal to converge. This provides some evidence for the correctness of our implementation.

Our above synthetic experiment enables us to compare the performance of three ML algorithms.

awrgold commented 1 year ago

Something I noticed immediately:

With the 39 users (I upped it to 39 because there are 39 tags, so each user searches explicitly for one unique tag and gets a set of unique results associated with that tag), the earlier nodes only have a few items in their clicklogs. If we're basing search results on the clicklog, these lower-ID users have fewer options to search from and cannot be exposed to new items unless we start introducing them to new items with the same tags; over a large number of searches this becomes virtually indistinguishable from just giving N users a clicklog of N items.

Otherwise, the other direction is if we were to allow nodes to search the clicklogs of other nodes for "inspiration" when gathering potential search results.

awrgold commented 1 year ago

Right now I have a clicklog with 39 unique entries - each unique node searches for one tag, and the next node appends the previous node's clicklog + 1 new item, for 39 entries. This is the baseline for the simulation. Thereafter, random nodes search for random tags, and we check for previous searches and clicks for that tag. Previous clicks are aggregated into a top-10 list presented in descending order by click count, so each [tag, song] pair is presented to the user ranked by popularity.
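A toy sketch of this per-tag top-10 step (field names are illustrative; the lowest-songID tiebreaker mentioned further down is included for determinism):

```python
from collections import Counter

def top10_for_tag(clicklog, tag):
    """Count previous clicks for this tag and return up to 10 songs,
    most-clicked first, ties broken by lowest song ID."""
    counts = Counter(entry["song_id"] for entry in clicklog if entry["tag"] == tag)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song_id for song_id, _ in ranked[:10]]

clicklog = [
    {"tag": "jazz", "song_id": 7}, {"tag": "jazz", "song_id": 7},
    {"tag": "jazz", "song_id": 3}, {"tag": "rock", "song_id": 3},
]
print(top10_for_tag(clicklog, "jazz"))  # [7, 3]
```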

However, right now if there are fewer than 10 clicked items for that tag, I have made a few separate functions:

One of the questions I have for the future has to do with the node's local library vs. the libraries of other nodes. Do we want to start investigating the libraries of other nodes, what is popular to them, and integrate that in the ranking algorithm? Or do we stick purely to local clicklogs?

Another question is whether we stick to local clicklogs, or integrate a gossip mechanism for passing around updated clicklogs periodically, since eventually the local clicklog becomes saturated with just the results for a specific node, meaning it doesn't learn from the behavior of other users. I certainly find this interesting, but it would involve "assuming" some sort of gossip mechanism, or it would require me to implement one.

https://github.com/awrgold/distributed-learn-to-rank/blob/main/Building%20v3.ipynb

synctext commented 1 year ago
awrgold commented 1 year ago

Good news: I have graphs. Bad news: they don't look pretty.

I spent the last week fixing a few critical errors in logic I made a while ago, and then finally early this week shifted towards making the graphs. I can show the distance between the globally optimal ranking and the local results a node sees when it makes a specific query, but the distance is wildly volatile.

The reason why it's volatile is because I'm only considering the distance between items that SHOULD be on the list (those that appear in the globally optimal ranking) and those that ARE on the list (those that a node sees when he performs a search). Any item that a node sees that shouldn't be there is ignored (because the distance is technically infinity).

For example: node 1 searches for jazz at time T and gets the following songIDs as a result: [1, 2, 3, 4, 5]. Globally speaking, at time T the optimal ranking for jazz is: [1, 7, 93, 95, 5]. Because the intersection of these two rankings is [1, 5], I can only see the distance of indices for these two results. Therefore, the distance is [0, 0] = 0, which indicates a "perfect ranking" when in fact there were 3 erroneous results given to the node ([2, 3, 4]).
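A small sketch of this intersection-only distance metric, reproducing the example above (illustrative code, not the notebook implementation):

```python
def rank_distance(local_ranking, global_ranking):
    """Distance metric as described above: only items that appear in BOTH
    rankings contribute, via the absolute difference of their positions.
    Items shown to the node that are absent from the global optimum are ignored."""
    shared = set(local_ranking) & set(global_ranking)
    if not shared:
        return None  # nothing to compare
    return sum(abs(local_ranking.index(i) - global_ranking.index(i)) for i in shared)

local_results  = [1, 2, 3, 4, 5]    # what node 1 sees for "jazz" at time T
global_optimum = [1, 7, 93, 95, 5]  # the globally optimal ranking at time T
print(rank_distance(local_results, global_optimum))  # 0, despite 3 erroneous results
```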

So what my graphs show is a pretty volatile result. When a user performs a search for a term and chooses a completely random result: image

When a user performs a search for a term and has a 50% chance of choosing the top result, and a 50% chance of choosing another random result: image

So it's clear that I need to rethink how this metric is being calculated, and what it is exactly that I'm trying to investigate.

Edit: mind you, the reason why it is so volatile is likely because (1) I am not gossiping local clicklogs to other nodes yet, and (2) the search algorithm is gathering results from other nodes without exception, meaning that I have a local clicklog with global access to all songs in the network.

Edit2: This is what the global clicklog looks like when a specific term has been searched for.

image

synctext commented 1 year ago
awrgold commented 1 year ago

image

When we start with 32 nodes and perform 100 random searches, the number of nodes whose library contains the most popular song for each tag (assuming that when they click on a song, they add it to their library) is shown above.

When we allow gossiping, the number of most popular songs starts to grow, but of course the number of nodes with whom we gossip directly influences this.

Where each node regularly gossips their clicklog with exactly 1 other node: image

Where each node regularly gossips their clicklog with 5 other nodes: image

When we increase the number of searches from 100 to 500, we see a much larger growth in the number of nodes containing the most popular song for each tag: image

What we see is that after a large enough number of searches, the graph begins to even out and creep towards all 32 nodes having seen the most popular song for each tag. 5000 searches with 5 gossip targets per node:

image


However, when we start growing the network, so long as we don't increase the # of gossip targets proportionally with the network growth, we see a slower convergence. 100 nodes with 1000 searches, each with 5 random gossip targets per gossip round: image

This is where a few obvious things start coming into play:

This is my next TODO, besides for writing.

synctext commented 1 year ago

Related work: {feel free to ignore these network protocol experiments, for ML experts} The prior cycle of decentralised research produced many ideas and publications. In 2008, Cornell proposed a decentralised approximate keyword search algorithm with some naive measures against spam, poisoning, and Sybils. Relevance ranking was out of scope in those years (2008). image The above paper introduced the concept of progress ratio. The next experiment explores a similar concept. We vary on the X-axis the spread of the ClickLog in a stationary environment. We then determine the effectiveness of searches for various system loads of ClickLog spreading. In a network with 1000 participants all conducting active searches concurrently, we can estimate that a single ClickLog update consists of a 1 KByte cryptographically signed message. 1000 people x (1000-1 receivers) x 10 queries x 1 KByte = 10 GByte load for full real-time spreading without duplicate messaging.
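The back-of-the-envelope load estimate, written out as a quick sketch using the numbers above:

```python
# Load for full real-time ClickLog spreading without duplicate suppression,
# using the figures from the comment above.
participants = 1000
receivers = participants - 1     # everyone except the sender
queries_per_person = 10
message_kbyte = 1                # one cryptographically signed ClickLog update

total_kbyte = participants * receivers * queries_per_person * message_kbyte
print(f"{total_kbyte / 1e6:.1f} GByte")  # ~10.0 GByte
```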

awrgold commented 1 year ago

I just remembered, I still need to do a literature survey for the 10 ECTS... How will we handle this?

awrgold commented 1 year ago

We vary on the X-axis the spread of the ClickLog in a stationary environment.

In this case, do you mean that our X-axis is the size of the set of gossip targets and/or the rate of gossip/broadcast?

We then determine the effectiveness of searches for various systems loads of ClickLog spreadings.

"Effectiveness of searches" is a bit vague to me here. Based on the following example, I almost get the feeling that you're interested in seeing the rate of growth in file size of the clicklog as it propagates throughout the network. However, you also mention "system loads" which could mean a wide variety of things. Could you clarify this second point?

Also: https://www.overleaf.com/read/wjydsnzwtytb

awrgold commented 1 year ago

Please finish the initial thesis experiment with a trivial-by-design graph

At this point I just need to determine what kind of click modeling I'm going to do, if any. We have the "lowest songID tiebreaker" but if there's no tie, it will "flip a coin" where it chooses the top result 50% of the time, and a random result in the list 50% of the time, to add some variance.

We need to decide if this is sufficient, or how we want to model this further. If we wanted to go further into the statistics, we could use a skewed sampling method for weighting songs with lower songID's more heavily, so that we can demonstrate that by gossiping the clicklog the most popular songs across all devices start to converge towards a global truth.

I could contrast this with a uniform random sampling that shows that if users choose returned results randomly, over a sufficiently large number of queries we will still see a convergence towards the most popular songs. Mind you, the rate of convergence depends entirely upon how many results per query are shown to a user. The higher the number of results shown, the longer it will take for results to converge across nodes.

The question, then, is what it is exactly we want to demonstrate at the end of the day.

awrgold commented 1 year ago

image

The median % of nodes containing the most popular song per tag, over time. Right now, gossip rounds occur every 100 queries (globally) and each node gossips with 10 nodes per gossip round.

As we can see this grows logarithmically. (This is for a 100-node network)

awrgold commented 1 year ago

Updated notebook: https://github.com/awrgold/distributed-learn-to-rank/blob/main/Testing%20v1.1%20-%20Expanded%20Clicklog.ipynb

synctext commented 1 year ago
awrgold commented 1 year ago

image

Result for 1000 nodes. As the network grows large, the growth rate slows dramatically.

awrgold commented 1 year ago

Thesis Timeline

I was told to toot my own horn in public here

awrgold commented 1 year ago

Another thesis milestone: you have a 1000-people simulated graph. That is what you need for graduation level. 10k in-memory is obviously cooler :-)

  • wrap up with: new/enhanced/tweaked algorithm (compare with self-build basic algorithm)

I'm working on an updated/enhanced version of the algorithm, particularly focusing on utilizing a ranking scheme based on a distance metric between users (based purely on known clicklog entries).

I'm currently in the final stages of building a Jaccard distance-ranking scheme where a node looks at its clicklog to determine which nodes have the highest similarity of searches and clicks to itself (utilizing the Jaccard distance).

I am standardizing the distance metric by dividing each node's relative score by the maximum score in the list, so the "most-similar" user to another has a score of 1.0, and the least similar 0.0.

This allows nodes to increase/decrease in relative similarity score as clicklogs grow larger, without having an ever-growing distance metric.
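A minimal sketch of this normalised Jaccard scoring (illustrative names, operating on bare item sets for simplicity):

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity between two sets of clicked items."""
    if not items_a and not items_b:
        return 0.0
    return len(items_a & items_b) / len(items_a | items_b)

def normalised_similarity_scores(own_items, items_per_node):
    """Score every known node against our own clicked items, then divide by
    the maximum so the most similar node gets exactly 1.0."""
    raw = {node: jaccard_similarity(own_items, items)
           for node, items in items_per_node.items()}
    top = max(raw.values()) or 1.0  # avoid division by zero when all scores are 0
    return {node: score / top for node, score in raw.items()}

own = {1, 2, 3, 4}
others = {10: {1, 2, 3}, 11: {3, 4, 5, 6}, 12: {7, 8}}
print(normalised_similarity_scores(own, others))
# {10: 1.0, 11: 0.44..., 12: 0.0} -- scores stay bounded as clicklogs grow
```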


We have a meeting on Monday, but the last week has been pretty uneventful for me. I did a solid programming sprint on that uber-long flight from London to Phoenix, but besides for that the last week has been relatively low productivity. Next week I have a few days before Thanksgiving, and then it will be another long 4-day weekend.

After Thanksgiving (28th November) I'll be back to working full-time on this. I won't be working at the startup much anymore so it'll be a true full-time sprint from here on out.

synctext commented 1 year ago

Reviewing your draft Master thesis on Overleaf

In many such settings, the parameters
of the model (e.g. the weights and biases of a trained neural
network) are often passed via messages between nodes in the
network, training on the local data before passing the model
parameters to the next node for further training.

An example of this related work: Backdoor Defense in Federated Learning Using Differential Testing and Outlier Detection and cosine similarity + reputation: CONTRA: Defending against Poisoning Attacks in Federated Learning

@devos50 insight: this is not one-shot learning, it's continuous learning. Related work: Lifelong machine learning: a paradigm for continuous learning. Emphasise that your work is light-weight, or cite smartphone-related work: Cost-Efficient Continuous Edge Learning for Artificial Intelligence of Things. This suggests the following master thesis title: P2P search engine using continuous learning for edge devices. {with extreme decentralisation: P2P Search Engine based on Extreme Decentralisation and Continuous Learning.}

Two week sprint focus: finish writing of first thesis sections (FINAL thesis quality writing please)

awrgold commented 1 year ago

Two week sprint focus: finish writing of first thesis sections (FINAL thesis quality writing please)

Did one of your recent edits remove the changes that we wanted to see in the first two sections? I remember seeing some bullet points regarding changes (I see the one regarding the problem description), but I remember during our meeting that we had decided to focus more on the decentralized unsupervised aspect of everything.

Were there any other points besides the problem description re-write?

awrgold commented 1 year ago

image

You might like this graphic: across all nodes, for every gossip round, I looked at the mean and median distances between the global best ranking and the rankings each node would get for each query term.

Shows that after just a few gossip rounds most nodes are very close to the best possible ranking for all query terms.

I did a scatter plot of the same metrics, but instead of just the mean and median, I plotted the average distance of each node's searches from the global optimum:

image

EDIT: Here's the same data, but instead a random subset of 10% of the nodes gossip to a random subset of 50% of the network (more of a multicast/broadcast than a gossip, I know...) whereas the above graph is where 10% of the nodes gossip to 10% of the network.

image

Analysis also showed that the outlier in the first graph was a node that only received one round of gossip at the very beginning (bad luck I guess) and therefore was not able to improve its ranking scores.

Also, what about this format? I need to find a better title because the current one isn't exactly descriptive of what the data says, but I like how we can show the progression over time:

image

awrgold commented 1 year ago

Besides for writing, the only thing I'm working on code-wise (besides visualizations) is trying to determine why the outliers are not converging over time.

My theory is this: my simulation environment is 100 nodes and 10,000 queries, and each gossip round selects 10% of the nodes in the network uniformly at random. Because a few terms have fewer queries (purely by statistical chance) and a few nodes receive less gossip than others, the clicklogs containing these queries are not shared as frequently.

One of the questions I have for our upcoming meeting involves how we want to discuss the influence of gossip on this model. Do we want to have various simulations with varying gossip parameters? The size and completeness of local vs. global clicklogs has a very strong influence on the performance of the model.

All of my simulations so far have been gossiping to a static number of nodes. I have a good feeling that I can improve results quite a bit by one or both of the following:

  1. (trivial) Increasing the number of nodes per gossip round
  2. (relatively easy but not trivial) All nodes currently have a list of nodes that they're "aware of," which plays a part in how they're being recommended results. I can leverage this list of nodes that they're aware of to ensure that they gossip with some nodes they are familiar with, as well as passing along messages to other nodes they are unfamiliar with.

Option (2) will lead to a more rapid dissemination of clicklog data in early gossip rounds, at the expense of early-round clustering of users, meaning in early stages users will have their search results more heavily influenced by the overall number of clicks an item receives, with the drawback being that user clusters will take more gossip rounds to coalesce. The question is, how deep into gossip do we want to go?
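A sketch of what option (2) could look like (the fanout split and the simulator-level knowledge of the full node list are assumptions for illustration):

```python
import random

def pick_gossip_targets(known_nodes, all_nodes, fanout=5, familiar_fraction=0.6):
    """Spend part of the gossip fanout on nodes we already know from our
    clicklog, and the rest on nodes we have never interacted with, so
    clicklog data disseminates faster in early rounds."""
    known = set(known_nodes)
    unknown = [n for n in all_nodes if n not in known]
    n_familiar = min(len(known), int(fanout * familiar_fraction))
    targets = random.sample(sorted(known), n_familiar)
    targets += random.sample(unknown, min(len(unknown), fanout - n_familiar))
    return targets

# all_nodes here stands in for the simulator's global node list.
print(pick_gossip_targets(known_nodes={1, 2, 3}, all_nodes=range(20)))
```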

synctext commented 1 year ago
awrgold commented 1 year ago

This idea is a huge attack vector.

If we decide to drop this, then we're back to where we were about a month ago. If I'm to improve upon what I've had this whole time, I need to be able to derive some sort of insight from the clicklog beyond just clicks.

Is the clicklog itself not just the same exact vector? I understand that sybil attacks are a thing but even if I weren't doing this, we'd still have the same exact vector in place: i.e. sybil attack via clicklog saturation, replay attacks, altering gossip messages, etc?

I feel like my model has quite a bit of vulnerability to sybil attack in the first place, so I had kind of assumed we would be abstracting that problem away and then utilizing other mechanisms outside the model itself to mitigate such threats...

Title suggestions: P2P Search engine without any training dataset

The first one is just a longer way of saying unsupervised learning; I would keep it as "unsupervised learning".

P2P search engine using continuous learning for edge devices

Continuous learning, sure. Edge device seems like a non-personal device, though, no? Though I guess by definition it exists at the boundary of a network, so yeah.

Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network

How does that sound?

synctext commented 1 year ago

the winner is! Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network.

Just show in 1 or 2 graphs the impact of Sybils and fake ClickLogs. Conclusion: the trust problem needs solving. Measure CPU somehow to show that you have a light algorithm, as simple as possible. Just laptop graphs are fine too.

synctext commented 1 year ago

Comments on this latest thesis: LTR_Thesis_v1 (1).pdf. Plus, 3 weeks later, an update: LTR_Thesis_v1 (3).pdf. Below is a hopefully strictly decreasing ToDo list until master thesis completion.

awrgold commented 1 year ago

This is the performance if every single gossip round includes 5 sybil nodes performing a large number of queries and then broadcasting their clicklog to the entire network. Every gossip round introduces 5 new sybil nodes, so the number of attackers increases over time. I'm working on a parallel sim of a fixed number of sybil attackers, as well as a sim that has gaps between sybil attacks, to see if performance can improve in between sybil attacks.

If we assume an attacker has the capability of finding every node in the network, this is the worst case performance. If we assume that broadcasting to the entire network is impossible (since nodes do not forward gossip, which I guess means we should call it a multicast and not a gossip message, right?) and that nodes can only send messages to a select subset via some messaging protocol, then we should see less shitty performance.

If we increase the number of rounds between spam attacks, such that they're more sporadic, I have a feeling that we'll see the model rebound as the distance between spam attacker behavior and normal user behavior starts to diverge.

image

image

awrgold commented 1 year ago

Alright, so I fixed a few errors, a key one being that I was computing the "global optimal ranking" while including the sybil nodes. If I remove sybil nodes from consideration of what "global optimal ranking" means, the results dramatically improve.

What I also noticed is that in my previous graphics, those were a demonstration of what would happen if attackers were capable of broadcasting their gossip to the network.

However, if gossip were performed more "securely" e.g. nodes cannot pre-select themselves to be included in gossip rounds, and nodes cannot broadcast but must multi-cast to a subset of the network, then the threat of sybil and spam attacks becomes significantly reduced.

That being said, this merits at least a brief discussion on the topic I believe. I know that there are various ways of conducting "secure gossip" without a centralized coordinator, but how much I want to mention this in the thesis is something we should discuss.

Anyways, in the scenario where sybil or spam attackers can flood their local clicklogs or purposefully click on low-ranked results to attempt to undermine the algorithm, and then are forced to wait until a proper gossip round occurs in order to disseminate their infected clicklogs, the performance naturally is much more stable and predictable.

image

Note: I'll include a vertical line on the graph to indicate when infected gossip gets transmitted.

synctext commented 1 year ago

However, if gossip were performed more "securely" e.g. nodes cannot pre-select themselves to be included in gossip rounds

{brainstorm no need to deep dive into gossip, just a suggestion if this matches your results}

Push versus pull experiment

After a validation and a deliberately simple experiment, we now introduce attackers. Within this third experiment we introduce a number of malicious peers which conduct a Sybil attack. This experiment shows the dramatic impact of push versus pull gossip. It proves that it is vital for security that malicious peers cannot easily insert their polluting content at honest peers, as a push architecture allows. With a pull architecture, peers are more autonomous and decide individually on the speed of incoming information, whether they trust another peer, or may randomly sample from discovered peers. Malicious peers in this experiment try to push two messages per gossip round. With the pull architecture, only 1 message per gossip round is accepted. As Internet bandwidth is cheap, this simple experiment shows that a first line of defence against ClickLog spam is a pull architecture.
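A minimal sketch contrasting the two architectures in terms of accepted messages per round (illustrative only, not the thesis simulation code):

```python
import random

def push_round(peers, malicious, messages_per_peer=2):
    """Push: the sender (possibly a Sybil) decides who receives its ClickLog,
    so each malicious peer injects two polluting messages per round unchecked."""
    deliveries = []
    for sender in malicious:
        targets = random.sample([p for p in peers if p != sender], messages_per_peer)
        deliveries += [(sender, t) for t in targets]
    return deliveries

def pull_round(peers):
    """Pull: each peer decides for itself whom to ask and accepts at most
    one incoming ClickLog per round, capping the spam rate."""
    requests = []
    for peer in peers:
        source = random.choice([p for p in peers if p != peer])
        requests.append((peer, source))  # exactly one accepted message per peer
    return requests

peers = list(range(10))
print(len(push_round(peers, malicious=[8, 9])))  # 4 pushed (potentially polluted) messages
print(len(pull_round(peers)))                    # 10 pulls, 1 accepted message per peer
```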

G-Rank related theory work: Decentralized Stochastic Optimization with Client Sampling, EPFL

In this work, we study Decentralized Stochastic Gradient Descent (D-SGD) with node subsampling, i.e.
when only s (s ≤ n) nodes are randomly sampled out of n nodes per iteration. We provide the
theoretical convergence rates in smooth (convex and non-convex) problems with heterogeneous
(non-identically distributed data) functions.

Sprint till 29 December: no new experiments. Focus on polishing, polishing, and polishing. Not a tutorial, a specification. Reproducibility is still 0; a master student in your field should be able to make roughly the same graphs.

awrgold commented 1 year ago

So I have been thinking about how the simulation is "bootstrapped." A while ago we decided that Node N would perform a search before the simulation and append it to its clicklog, followed by appending the clicklog of Node N-1, such that Node 99 has the clicklogs of every node and Node 0 has only its own clicklog.

I think that at this stage, this is a bad way to bootstrap because what is essentially happening now (since I've removed all central coordination) is that nodes are gossiping to nodes they've received gossip from in the past, since the only knowledge of other nodes that they have is the clicklog itself. This leads to the situation where Node 99 is able to gossip to anybody, but lower-ID nodes can barely gossip at all until they (hopefully) receive some gossip from a more knowledgeable node.

Is this a good idea?

I was thinking that I could bootstrap the network where all nodes start off with a roughly even-sized (randomly initialized) clicklog where the nodeID is randomly initialized, so that all nodes have a roughly even-sized knowledge base of existing nodes, which means that gossip will be much more uniformly distributed.

What do you think @synctext?

synctext commented 1 year ago

Good find, interesting content to discuss within your thesis! Within the Section dataset and experimental setup

Discovery bias experiment

Finally, we conduct a small experiment to demonstrate the sensitivity of the node discovery process within distributed machine learning. Realistic simulations lack any centrality and thus have no central coordinator to discover other nodes. Our design integrates node discovery using the ClickLog itself! Thus a single ClickLog message provides both overlay network information for gossip and machine learning training data. The following measurement results demonstrate how our simulation is bootstrapped. Node 0 performs the first search, the start of the simulation. The ClickLog is generated and sent to Node 1. The incoming ClickLog at Node 1 is processed and leads to the discovery of Node 0. The next query is.... The following graph shows for each of the 100 Nodes (listed on X-Axis) the amount of nodes (Y-Axis) they are aware of after the first 100 rounds of bootstrap gossip. This result shows that this deliberately simple bootstrap method leads to a highly skewed distribution in the number of gossip neighbors. Several more rounds of random gossip are required to obtain a healthy and balanced gossip overlay.

awrgold commented 1 year ago

The following graph shows for each of the 100 Nodes (listed on X-Axis) the amount of nodes (Y-Axis) they are aware of after the first 100 rounds of bootstrap gossip.

This will just be a 45-degree line from bottom left to top right; is that worth including, do you think?

Several more rounds of random gossip are required to obtain a healthy and balanced gossip overlay.

Do we perform this gossip or just mention it?

synctext commented 1 year ago

Yes, a clarifying 45-degree line. It proves a point and validates the environment. If it looks nice, then also show how long the required warm-up period is for a balanced overlay.

awrgold commented 1 year ago

So I just realized another major issue: if we instantiate malicious nodes after the network has been bootstrapped, they will never discover other nodes as they will never receive gossip.

Right now, I have no "node discovery" mechanism, so the gossip really isn't "gossip," it's just a multicast. When a node receives gossip, it does not forward it along to other nodes.

As such, both benign and malicious nodes can join the network but will never receive any gossip because they do not appear in the clicklogs of other nodes. Therefore I wonder if it's a good idea to have a "bootstrap" mechanism for new nodes in this network, because right now we're unable to simulate any attacks. How does Tribler peer discovery work? Should we integrate that specific mechanism, or should I use the basic Gnutella peer discovery bootstrap mechanism (a set list of "home" nodes given to each new node that joins the network)?

synctext commented 1 year ago

Let's see.... yeah. Do a simple decentralised "ring discovery". Mention this mechanism and its name explicitly in your thesis. Every node obtains a unique node number upon bootstrapping (e.g. an identity). It only knows about 1 other node. Node 0 automagically is introduced to node 1. -plus- Same construct for new spam nodes arriving later.
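A minimal sketch of such a ring bootstrap (illustrative; the direction of the introduction, node k knowing node k+1, is an assumption):

```python
def ring_bootstrap(n_honest, n_spam=0):
    """Every node gets a sequential ID and initially knows exactly one other
    node, its successor on the ring; spam nodes arriving later follow the
    same rule and so also start with a single neighbour."""
    total = n_honest + n_spam
    return {node: [(node + 1) % total] for node in range(total)}

print(ring_bootstrap(5))
# {0: [1], 1: [2], 2: [3], 3: [4], 4: [0]}

overlay = ring_bootstrap(5, n_spam=2)
print({node: overlay[node] for node in (5, 6)})  # spam nodes also know exactly one node
```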

awrgold commented 1 year ago

Same construct for new spam nodes arriving later.

We need to be careful here, though - I don't think it's realistic to say that spamNode1 gets bootstrapped by spamNode0 because a realistic network would need its own mechanism for keeping track of the chronological order in which nodes join the network, so saying that Node N bootstraps Node N+1 would be unrealistic.

Instead, I propose using something similar to Gnutella's GWebCache system where it tries to bootstrap a node with some semi-randomly selected neighbors that the network is already aware of. (afaik there was considerable debate about GWebCache's security and effectiveness, but I digress...)

I propose that we bootstrap the initial network with the Ring Discovery method you described, but for the attackers a random node is selected to bootstrap a new incoming node with either a subset or entirety of its clicklog. If we use the sequential node.nodeID as the mechanism for bootstrapping, nodes 0-N will be bootstrapped by 1 other peer, but then all spam nodes will be bootstrapped by each other, with the exception of spamNode0 who gets bootstrapped by node N.

This essentially means that all spam nodes will be bootstrapped with the exact same clicklog. The exception is when a spamNode joins the network, begins spamming, and then another spamNode joins: that newcomer gets bootstrapped with all of spamNode K-1's clicklog, which is almost entirely spam, and it therefore has only one potential non-spam target to spam (node N).

Does that make sense?

awrgold commented 1 year ago

I need some clarification as well: last meeting we discussed changes to the "Experimentation" section, where we said we'd have a separate section describing the dataset.

However, my notes from the meeting are ambiguous: I have written down "Dataset and Experimental Setup: Completely different section" as an item. Does this mean that "Dataset" is a section entirely on its own, and then "Experimental Setup" is also its own section? Or do we have a "Dataset and Experimental Setup" section?

If it's the former, the Dataset section I have is very very short - one single paragraph, and I can't make it longer without just writing filler bullshit.

If it's the latter, will the progression of the thesis paper go "Dataset and Experimental Setup" -> "Results" -> "Conclusion" or will it go "Dataset and Experimental Setup" -> (something else) -> "Results" -> "Conclusion"?

Right now I feel like it should be "Dataset and Experimental Setup" -> "Results" -> "Conclusion" but then that means that nothing has changed, except the name of the section.

synctext commented 1 year ago

See all my proposed sections: https://github.com/Tribler/tribler/issues/5313#issuecomment-1342886464

Good point on the ring bootstrap issue. Feel free to describe something smarter and more realistic. But I would do that as a bonus, after all sections are at final thesis polish.

awrgold commented 1 year ago

So big news - I discovered a majorly critical error in judgment that I had been making:

I had been telling each node to keep track of its gossip progress so that whenever it came time to gossip again, it wouldn't gossip its whole clicklog - just the parts since the last time it gossiped.

However, the way I had implemented it meant that if a node was gossiping to 5 other nodes, the first of the 5 would get a full gossip message and the other 4 would get nothing (since the progress marker was already updated after the first send). I changed this such that all nodes gossip their entire clicklog in each gossip message (messages are still small at later stages of the simulation; huge clicklogs are still under 500KB).
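A minimal sketch of the described bug and the adopted fix (illustrative code, not the notebook implementation):

```python
def gossip_delta_buggy(clicklog, targets, last_sent_index):
    """The buggy behaviour described above: the 'already gossiped' pointer is
    advanced after the FIRST target, so targets 2..N receive an empty delta."""
    sent = {}
    for target in targets:
        sent[target] = clicklog[last_sent_index:]
        last_sent_index = len(clicklog)  # bug: updated inside the loop
    return sent

def gossip_full(clicklog, targets):
    """The fix actually adopted: every target receives the entire clicklog."""
    return {target: list(clicklog) for target in targets}

clicklog = ["entry1", "entry2", "entry3"]
print(gossip_delta_buggy(clicklog, targets=["A", "B", "C"], last_sent_index=0))
# {'A': ['entry1', 'entry2', 'entry3'], 'B': [], 'C': []}
print(gossip_full(clicklog, targets=["A", "B", "C"]))
```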

By changing this, the accuracy and tenacity of the model becomes significantly better, even after an attack. Now that nodes are properly sharing their clicklogs, the model is much more robust at delivering accurate results.

However, once the spam nodes arrive, the simulation time becomes intractable. I went from a 10h full simulation to (estimated) 200h simulation. I canceled that simulation after about 11 hours, because once the spam attack happened, it would take roughly 10 minutes per query. This tells me that something regarding mass gossip with full clicklogs is extremely inefficient, even when I cast to lower-level numpy arrays to speed things up.

As such, I'm thinking of a few compromises:

Thoughts?

(And yes, I know you want to focus on writing, but I was running simulations in the background while writing and noticed some major errors (the model getting progressively worse over time), which made me do a bit of debugging, where I realized my mistake - now we're back to running sims while writing.)