Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

literature survey + master thesis: G-Rank learn-to-rank #5313

Closed. synctext closed this issue 1 year ago.

synctext commented 4 years ago

Direction changed, txt will be updated soon.

Old stuff:

synctext commented 2 years ago

Discussed Twitter & trading. Tip: https://www.helium.com/ Please use the simple story outline in the notes above. Linkage to science, such as the principal-agent problem and the term "trustless", is much appreciated. Next week: overleaf commenting.

synctext commented 2 years ago
awrgold commented 2 years ago

Deleted my last comment as it sounded a bit whiny. Here's some thoughts, since we missed each other today:

I'm increasingly impressed with OlympusDAO's governance, in that they continue to pursue further decentralization while demonstrating a pretty robust governance structure. However, this doesn't mean I don't think it's still susceptible to undermining, hostile takeovers, and the like. While doing some background reading I repeatedly came across the concept of the "Iron Law of Oligarchy," a thesis posited by the early 20th-century sociologist Robert Michels, who claimed that all organizations, businesses, and governments tend towards oligarchy because those who (1) spend disproportionate amounts of time and effort in governing these entities, and (2) possess the financial/social means to do (1), end up consolidating power and influence.

Then I started diving into everything from Machiavelli to Chomsky, Lenin, Marx, Rousseau, and the like. I ended up spending too much time reading political philosophy, supporting everything from direct democracy to pure elitism. The reason I did this is that these philosophical discussions were actually very prevalent both in the background readings I was doing and in the discussions taking place in the DAO governance channels. Many members wax philosophic about their governance, and it seems to be a hot-button issue within at least several DAOs.

However, at this point my main struggle is this: I can easily discuss the shortcomings of many DAOs and the various means by which they can stabilize themselves, but to what depth do I include the philosophical? I can already hear you saying "none," but that doesn't stop me from being fascinated by it. Secondly, when it comes to some of the stuff you mention in the above comment, such as "general stuff on attacks," I cover things such as the Dark Forest + advanced predators, oligarchical tendencies, hostile takeovers, and other forms of undermining DAOs.

What I'm unsure about is to what degree I should prescribe changes versus just discussing the weaknesses and leaving things open-ended. I'm getting to the point of semantic satiation: the more I read what I've written, the less sense it starts to make to me.

I also started playing with the idea of a "social proof of work" concept, a rudimentary attempt at meritocracy, where DAO participants are rewarded for doing some kind of work. Even though this would be difficult/impossible to quantify in a human, is there some kind of way to create an organizational structure that defends against the current weaknesses of DAOs while still rewarding those who put the most effort into the organization?

synctext commented 2 years ago

Great insight into your progress and thinking: please stop reading more. My advice: don't stop writing, stop reading. Keep it simple, you just need to scratch the surface. Again, try to aim at the high-level view. Like, computer-science-leaning nerds are rediscovering the world of power and politics. Without knowledge of the deep roots going back to 1833, they ignore the lessons of historical failure and start again from a fresh DAO perspective: https://en.m.wikipedia.org/wiki/Time-based_currency 😁

:rofl: :roll_eyes: :rofl: don't talk about the election details!!! Asking technical questions will get you banned on some DAOs.

Comments on your draft report (overleaf)

synctext commented 2 years ago

Epic read by BIS: https://www.bis.org/publ/qtrpdf/r_qt2112b.htm

awrgold commented 2 years ago

Interesting article. They are right about the illusion of decentralization, but then they go on to say "DeFi needs to integrate regulatory agencies in order to work" - yet the whole purpose of DeFi is permissionless anonymous finance, so I think the authors just miss the whole point: it's not about decentralization, it's anti-establishmentarian and anti-taxation. Not picking a side on this one, just saying that any DeFi project that abides by KYC/AML is probably doomed to failure, or at least to being isolated from the rest of the market.

Although it's kind of ironic, since anyone who cashes out crypto in a major country is required to KYC to convert to fiat; this essentially creates a "soft KYC" system where users are tangentially KYC'd at the point of entry/exit, but not at the application level. But if a DEX or DeFi protocol required users to KYC in order to participate, I'd bet the house that it would be laughed off the stage.

Good references and charts to use in the survey though.

synctext commented 2 years ago

Thesis title and focus? "Zero taxation DAO technology"?? Digitize this and make a single-click tax evasion experience; see the Australian professor's work in this episode: https://www.npr.org/sections/money/2016/03/16/470722656/episode-390-we-set-up-an-offshore-company-in-a-tax-haven plus https://www.nytimes.com/2012/07/29/magazine/my-big-fat-belizean-singaporean-bank-account.html?_r=1&pagewanted=all A real academic source: Cambridge professor Jason Sharman's book is available: https://doi.org/10.1017/CBO9781107337848 Article at https://www.journalofdemocracy.org/articles/the-rise-of-kleptocracy-laundering-cash-whitewashing-reputations/ with this abstract:

> how globalization enables grand corruption, as well as the laundering of kleptocrats' finances and reputations. Shell companies and new forms of international investment, such as luxury real-estate purchases, serve to launder the ill-gotten gains of kleptocrats and disembed them from their country of origin. Critically, this normalization of "everyday kleptocracy" depends heavily on transnational professional intermediaries: Western public-relations agents, lobbyists and lawyers help to recast kleptocrats as internationally respected businesspeople and philanthropic cosmopolitans.

synctext commented 2 years ago

brainstorm. Do a 2-week sprint. The goal is 5 scientifically publishable graphs for Thesis/Arxiv. Strategy: minimal sustainable DAO. Avoid the useless generic-framework trap. Go for the "wiki" route: extendable, open-ended, and flexible. Specific, simple, and actionable: something that works. So focus on actual usability in some Bitcoin-based scenarios, such as: weekly payment for code, documentation, videos, blog posts, labour, or content in general. Weekly payment is meticulously selected to avoid the withholding-of-payment problem and escrow issues. We assume a complete lack of trust and simply assume parties have an incentive to keep the weekly product/payment cycle going. Future work obviously to use reputation and CVs. See our prototype of DevID

First sprint goal: a Taproot 100-people transaction. State of FROST? https://github.com/Tribler/tribler/issues/5984#issuecomment-913454963 Lots of info

Poster title: "Zero taxation DAO technology"

synctext commented 2 years ago

Taproot is essential for a DAO that scales, without any dedicated coin for governance. Clear goal for the 2-week sprint: no trustchain, no superapp. Just determine the Taproot status and whether there is running code or features left for implementation.

devos50 commented 2 years ago

https://github.com/ElementsProject/secp256k1-zkp/pull/138 has matured quite a lot. I think it's the best candidate to generate Bitcoin-compatible Schnorr signatures.
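
For context on why Schnorr/Taproot matters for a scaling DAO: the scheme is linear, so n-of-n signatures can be combined into one. A textbook sketch only; this omits BIP340 details such as x-only keys and tagged hashes, and real protocols like MuSig2/FROST add nonce commitments to prevent rogue-key attacks:

```latex
\begin{align*}
\text{sign:}      &\quad R = kG, \quad e = H(R \parallel P \parallel m), \quad s = k + e\,x \\
\text{verify:}    &\quad sG \overset{?}{=} R + eP \qquad\text{(holds since } sG = kG + e\,xG\text{)} \\
\text{aggregate:} &\quad \Big(\textstyle\sum_i s_i\Big)G = \textstyle\sum_i R_i + e \textstyle\sum_i P_i
\end{align*}
```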

synctext commented 2 years ago

Lesson learned: DAO is not yet ready for a pre-production focused master thesis. Too immature.

If we decide to leave the DAO topic for something more machine learning, this is what I found: learning-to-rank {without using any Big Tech server}. Hopefully fits your skills better. This thesis would be the first step to the decentralisation of Google-like search engines: the step from an academic project without any sustained usage to actual Internet-deployment and sustained usage in the wild. That first step also means a narrow scope for feasibility. Focus is on Creative Commons content in Bittorrent (1-million-songs dataset), playlists, crowd-sourced tags, and possible creation of relevance-ranking ground truth. Solution: federated learning on Android.

Some state-of-the-art:

synctext commented 2 years ago

Master thesis kick-off meeting

awrgold commented 2 years ago

Federated ML on Android/edge devices in most cases relies on a centralized server for federated averaging and pushing model updates. In a decentralized network, numerous extra steps are required.
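
For intuition, here is a minimal numpy sketch (hypothetical, not Tribler code) contrasting the server-side averaging step with the serverless gossip alternative a decentralized network would need:

```python
import numpy as np

# Hypothetical sketch: 'models' are plain weight vectors; a real FL round would
# do local training between averaging steps.

def fedavg(client_models, client_sizes):
    """Centralized FedAvg: the server takes a weighted mean of client weights."""
    weights = np.array(client_sizes) / sum(client_sizes)
    return sum(w * m for w, m in zip(weights, client_models))

def gossip_step(my_model, peer_model):
    """Decentralized alternative: average with one random peer per round.
    Repeated pairwise exchanges converge towards the network-wide mean."""
    return (my_model + peer_model) / 2.0
```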

Steps to creating FML on Android:

Challenges:

synctext commented 2 years ago

Recent related work: https://arxiv.org/pdf/2105.04761.pdf "Federated Unbiased Learning to Rank", by Apple Inc.

awrgold commented 2 years ago

So this million-song dataset is, uh... difficult to utilize. It's over 10 years old, in an HDF5 format whose Java bindings require an older Java version, but when I try to create a Kotlin project with an appropriate JDK version I cannot build the project. I've looked at various solutions posted online with no success. The Million Song Dataset website has a "help" link for Java that leads to a 404 error: https://support.hdfgroup.org/products/java/hdf-java-html/

Stack Overflow has some thoughts, but none work for me at this moment: https://stackoverflow.com/questions/36385398/java-hdf5-library-install

A github repository converts it to CSV format via python: https://github.com/AGeoCoder/Million-Song-Dataset-HDF5-to-CSV

However, openCSV (a Java library) hasn't been working for me. A solution posted here is my primary reference, but I'm still getting errors reading the file: https://stackoverflow.com/questions/44061143/read-csv-line-by-line-in-kotlin

At this point I'm going to be making some progress in Python, and we can discuss how to port whatever it is that I do in Python back into Kotlin. There are some open-source solutions for running Python safely and robustly in an Android environment, such as BeeWare. I know this isn't a long-term solution but right now I just need to make some kind of progress, and can worry about making it work on Android in the future.
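
For what it's worth, reading the MSD files from Python is straightforward with h5py. A minimal sketch; the group/field names below mirror the official hdf5_getters layout but should be treated as assumptions to verify locally:

```python
import h5py

# Assumed MSD per-song layout: compound datasets under /metadata/songs and
# /analysis/songs (mirrors the official hdf5_getters; verify against your files).
with h5py.File("TRAXLZU12903D05F94.h5", "r") as f:
    meta = f["metadata/songs"][0]
    print(meta["title"].decode(), "-", meta["artist_name"].decode())
    print("tempo:", f["analysis/songs"][0]["tempo"])
```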

Edit: I'll start with the FMA user metadata so I can get a proper model up and running ASAP, and then shift towards the FML side of things next.

synctext commented 2 years ago

"An error occurred accepting the competition rules."

http://archive.ics.uci.edu/ml/datasets/FMA%3A+A+Dataset+For+Music+Analysis goes to http://archive.ics.uci.edu/ml/machine-learning-databases/00386/fma.txt, which points to https://github.com/mdeff/fma, and finally https://os.unil.cloud.switch.ch/fma/fma_full.zip: Creative Commons, 106,574 tracks from 16,341 artists. Social data: http://millionsongdataset.com/sites/default/files/thisismyjam/jam_to_msd.tsv http://millionsongdataset.com/thisismyjam/ Conclusion: you can spend 90% of your AI time on data engineering! Solution: shortcut for at least end of May '22.

awrgold commented 2 years ago

Apparently the Million Song Dataset was one of the first Kaggle challenges 10 years ago, and the highest F-score anyone ever achieved was 0.17...

synctext commented 2 years ago

Related work, please study and explain to me: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf

awrgold commented 2 years ago

Dataset Engineering.zip

Basic KMeans on engineered user data - results are crap but it does do clustering.
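
For reference, the baseline is roughly the sklearn sketch below; the feature matrix is a random stand-in for the engineered user data, not the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in user-feature matrix: rows = users, cols = engineered features
# (e.g. per-genre play fractions). Scaling matters: KMeans is distance-based.
X = np.random.rand(500, 12)
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_scaled)
print(np.bincount(km.labels_))  # cluster sizes; very lopsided counts hint at a bad k
```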

synctext commented 2 years ago
Starting project: simple gradient descent without any library bloat. Evolve further: Learning to Rank using Gradient Descent. Most simple possible machine learning and most simple distributed-systems algorithms: 1 packet per second of 1000 bytes outgoing as an update + parse 1 incoming packet per second. Key point is that exact filename keyword matching does not work nicely. Critical focus on distributed architecture and Internet-deployed federated learning-to-rank (self-organising). Focus is keyword search: term matching a set of known objects. (Idea: tags are a nice crowdsourced substitute.) Instead of a dataset, use a 10-item manual testset. Example of a dataset using the Clicklog approach:

| User | Term | Object | Rank when clicked |
|------|------|--------|-------------------|
| 4388 | Love song | My personal written declaration of love | 7 |
| 4388 | Soft Rock | quiet background passion in C | 11 |
| 4389 | Happy Hardcore | ... | ... |
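
A minimal sketch of the no-bloat pairwise gradient-descent ranker suggested above, in plain numpy. The features are hypothetical (e.g. term-match score, historical clicks); a sketch, not the prescribed design:

```python
import numpy as np

# Pairwise learning-to-rank by plain gradient descent (RankNet-style logistic
# loss on score differences). No libraries beyond numpy.
def train(pairs, dim, lr=0.1, epochs=50):
    """pairs: list of (x_clicked, x_skipped) feature-vector tuples."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            diff = w @ (x_pos - x_neg)
            # gradient of log(1 + exp(-diff)) with respect to w
            w -= lr * (-(x_pos - x_neg) / (1.0 + np.exp(diff)))
    return w

def rank(w, candidates):
    """Sort candidate feature vectors by learned score, best first."""
    return sorted(candidates, key=lambda x: -(w @ x))
```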

Use clicklog to gather from the live network with Creative Commons torrents??

awrgold commented 2 years ago

Part 1:

Here's what I understood about the problem so far (before our Friday meeting):

Part 2:

This last point regarding searching leads me to the next point on my mind: Is the focus here recommending users new content, or am I focusing primarily on serving up search results? From what we discussed a few months ago when we pivoted, it sounded like a recommender system was what we wanted: users interact with the app and are recommended new content based on their usage and the usage of other similar users (i.e. Collaborative Filtering).

In a distributed context, collaborative filtering is especially interesting as a topic because it naturally deals with sparse matrices - not only does it continue to work if I only have access to a subset of the data, but the manner in which the global dataset is partitioned is somewhat arbitrary - as long as users occasionally connect to the network and gossip with each other, the datasets will eventually update and models across devices will eventually converge. This approach not only allows users to be recommended new content, it means a model is not dependent upon a full dataset - the system only gets more powerful as more devices connect to the network.

However, our most recent conversation discussed Learning to Rank - a search algorithm that can indeed be used as a recommender system, but is thoroughly well-researched. This most recent meeting flipped my understanding of the entire project on its head. If I'm assuming I have access to data and I'm not focusing on consensus, gossip, or networking problems, am I just implementing a 20-year-old algorithm and then calling that a thesis? I'm less concerned about the type of algorithm, and more concerned about what the goal is (and the context of the goal).

If I'm doing something with Learning to Rank, to what degree does the decentralized network aspect come into play? If I abstract away the distributed aspect of this problem, is my goal just to implement a ranking algorithm and then create an API such that the global model is able to converge across devices? Again, what's the goal here?

Part 3:

You've said multiple times that you don't want me to use "bloated libraries," but many of these machine learning libraries are far more optimized than anything I could write in a thesis. Many are even designed for Android. Google even has a plethora of resources devoted specifically towards on-device machine learning, especially for transfer learning - a core tenet of federated machine learning.

However, if I'm just plugging in these algorithms and deploying them to a device, that's not a thesis - it's just a long homework assignment. If I'm writing them from scratch instead of using a library, I'm essentially (badly) reinventing the wheel - equally pointless and boring.

I'm of the opinion that there is zero chance I can improve upon the state of the art for any singular ML algorithm, nor can I improve upon the state of the art for a distributed protocol. The question is, then, what is a master's thesis-worthy problem statement here?

I think it's important that the research question is defined as soon as possible, so that I can dissect the problem into atomic pieces and start moving down a specific path instead of just moving in a general direction.

synctext commented 2 years ago
awrgold commented 2 years ago

I get why you want to use the Bad Panda Records music, but if I'm going to get a proper dataset (10k items or more), that's going to be a pretty big process of extracting the metadata from each of the files and then building and cleaning the dataset.

Is metadata from copyrighted music really a big deal? Especially since the dataset has existed for such a long time and is also in the commons?

Also, the Last.fm dataset will have information such as timbre, "danceability," "energy," "instrumentalness," "liveness," "speechiness," and other synthetic attributes, which will help create user profiles far better than just song metadata such as artist, album, key, etc.

awrgold commented 2 years ago

Alright, so here's where I stand currently. I'm diving into the world of (online) unsupervised ranking and learning in distributed systems. Examples include (many of which you linked):

Why is this an unsupervised task? Well, there is no labeled training set. We're trying to learn about the data available to us as it becomes available, starting from zero. This is also an online learning problem because of course the model needs to update as it sees new data.

There's also possibly an emphasis to be made on user privacy, but I don't think that it's worth diving down that particular rabbit hole.

At this point, I'd like to start working with a very small dataset as we discussed, where a small (random) subset of the larger dataset is made available to a new network participant. We'll assume in this model that all nodes in the network are trustworthy bootstrap nodes. They'll provide some information (the subset) about what they listen to, and this will be the start of the model.

A user that joins and searches for a keyword that does not exist in this subset will then have their query propagated throughout the network via gossip until it finds some kind of match. Each datapoint in the subset will be ranked according to some form of relevance score. How this relevance score is computed is something I'm still looking into. Most of what I find is for supervised learning, e.g. pointwise, pairwise, or listwise. However, traditionally for each of these methods you need a training set of documents (songs) that are correctly labeled, and then you predict on new unseen data.

But since this is a p2p network, our starting point is an empty set + whatever we're given via the bootstrap process.

Therefore, learning to rank here is going to be purely based on the values of each feature vector, plus whatever sort of string match we make with each feature itself. Over time, as the network size grows, so too will the subset to learn from.

I was therefore thinking that some of the metrics we could use to measure performance go beyond just accuracy or similarity: we can look at message size, round-trip time (considering that queries may need to be propagated to remote parts of the network), and so forth.

Anyways, my thinking is that the next step is to use a single node with a "simulated" bootstrap process (e.g. it is given a tiny subset of the data, as we discussed, 10-25 data points) and do some ranking on some common yet arbitrary search terms. Given a set of queries Q = {q1, ..., qn} and a list of documents (songs) D = {d1, ..., dm}, rank D for each qi.

Afterwards, that's when the learning protocol comes into play. I still think that there's a novel architecture to be made here beyond collaborative filtering, but maybe one of the main focuses could be a traditional distributed collaborative filtering model vs. a novel architecture vs. a baseline model (simple gradient descent).

The problem I foresee is that in a p2p network, if we're performing gossip learning while still (ostensibly) trying to maximize privacy, we don't want a node to know which peers are returning the best results. Routing query hits recursively back to the original sender is what Gnutella uses, and I need to learn more about what other protocols use, but either way I am unsure how to establish some kind of "link" between two nodes that have similar music tastes while ensuring that neither user can learn the specific identity of the other.

This is where I'm at currently. Still reading some more on the topic, will update soon.

synctext commented 2 years ago
| User | Search query | Click position | Track name | Magnet |
|------|--------------|----------------|------------|--------|
| 1 | smooth jazz | 3 | Marvu – Underground Jazz [2010] [EP] | magnet:?xt=urn:btih:77efa41fc2eb77996b2fd2d2e7ea65b8db7f915c&dn=%5BFT005%5D+-+Marvu+-+Underground+Jazz+-+2010+-+Fantomton+-+mp3&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 2 | smooth jazz | 1 | Is World – Turning [2010] [EP] | magnet:?xt=urn:btih:593643871726261298e4806a64f6a3be4ecaa3b3&dn=Is+World-+Turning+EP+MP3+vO&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 2 | stars collide | 1 | ...And Stars Collide – ...And Stars Collide [2008] [EP] | magnet:?xt=urn:btih:077293393f826c6bc1ccf8ff5377e325a88fbf77&dn=And_Stars_Collide-And_Stars_Collide-(EP)-2008-FNT&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 2 | jazz | 8 | Sadsic – [sun] [2009] [EP] | magnet:?xt=urn:btih:ad04b4cab2d25322b55f941ad64d18f1ae28138e&dn=Sadsic+-+%5Bsun%5D&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 3 | jazz | 11 | Decks – Yr Sucha Deck [2011] [EP] | magnet:?xt=urn:btih:9d90d01ae957e1ca4aaab7339d380e6d3750bfaf&dn=Decks+-+Yr+Sucha+Deck+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 3 | relaxing jazz | 8 | Seatraffic – Seatraffic [2011] [EP] | magnet:?xt=urn:btih:c7aa866a3ab0c45535392bd4e8cf0564590e7139&dn=Seatraffic+(MP3+320)&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 3 | sleepy jazz | 1 | Awake in Sleep – Awake in Sleep [2011] [EP] | magnet:?xt=urn:btih:9a5547367153e156c6a57344b829d0411fc5a073&dn=Awake+in+Sleep+-+Awake+in+Sleep+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |
| 3 | lively music | 5 | Howth – Belly of the Beast [2011] [Single] | magnet:?xt=urn:btih:64eb3dacb580fb99a51ad8e11a6963d3544f0efd&dn=Howth+-+Belly+of+the+Beast+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710 |

awrgold commented 2 years ago


Manual 10-item dataset

awrgold commented 2 years ago

Gossip Learning Distributed Data.pdf

devos50 commented 2 years ago

Possibly helpful for this line of work: Anime Recommendation Engine notebooks + a blog post.

synctext commented 2 years ago

thnx, nice related work! Confirms that this is totally unexplored space, especially for web3. Their focus: a centralised context.

awrgold commented 2 years ago

Alright, so here's my update. I've put on a helmet as I expect some (virtual) items to be thrown at me.

By far my biggest hurdle right now is data structures in Kotlin. A dataset containing not just song information (title, artist, etc.) but also things like play count, and particularly clicklog data, requires an N-dimensional data structure of mixed data types. I ideally want to use this for experiments and examples, as I am not looking to write complex classes and loops that take in various individual data structures and "append" them together. At least not yet.

Therefore, I need an N-dimensional array of columns where each column is (potentially) a different data type, and each row contains multiple data types, e.g. a DataFrame as we're used to working with in Python. This will help me with small examples to get the ball rolling on this project.

There are numerous options for this in Kotlin/Java, such as krangl, Kotlin DataFrame, and Tablesaw. Okay, so I can potentially use these, at least for now.

However, being unfamiliar with Kotlin, I'm moving slowly, to the point where I'm writing code in Python first (I know objects are about to be thrown at me) so that I can at least visualize what I need to write in Kotlin. The problem is that I'm missing the lambda expressions I'm used to in Python, so I'm learning on the fly how to do things in Kotlin.

Right now, here is what I have:

I have a "network" of N nodes, all of which contain random subsets of a "global" dataset. They are able to "communicate" with each other by gossiping with nodes they're already familiar with (ignoring all network topology, I'm pretending the network layer exists at the moment).

I introduce new nodes to be bootstrapped into the network by a random node (for now, all existing nodes are capable of bootstrapping a new node), where the bootstrap process provides the new node the clicklog belonging to the bootstrapping node. This will allow the new node to search for new items from a (very small) dataset.

What I am working on is a basic search algorithm where a node first searches its own local dataset/clicklog to rank results, and then broadcasts the search to the entire network, returning all (yes, all) results that contain a string match for the search query. All results received are appended into a temporary data structure that will then be used for basic ranking. Once the results are ranked, and a user "chooses" (an arbitrary function right now) a result, the clicklog is updated.

I am currently working on what "updating" means. For now, I'm leaning towards appending this data to the existing clicklog, incrementing the clicklog for each particular item (if necessary). As the network grows large, I am concerned that storing all search results will grow too large, but that's not a problem I'm focusing on now.
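
For concreteness, here is a stripped-down Python sketch of that loop; all names are hypothetical and the networking is faked, exactly as in the simulation described above:

```python
from collections import defaultdict

class Node:
    def __init__(self, catalog_subset):
        self.catalog = catalog_subset         # local (song_id, title) pairs
        self.clicklog = defaultdict(int)      # (query, song_id) -> click count

    def local_search(self, query):
        hits = [(sid, t) for sid, t in self.catalog if query.lower() in t.lower()]
        return sorted(hits, key=lambda h: -self.clicklog[(query, h[0])])

def network_search(nodes, query):
    # "Broadcast" to every node and merge all string-matching results.
    return list({hit for n in nodes for hit in n.local_search(query)})

def record_click(node, query, song_id):
    node.clicklog[(query, song_id)] += 1      # the "updating" step
```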

Some things I'm thinking about:

In particular, you have an inherent trade-off: either a node stores as much information as it can learn about the network, to perform as much computation as possible locally (more data storage, faster learning/search time), or it relies on message-passing to traverse the network and build a model by averaging (or something else) over time, in which case nodes communicate with each other more frequently (more time, less space overhead).

Anyways, this is mostly where I'm at and I'm feeling okay, besides the numerous holes in my office wall from banging my head due to learning a new language on the fly. I guess there's a reason why people prefer python for ML...

synctext commented 2 years ago

Epic milestone! A rough first working(??) prototype for further refinement and learnings. What is the ML like? Just exchange 1 clicklog message of 5KByte every second with random other peer?

awrgold commented 2 years ago

There is no ML, right now I'm just focusing on returning results with exact string matches, sorted by the number of clicks.

As for propagating the clicklog, I will focus on that later. Having serious data structure issues: Kotlin is not a language built for robust data science, and I'm facing the possibility of having to design my own system, which I really don't want to do.

synctext commented 2 years ago

Current status: toy example in Notebook is up and running. Please remove all this complexity for the next meeting. As simple as possible is a scientific goal: 1-dimensional, clean, and simple. 5 fields; no album, no genre, no play_count, no click_count, no artistID, no artist.

Most simple unsupervised machine learning, where each node simply has a duplicate of the complete clicklog. Let the next master students and experienced engineers worry about the perfect and real-time gossip layer; out of scope for now. Just assume everything is everywhere, all at once. However, it grows continuously in real-time. A single table, in which every node knows its own "user ID". This toy example data then contains signal beyond string matching, like best matches for the user query "soft elevator music". Another bunch of master students can worry about adversarial machine learning: a whole industry exists to trick search engines into higher rankings using fraud, deception, and fake content.

Next step to work towards, and the scientific contribution for a master thesis figure: bootstrapping your clicklog from 0 entries towards 254 entries (an embarrassingly low number of items --- no need to go bigger than the free database of the other master student).

| User | Search query | Click position | Track name | Magnet |
|------|--------------|----------------|------------|--------|
| 1 | smooth jazz | 3 | Marvu – Underground Jazz [2010] [EP] | magnet:?xt=urn:77efa41fc2eb77996b2fd2d2e7ea65&dn=%5BFT005%5D+-+Marvu+-+Underground+Jazz |

Removed Tribler code had this simple approach and was deployed (it failed due to technical debt in Tribler itself): https://github.com/Tribler/tribler/pull/1227/files#diff-1f54650a282a51132d456befe211ee30ffa50fe7382401bd4107a5ab9116bc74

```sql
CREATE TABLE MyPreference (
  torrent_id     integer PRIMARY KEY NOT NULL,
  destination_path text NOT NULL,
  progress       numeric,
  creation_time  integer NOT NULL,
  -- V2: Patch for BuddyCast 4
  click_position INTEGER DEFAULT -1,
  reranking_strategy INTEGER DEFAULT -1
);

CREATE TABLE UserEventLog (
  timestamp      numeric,
  type           integer,
  message        text
);
```

Please put your code in a public repo. The unique scientific contribution is self-organising intelligence. With a leaderless bunch of smartphones and simplistic gradient descent we can already claim self-* AI. Thus the scientific core of the thesis, and novel. Academic proof-of-principle. As said before, there are hard requirements for your master thesis: it has to run on Android. Please use Python for bootstrapping only.

Andrew idea: Use Gibbs sampling within this master thesis to obtain the required scientific depth and wall-of-math (sarcasm)

Btw, clicklog data is notoriously hard to get. The controversial AOL dataset includes {AnonID, Query, QueryTime, ItemRank, ClickURL}: ~20M web queries collected from ~650k users over three months. That is the problem with research ClickLog datasets: in a front-page article, New York Times reporters tracked down a specific AOL user within the released ClickLog, and a manager was fired over it. So, no datasets for science. As scientists we need to deploy privacy-first ML and study it in-situ. That slows science down; this thesis produces the first Web3-based search engine which respects privacy and avoids fake decentralisation. US ISPs sell ClickLogs for $5/month/user.

awrgold commented 2 years ago

https://github.com/awrgold/distributed-learn-to-rank-v1/blob/main/Building%20in%20Python.ipynb

devos50 commented 2 years ago

Since I recently also got interested in this topic and worked with decentralized Federated Learning during my recent research visit, let me give my two cents on this topic.

One difficulty of the intersection between ML and decentralized systems (and maybe IR) is that it encompasses a lot. It’s hard to focus, but at the same time, there is almost no work on getting this up and running in a decentralized network. Probably because there are many unsolved questions, e.g., Sybil attack, message propagation, bandwidth costs, model poisoning, etc. So, any contribution in this area is already novel.

> In particular, you have an inherent trade-off: either a node stores as much information as it can learn about the network, to perform as much computation as possible locally (more data storage, faster learning/search time), or it relies on message-passing to traverse the network and build a model by averaging (or something else) over time, in which case nodes communicate with each other more frequently (more time, less space overhead).

Absolutely, there's a trade-off here, probably between multiple properties (privacy, computation requirements, and bandwidth?). Interesting stuff for future research.

Not sure how good Kotlin is for data science/ML stuff. I have good experience with Python and PyTorch; the latter is pretty straightforward to use. I did work with Deeplearning4J in the past, but I found it very complicated to get up and running. To get some initial results, and for quicker model prototyping, I highly recommend Python.

There's also a difference between recommendation and learn-to-rank I think. At least to me, learn-to-rank seems to be a subset of recommendation methods.

As a generic approach, I recommend first focusing on getting some (simple!) model up and running, fully centralized and with a single node. Then assess the performance of this model and compare it with the reported baseline in the paper that introduces it. Then start to worry about data distribution/network topologies. For every model that I used in a recent paper, I first wrote some scripts to train and tune the parameters. This gave me a decent baseline that I could use to evaluate correctness when decentralizing the training.

Anyway, interested in learning more about this!

awrgold commented 2 years ago

@devos50 Exactly my thoughts as well. The tradeoff you described is very real, and which side I focus on is something I'm considering heavily right now.

Additionally, I'm looking specifically at item-item collaborative filtering as it eliminates the need for user matrices to be passed along as messages, meaning it has a bit more inherent privacy involved (of course messages can be encrypted/decrypted but malicious users can still infer information about users if they're valid recipients of messages).

When it comes to the distributed networking aspect, I've taken all of the blockchain and distributed-systems courses offered in the master's program, but I'm still a data scientist by training, so my specialty is more on the algorithms and statistics. As such I am making a lot of underlying assumptions regarding the networking layer, at least for now.

With respect to L2R, you are correct that it's somewhat of a subset of recommendation, and recommender systems can be supervised, unsupervised, or semi-supervised in nature, which does indeed put them in a class of their own. A model that I'm chewing on right now ranks purely by predicted relevance to the search term, in descending order. That alone can be a robust model; however, there's a major gulf to be bridged, as you mentioned: the amount of data stored locally versus globally.

At this point I'm assuming universal global knowledge of the network topology as well as a global clicklog, but later iterations will likely have fragmented datasets and will rely on averaging the model via message passing (e.g. this model here).
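
As a baseline for the item-item side, here is a rough numpy sketch of cosine similarity over a user-song click matrix. The matrix is random stand-in data; the real version would come from the global clicklog assumed above:

```python
import numpy as np

# Stand-in click matrix: rows = users, cols = songs, entries = click counts.
R = np.random.poisson(0.3, size=(100, 250)).astype(float)

# Column-normalize, then S[i, j] = cosine similarity between songs i and j.
norms = np.linalg.norm(R, axis=0, keepdims=True) + 1e-9
S = (R / norms).T @ (R / norms)

def most_similar(song_id, k=5):
    """Top-k songs most similar to the one just clicked."""
    order = np.argsort(-S[song_id])
    return [j for j in order if j != song_id][:k]
```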

awrgold commented 2 years ago

Progress update:

Thoughts:

I'm working on the second point, but need to understand the problem more. This last week was hectic, as it was the last week of the sports season before the summer, so I was quite preoccupied. This is now my primary focus: expanding the ranking mechanism based on both simple logic and established literature.

synctext commented 2 years ago
awrgold commented 2 years ago

Some of the major papers I'm looking at (to be updated later):

When it comes to a keyword search for a specific term that has:

the first thing to do is return a (sub)set of songs, and whichever gets clicked on becomes the #1 result associated with that term. Call it a "random walk"? Over time this term will (ostensibly) become populated with more and more results.

As each item gets clicked, the search terms that led to the specific item being clicked become more and more "similar".

Collaborative filtering based on each user's clicklog then leads to learning to rank from previously unseen search terms.
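
To make the "search terms become similar" idea concrete, a rough sketch (hypothetical clicklog format) of scoring query similarity through shared clicked items:

```python
from collections import defaultdict

# Queries are "similar" when they led to clicks on the same songs.
# clicklog: hypothetical iterable of (user, query, song_id) click events.
def query_similarity(clicklog):
    clicked = defaultdict(set)                 # query -> set of clicked song_ids
    for _user, query, song_id in clicklog:
        clicked[query].add(song_id)
    sims, queries = {}, list(clicked)
    for i, q1 in enumerate(queries):
        for q2 in queries[i + 1:]:
            overlap = clicked[q1] & clicked[q2]
            union = clicked[q1] | clicked[q2]
            sims[(q1, q2)] = len(overlap) / len(union)   # Jaccard similarity
    return sims
```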

synctext commented 2 years ago

Please keep going on your own path. {note keeping} Prior running ML recommendation code from Delft that now has some overlap with your learn-to-rank work. code {more note keeping, privacy in gossip:} No worries if the gossip leaks private info in this generation of learn-to-rank. We have crypto party tricks ready for the next master student to jump on. Privacy for keyword searches in encrypted domain: "Forward and Backward Private Searchable Encryption from Constrained Cryptographic Primitives". Famous last words: just make it work; we will fix security and privacy later!

synctext commented 1 year ago

Somewhat related work from last month: AI people trying silly key exchange... double-layered crypto... NttpFL: Privacy-Preserving oriented No Trusted Third Party Federated Learning System based on Blockchain, https://ieeexplore.ieee.org/document/9802707

awrgold commented 1 year ago

Been ill the last week or so, just started recovering yesterday. Right now I have this:

A simulated network with global knowledge - all users know all users, and each user has 100% of the song database available to them at all times. The traditional version of collaborative filtering uses user ratings of some kind. There are many ways to do this, whether binary (like/dislike), ternary (0/1/2), or a rating system (say 1-5 stars). I am using a ternary rating system at the moment:

0 = never searched for
1 = appeared in a search
2 = clicked on in a search

and from here I can create a comprehensive rating system for each possible result. However, when somebody searches for a term, I'm still looking for exact string matches sorted by the highest relevance rankings from this rating system. This isn't ideal, and my main focus right now is figuring out a better way to rank beyond a binary "is string match? -> sort results descending".
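
One way to materialize that ternary scheme, sketched with hypothetical event tuples:

```python
import numpy as np

# 0 = never surfaced for this user, 1 = appeared in a search, 2 = clicked.
# events: hypothetical (user, song, clicked) tuples from the simulated searches.
def build_ratings(events, n_users, n_songs):
    R = np.zeros((n_users, n_songs), dtype=np.int8)
    for user, song, clicked in events:
        # max() keeps a click (2) from being downgraded by a later impression (1)
        R[user, song] = max(R[user, song], 2 if clicked else 1)
    return R
```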

awrgold commented 1 year ago

Alright, so right now I'm reaching a bit of a complexity barrier: if I have a global dataset of ~250 songs with individual/global clicklogs, and I'm using matrix factorization to determine the similarity between songs that appear in the search so that I can rank them by similarity, the number of song pairs blows up combinatorially. I can of course constrain this to 5 or 10 search results to reduce complexity, but then I'm thinking I might need something like TF-IDF or BERT to go beyond just string matching of search terms to song attributes, so that it doesn't return ALL the songs that have common words like "the" in the title. Otherwise I can just manually ignore common "stop words" for now, which will definitely save me time but is not "robust."
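
A sketch of the TF-IDF option (sklearn; toy titles, not the real dataset). Stop words like "the" get near-zero weight automatically, so they stop dominating the candidate set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["The Quiet Jazz Hour", "Underground Jazz", "Turning", "Belly of the Beast"]

vec = TfidfVectorizer(stop_words="english")   # built-in English stop-word list
doc_matrix = vec.fit_transform(titles)

def search(query, k=3):
    q = vec.transform([query])
    scores = cosine_similarity(q, doc_matrix).ravel()
    return sorted(zip(scores, titles), reverse=True)[:k]

print(search("the jazz"))   # "the" is ignored; the jazz titles rank on top
```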

synctext commented 1 year ago

Cool! So you have your 2 == clicked-on-in-search. Please document your data structure and algorithm in master thesis format (1-2 pages). Are you heading towards a minimal clicklog of [search-term, user, torrent] as the data structure? Please only use that for your ML. Also, avoid any string matching for what the user clicks on; use a 'magic' user preference function. Seeing beyond string matching is what makes it so awesome.

synctext commented 1 year ago

Btw, you should keep doing just a clicklog. You have now also added a not-clicked log. That is just a distraction; performance tuning like that is for a later master thesis student.

awrgold commented 1 year ago

The challenge is this, though: using the clicklog dataset with 250 songs, I need to determine some form of similarity between each pair of items. Item-item collaborative filtering requires some form of scoring/rating function per user; that is, I need to know what each user thinks of each song in order to assign it a score value. Once I have this for each song pair for each node, I have a massive number of song pairs.

What I was using string matching for was essentially to prune the dataset down dramatically, e.g. if they search for a term and a song/artist/album/genre does not contain this term, it is not considered for ranking. This makes the problem significantly more tractable. Without it, I need to somehow return a pruned list of results, or every search will create truly significant computational overhead. The two-stage flow I have in mind is sketched below.
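
A sketch of that two-stage flow: cheap string-match pruning for candidate generation, then the expensive item-item similarity ranking only over the survivors (all shapes hypothetical):

```python
def search_and_rank(query, songs, S, clicklog, k=10):
    """songs: (song_id, metadata_text) pairs; S: item-item similarity matrix;
    clicklog: (query, song_id) click events. All formats hypothetical."""
    # Stage 1: prune to candidates whose metadata contains the search term.
    candidates = [sid for sid, text in songs if query.lower() in text.lower()]
    # Stage 2: rank candidates by similarity to songs this query already led to.
    seeds = [sid for q, sid in clicklog if q == query]
    def score(sid):
        return max((S[sid][s] for s in seeds), default=0.0)
    return sorted(candidates, key=score, reverse=True)[:k]
```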