literature survey + master thesis: G-Rank learn-to-rank

synctext commented 4 years ago

Direction changed, txt will be updated soon.

Old stuff:

synctext commented 2 years ago

Discussed Twitter & trading. Tip: https://www.helium.com/ Please use the simple story outline in notes above. Linkage to science such as principle agent problem and the term trustless is much appreciated. Next week: overleaf commenting.

synctext commented 2 years ago

If 1 page takes over a day: you're doing it wrong :hammer: :boom: :smile_cat:
Little bit of practical stuff: screenshots of client, protocol specs, etc.
You are not graduating within economics, no pseudo-fancy impact models.
Table of upcoming projects, effort level, desired outcome.
General stuff on attacks (in scope??):
- https://medium.com/block-science/operationalizing-the-gitcoindao-anti-sybil-process-7f2595544f44
- A Stealthier Partitioning Attack against Bitcoin Peer-to-Peer Network
- On the Routing-Aware Peering against Network-Eclipse Attacks in Bitcoin
- 12 Jan post - example https://github.com/Tribler/tribler/files/7667698/74407305-Open2edit-Poster.pdf
- keep it simple
- Who is going to win? CBDC, Web3, Big Tech, Boomer Banks.

awrgold commented 2 years ago

Deleted my last comment as it sounded a bit whiney. Here's some thoughts, since we missed each other today:

I'm increasingly impressed with OlympusDAO's governance such that they continue to pursue further decentralization while demonstrating a pretty robust governance structure. However this doesn't mean I don't think it's still susceptible to undermining, hostile takeovers, and the like. While doing some background reading I came several times across the concept of the "Iron Law of Oligarchy," a thesis posited by a late 19th/early 20th century philosopher who claimed that all organizations, businesses, and governments tend towards oligarchy because those who (1) spend disproportionate amounts of time and effort in governing these entities, and (2) possess the financial/social means to do (1), end up conglomerating power and influence.

Then I started diving into everything from Machiavelli, Chomsky, Lenin, Marx, Rousseau, and the like. I ended up spending too much time reading about political philosophy, both supporting everything from direct democracy to advocating for pure elitism. The reason why I did this is because these philosophical discussions were actually very prevalent in both the background readings I was doing, as well as the discussions taking place in the DAO governance channels. Many members wax philosophic about their governance, and it seems to be a hot-button issue within at least several DAOs.

However, at this point my main struggle is thus: I can easily discuss the shortcomings of many DAOs and the various means with which they can stabilize themselves, but to what depth do I include the philosophical? I can already hear you saying "none," but that doesn't stop me from being fascinated by it. Secondly, when it comes to some of the stuff you mention in the above comment such as "general stuff on attacks," I cover things such as the Dark Forest + advanced predators, oligarchical tendencies, hostile takeovers, and other forms of undermining DAOs.

What I'm unsure about is, to what degree should I prescribe changes versus just discussing the weaknesses and leaving it open-ended? I'm getting to the point where I'm reaching Semantic Satiation and the more I read what I've written, the less sense it starts to make to me.

I also started playing with the idea of a "social proof of work" concept, a rudimentary attempt at meritocracy, where DAO participants are rewarded for doing some kind of work. Even though this would be difficult/impossible to qualify in a human, is there some kind of way to create an organizational structure that defends against the current weaknesses of DAOs while still rewarding those who put the most effort into the organization?

synctext commented 2 years ago

Great insight in your progress and thinking: please stop reading more. My advice: don't stop writing, stop reading. Keep it simple, you just need to scratch the surface. Again, try to aim at the high level view. Like, computer science leaning nerds are rediscovering the world of power and politics. Without knowledge of the deep roots going back to 1833 they ignore the lessons of historical failure and start again from a fresh DAO perspective: https://en.m.wikipedia.org/wiki/Time-based_currency 😁

:rofl: :roll_eyes: :rofl: don't talk about the election details!!! Asking technical questions will get you banned on some DAO's.

Comments on your draft report (overleaf)

zero scientific articles cited!
No more additional topics or directions please; just make it readable now & tech grounded.
wall-of-text format, more blog post, white paper style. Please make a boring, grounded, well-defined, thus polished article.
- "OlympusDAO has acquired a massive war chest of roughly $700 million", no proof to back up this blog-style claim.
- no linkage of your descriptive work to running code, scientific articles, white paper quotes, tech documentation, actual Solidity, or architectural docs.
- talk about the facts
other analysis by others of DAOs [REFS]
"OlympusDAO’s meteoric success in bootstrapping enormous amounts of liquidity has shifted the paradigm of decentralized finance away from the highly unsustainable practice of renting liquidity in exchange for rapidly inflating reward tokens" make it scientific with statistics (market cap graphs), references, or other evidence
Gnosis safe, no technical depth. Only money&power talk.

synctext commented 2 years ago

Epic read by BIS: https://www.bis.org/publ/qtrpdf/r_qt2112b.htm

awrgold commented 2 years ago

Interesting article, they are right about the illusion of decentralization, but then they go on to say "DeFi needs to integrate regulatory agencies in order to work," but the whole purpose of DeFi is permissionless anonymous finance, so I think the authors just miss the whole point - it's not about decentralization, it's anti-establishmentarian and anti-taxation. Not picking a side on this one, just saying that any DeFi project that abides by KYC/AML is probably doomed to either failure or at least being isolated from the rest of the market.

Although it's kind of ironic since anyone who cashes out crypto in a major country is required to KYC to convert to fiat, so they essentially create a "soft KYC" system where they're tangientially KYC'd at the point of entry/exit, but not at the application level. But if a DEX or DeFi protocol required users to KYC in order to participate, I'd bet the house that it would be laughed off the stage.

Good references and charts to use in the survey though.

synctext commented 2 years ago

Thesis title and focus? "Zero taxation DAO technology"?? Digitize this and make a single-click tax evasion experience, see Australia Professor work in this episode: https://www.npr.org/sections/money/2016/03/16/470722656/episode-390-we-set-up-an-offshore-company-in-a-tax-haven plus https://www.nytimes.com/2012/07/29/magazine/my-big-fat-belizean-singaporean-bank-account.html?_r=1&pagewanted=all Real academic source by Cambridge Professor Jason Sharman book is available: https://doi.org/10.1017/CBO9781107337848 Article with https://www.journalofdemocracy.org/articles/the-rise-of-kleptocracy-laundering-cash-whitewashing-reputations/ this abstract:

how globalization enables grand corruption, as well as the laundering of kleptocrats’ finances and reputations.
Shell companies and new forms of international investment, such as luxury real-estate purchases, serve to launder
the ill-gotten gains of kleptocrats and disimbed them from their country of origin. Critically, this normalization
of “everyday kleptocracy” depends heavily on transnational professional intermediaries: Western public-relations
agents, lobbyists and lawyers help to recast kleptocrats as internationally respected businesspeople and
philanthropic cosmopolitans.

synctext commented 2 years ago

brainstorm. Do a 2 week sprint. The goal is 5 scientifically publishable graphs for Thesis/Arxiv. Strategy: minimal sustainable DAO. Avoid the useless generic framework trap. Go for the "wiki" route, extendable, open ended and flexible. Specific, simple, and actionable that works. So focus on the actual usability in some Bitcoin-based scenarios such as: weekly payment for code, documentation, videos, blog posts, labour, or content in general. Weekly payment is meticulously slected to avoid the withholding of payment problem and escrow issues. We assume a complete lack of trust and simply assume parties have an incentive to keep the weekly product/payment cycle going. Future work obviously to use reputation and CVs. See our prototype of DevID

First sprint goal: Taproot 100 people transaction. State of FROST? https://github.com/Tribler/tribler/issues/5984#issuecomment-913454963 Lots of info

Poster title: "Zero taxation DAO technology"

Future of the firm. Cyber economy is already here, see Big Tech.
anonymous transactions without regulatory enforcement
Shell companies and offshore banking is replaced by crypto
any economic activity is facilitated
unregulatory Bitcoin commerce
Proof-of-principle prototype: force political action
Master thesis goal: actual 1+ Bitcoin of legal economic activity
Delft technology facilitates joint ownership of money by millions of millions.
Taproot technology using Schnorr threshold signatures
Missing piece of puzzle, absent in all related DAO work: Trustchain

synctext commented 2 years ago

Taproot is essential for a DAO that scales, without any dedicated coin for governance. Clear goal for 2 week: no trustchain, no superapp. Just determine Taproot status and if there is running code or features left for implementation.

devos50 commented 2 years ago

https://github.com/ElementsProject/secp256k1-zkp/pull/138 has matured quite a lot. I think it's the best candidate to generate Bitcoin-compatible Schnorr signatures.

synctext commented 2 years ago

Lesson learned: DAO is not yet ready for a pre-production focused master thesis. Too immature.

If we decide to leave the DAO topic for something more machine learning, this is what I found: learning-to-rank {without using any Big Tech server}. Hopefully fits your skills better. This thesis would be the first step to the decentralisation of Google-like search engines. The step from a academic project without any sustained usage, to actual Internet-deployment and sustained usage in the wild. That first step also means a narrow scope for feasibility. Focus is on Creative Commons content in Bittorrent (1 million songs dataset), playlists, crowd-sourced tags, and possible creation of relevance ranking ground truth. solution: federated learning on Android.

Some state-of-the-art:

this field is extremely hot. 14 surveys are mentioned in this 15th survey. survey-of-surveys, never seen anything like it.
In this paper, we propose a Federated OLTR method, called FPDGD, which leverages the state-of-the-art Pairwise Differentiable Gradient Descent (PDGD) and adapts it to the Federated Averaging framework.
How to Put Users in Control of their Data in Federated Top-N Recommendation with Learning to Rank
http://cs.gmu.edu/~carlotta/publications/ICMR14.pdf
https://www.ramb.ethz.ch/CDstore/www2011/proceedings/p277.pdf
https://www.researchgate.net/profile/Eugene-Kharitonov/publication/331655464_Federated_Online_Learning_to_Rank_with_Evolution_Strategies/links/5d2a14b3458515c11c2e1944/Federated-Online-Learning-to-Rank-with-Evolution-Strategies.pdf
https://www.cs.cornell.edu/people/tj/publications/yadav_etal_21a.pdf
https://drive.google.com/file/d/14MoG-C56urc0Hg7yu_r_y8fYEObPjL_m/view
To avoid at all cost: secure multi-party computation type of solutions {high overhead}
instant starting point. running code plus datasets https://github.com/sisinflab/FedBPR

synctext commented 2 years ago

Master thesis kick-off meeting

Re-using the knowledge of master courses: speciality of math and statistics
Regular weekly meetings at Delft lab.
Make it work! its the operational architecture that matters. The primary judgement for this master thesis shall be the Internet-deployed technology, deploy at all {performance} cost.
first step to the decentralisation of Google-like search engines. (keyword search, learn-to-rank)
First running prototype
- keep its simple cmdline in Kotlin, no GUI, No APK, no library, no decentralisation, no big data, no attacks (pollution,Sybils,DoS, false promotion, etc.)
- No magic 300MByte library to do something deep. From scratch, no deeplearning4j.
- (part of the) million song dataset; keep it to 10k item for now
- centralised for now, 1 central device
- clean slate approach
- Create playlists, items, users and/or similarity or something something. 80% testing 20% performance evaluation
- Random output for now, start with this simplistic code and low performance. more code here
- Graph on your thesis ticket!

awrgold commented 2 years ago

Federated ML on Android/edge devices in most cases relies on a centralized server for federated averaging and pushing model updates. In a decentralized network, numerous extra steps are required.

Steps to creating FML on Android:

Create a model (even just basic regression is fine) based on data (X) and target labels (Y)
From this, perform inference and (re)train the model (iteratively regress or whatever) and take a "snapshot" of the current state of the model: the network needs to know how to recreate each step of the training process
The past "checkpoint" will be stored on the user's device, even if it's a fresh model ("zero checkpoint")
Extract the (current) weights from the on-device model (for deep learning) or the step-by-step updates to the model from the snapshot and push to the ~~server~~ decentralized network
Ask the ~~server~~ network to perform federated averaging across each device's results
Push the results to the devices, which queue immediately after the current checkpoint. Synchronizing these in an asynchronous network is challenging.

Challenges:

Security: if any device can start pushing model checkpoints to the network, the entire model can be trivially undermined
Model consensus
Malleable encryption or do we not care about that?

synctext commented 2 years ago

No checkpoints please, no deeplearning4j, no deep learning at all, no GUI, No APK, no library, no decentralisation, no big data, no attacks (pollution,Sybils,DoS, false promotion, etc.)
Single variable gradient descent linear regression in Kotlin
Opportunity for your Web3 search engine. Has Google search become quantitatively worse?
https://github.com/thomasnield/kotlin_linear_regression/blob/master/src/main/kotlin/Model.kt#L239
dataset focus for this sprint. Repeating: Create playlists, items, users and/or similarity or something something. 80% testing 20% performance evaluation
Graph on your thesis ticket!

synctext commented 2 years ago

Recent related work: https://arxiv.org/pdf/2105.04761.pdf "Federated Unbiased Learning to Rank", by Apple Inc.

awrgold commented 2 years ago

So this million-song dataset is, uh... difficult to utilize. It's over 10 years old in an hdf5 format that requires an older version of Java, but when I try to create a Kotlin project with an appropriate JDK version I cannot build the project. I've looked at various solutions posted online with no success. The million-song dataset website has a "help" link for Java that leads to a 404 error: https://support.hdfgroup.org/products/java/hdf-java-html/

Stack Overflow has some thoughts, but none work for me at this moment: https://stackoverflow.com/questions/36385398/java-hdf5-library-install

A github repository converts it to CSV format via python: https://github.com/AGeoCoder/Million-Song-Dataset-HDF5-to-CSV

However, openCSV (a java library) hasn't been working for me. A solution posted here is my primary reference, but I'm still getting errors reading the file: https://stackoverflow.com/questions/44061143/read-csv-line-by-line-in-kotlin

At this point I'm going to be making some progress in Python, and we can discuss how to port whatever it is that I do in Python back into Kotlin. There are some open-source solutions for running Python safely and robustly in an Android environment, such as BeeWare. I know this isn't a long-term solution but right now I just need to make some kind of progress, and can worry about making it work on Android in the future.

Edit: I'll start with the FMA user metadata so I can get a proper model up and running ASAP, and then shift towards the FML side of things next.

synctext commented 2 years ago

ToDo:
- post first Kotlin ML code from repo
- playlist dataset shortcut
- first music recommendation
- Graph on your thesis ticket!

"An error occurred accepting the competition rules."

http://archive.ics.uci.edu/ml/datasets/FMA%3A+A+Dataset+For+Music+Analysis goes: http://archive.ics.uci.edu/ml/machine-learning-databases/00386/fma.txt points: https://github.com/mdeff/fma finally: https://os.unil.cloud.switch.ch/fma/fma_full.zip ```Creative Commons 106,574 tracks from 16,341 artists``` Social data: http://millionsongdataset.com/sites/default/files/thisismyjam/jam_to_msd.tsv http://millionsongdataset.com/thisismyjam/ Conclusions: you can spend 90% of your AI time on data engineering! Solution: shortcut for at least end of May'22.

awrgold commented 2 years ago

Apparently the Million Song Dataset was one of the first Kaggle challenges 10 years ago and the highest f-score anyone ever achieved was 0.17...

synctext commented 2 years ago

Related work, please study and explain to me: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf

awrgold commented 2 years ago

Dataset Engineering.zip

Basic KMeans on engineered user data - results are crap but it does do clustering.

synctext commented 2 years ago

Starting project, simple gradient descent without any library bloat. Evolve further: Learning to Rank using Gradient Descent. Most simple possible machine learning and most simple distributed systems algorithms: 1 packet per second of 1000Bytes outgoing as an update + parse 1 incoming packet per second. Key point is that exact filename keyword matching does not work nicely. Critical focus on distributed architecture and Internet-deployed federated learning-to-rank (self-organising). Focus is keyword search: `term` matching a set of known `objects`. (idea: tags are nice crowdsourced substitute) Instead of a dataset, use a 10 item manual testset. Example of dataset using Clicklog approach: User	Term	Object	Rank when clicked
4388	Love song	My personal written declaration of love	7
4388	Soft Rock	quiet background passion in C	11
4389	Happy Hardcore	...

Use clicklog to gather from live network with Creative Commons torrents ??:

awrgold commented 2 years ago

Part 1:

Here's what I understood about the problem so far (before our Friday meeting):

We have a music streaming app. I know it's torrent-based and decentralized, but beyond this it was my current understanding that the consensus and networking layers don't matter, and that I should assume that I have full access to the entirety of the data required to build models.
- However, this makes a broad assumption that (to my knowledge) removes the entire concept of "federated" machine learning from the problem. Without this, I essentially am writing a machine learning algorithm for a singular device.
  - For example, if User A searches for Song X, I want to have [machine learning model] return the most relevant results. However, in order to do this I need to know what other users are doing. Am I just assuming that the model is able to access this information, and abstract the networking, gossip, and consensus mechanism out of the equation? It sounded like the answer was "yes" in which case I can just implement this in a few weeks and call it a day.
Whenever a user searches for content on this application, they need to be shown results. This is the crux of my thesis. When users search for something, I don't just want to return exact matches: I want to return similar content - not just in terms of string matching, but in terms of content.
- If a user searches for Song X, I want to be able to do the following (many of these are data engineering problems - the algorithm can do these things given the proper data):
  - Return Song X
  - Return songs similar to Song X. Similarity is determined by artistry and usership, not by analyzing the song itself.
  - If Song X is of Genre A, I return songs from Genre A that are similar to Song X
    - Songs in Genre A are determined to be similar to Song X based on things such as:
      - Being in a playlist with Song X
      - Made by the same artist as Song X
      - Made by artists who have collaborated with the artist of Song X
      - etc.
  - Return string matches to Song X (such as covers, remixes, etc)
- If a user searches for Artist X, I want to be able to do the following:
  - Return the artist page for Artist X
  - Return the most popular songs from Artist X
  - Return results similar to Artist X, with conditions similar to the ones listed for Song X above (remixes, collaborations, covers, etc)
- If a user searches for Term T but there are no exact matches, I want to be able to do the following:
  - Return results similar to Term T
    - Similarity, here, is more difficult to define. It can be things that other users listen to that have also searched for Term T, similar strings to Term T, etc.
    - This is the most difficult to analyze. I'm open to ideas/discussion about this.

Part 2:

This last point regarding searching leads me to the next point on my mind: Is the focus here recommending users new content, or am I focusing primarily on serving up search results? What we discussed a few months ago when we pivoted, it sounded like a recommender system was what we wanted: Users interact with the app and are recommended new content based on their usage and the usage of other similar users (i.e. Collaborative Filtering).

In a distributed context, collaborative filtering is especially interesting as a topic because it naturally deals with sparse matrices - not only does it continue to work if I only have access to a subset of the data, the manner in which the global dataset is somewhat arbitrary - as long as users occasionally connect to the network and gossip with each other, the datasets will eventually update and models across devices will eventually converge. This model not only allows users to be recommended new content, it allows for a model to not be dependent upon a full dataset - it only gets more powerful the more devices are connected to the network.

However, our most recent conversation discussed Learning to Rank - a search algorithm that can indeed be used as a recommender system, but is thoroughly well-researched. This most recent meeting flipped my understanding of the entire project on its head. If I'm assuming I have access to data and I'm not focusing on consensus, gossip, or networking problems, am I just implementing a 20 year old algorithm and then calling that a thesis? I'm less concerned about the type of algorithm, and more concerned about what the goal is (and the context of the goal).

If I'm doing something with Learning to Rank, to what degree does the decentralized network aspect come into play? If I abstract away the distributed aspect of this problem, is my goal just to implement a ranking algorithm and then create an API such that the global model is able to converge across devices? Again, what's the goal here?

Part 3:

You've said multiple times that you don't want me to use "bloated libraries" but many of these machine learning libraries are far more optimized than anything I could write in a thesis. Many are even designed for android. Google even has a plethora of resources devoted specifically towards on-device machine learning, especially for transfer learning - a core tenet of federated machine learning.

However, if I'm just plugging in these algorithms and deploying them to a device, that's not a thesis - it's just a long homework assignment. If I'm writing them from scratch instead of using a library, I'm essentially (badly) redesigning the wheel - equally pointless and boring.

I'm of the opinion that there is zero chance I can improve upon the state of the art for any singular ML algorithm, nor can I improve upon the state of the art for a distributed protocol. The question is, then, what is a master's thesis-worthy problem statement here?

I think that it's important the research question is defined as soon as possible so that I can dissect the problem down into atomic pieces so I can start moving down a specific path instead of just moving in a general direction.

synctext commented 2 years ago

Focus on Learn-to-Rank. Keep it as simple as possible: gradient decent if possible (performance is not a concern)
Dataset is needed of user, keyword-search, object-clicked, rank-when-clicked
- As an alternative for keyword search terms we could use tags. Suggest you use Bad Panda Record, like Brian. They use Creative Commons music, feature magnet links, and rich metadata. Panda is better than tags of copyrighted stuff in LastFM dataset
- with the above data, we are still missing the user-level stats :zap:
Goal is to get it live and federated and every user action improves the live system
- Deployed federated machine learning
- Live system inside deployed MusicDAO
- clicklog statistics are spread and learn-to-rank gets better with usage
- start with gradient decent for the simplicity, ability to adjust, and upgrade.
Possible roadmap
- [ ] first skeleton 100 line dataset
- [ ] first skeleton running code with learn-to-rank: local ranking of search results
- [ ] expand dataset and ?freeze?
- [ ] decentralisation. expand clicklog with federated part; gossip
- [ ] upgrade learn-to-rank sophistication
- [ ] first live system feedback
Basic literature search on "learn-to-rank clicklog"
Exact improvement beyond state-of-the-art options
- just get the architecture deployed is already innovative, never before did we have fully self-organising zero-server federated machine learning for learn-to-rank
- fake clicklog data, attack resilience
- cpu usage focus
- privacy-preservation of clicklog info
Ambition: before December finish master thesis

synctext commented 2 years ago

Learn-to-Rank Grand idea: deploy system which gets better with usage.
Stop with the cheating and shortcuts (e.g. Python, kmeans, jupyter)
Focus: clicklog, gradient decent, and "use a 10 item manual testset." 8 tracks Creative Commons already EDIT: https://github.com/brian2509/pandacd-scrape/blob/master/export/220408/assets/output.csv

awrgold commented 2 years ago

I get why you want to use the Bad Panda Records music, but if I'm going to get a proper dataset (10k items or more) that's going to be a pretty big process of extracting the metadata from each of the files and then building and cleaning the dataset.

Is metadata from copyrighted music really a big deal? Especially since the dataset has existed for such a long time and is also in the commons?

Also, the lastfm dataset will have information such as timbre, "danceability," "energy," instrumentalness, "liveness," "speechyness," synthetic attributes, which will help create user profiles far better than just song metadata such as artist, album, key, etc.

awrgold commented 2 years ago

Alright, so here's where I stand currently. I'm diving into the world of (online) unsupervised ranking and learning in distributed systems. Examples include (many of which you linked):

Why is this an unsupervised task? Well, there is no labeled training set. We're trying to learn about the data available to us as it becomes available, starting from zero. This is also an online learning problem because of course the model needs to update as it sees new data.

There's also possibly an emphasis to be made on user privacy, but I don't think that it's worth diving down that particular rabbit hole.

At this point, I'd like to start working with a very small dataset as we discussed, where a small (random) subset of the larger dataset is made available to a new network participant. We'll assume in this model that all nodes in the network are trustworthy bootstrap nodes. They'll provide some information (the subset) about what they listen to, and this will be the start of the model.

A user that joins and searches for a keyword that does not exist in this subset will then have its query propagated throughout the network via gossip until it finds some kind of match. Each datapoint in the subset will be ranked according to some form of relevance score. How this relevance score is computed is something I'm still looking into. Most of what I find is for supervised learning, e.g. pointwise, pairwise, or list-wise. However, traditionally for each of these methods you need a training set of documents (songs) that are correctly labeled, and then you predict on new unseen data.

But since this is a p2p network, our starting point is an empty set + whatever we're given via the bootstrap process.

Therefore, learning to rank here is going to be purely based on the values of each feature vector, plus whatever sort of string match we make with each feature itself. Over time, as the network size grows, so too will the subset to learn from.

I was therefore thinking that some of the metrics that we could use to measure the performance goes beyond just accuracy or similarity, we can look at message size, round trip time (considering that queries may need to be propagated to remote locations of the network) and so forth.

Anyways, my thinking is that the next step is to use a single node with a "simulated" bootstrap process (e.g. it is given a tiny subset of the data, as we discussed 10-25 data points) and do some ranking on some common yet arbitrary search terms. Given a set of queries Q = {q1, ... , qn} and list of documents (songs ) D = {d1, ... , dn} rank D for each qi.

Afterwards, that's when the learning protocol comes into play. I still think that there's a novel architecture to be made here beyond collaborative filtering, but maybe one of the main focuses could be a traditional distributed collaborative filtering model vs. a novel architecture vs. a baseline model (simple gradient descent).

The problem I foresee is that in a p2p network, if we're performing gossip learning while still (ostensibly) trying to maximize privacy, we don't want a node to know which peers are returning the best results. Routing the query hits recursively back to the original sender is what Gnutella uses, and I need to learn more about what other protocols use, but either way I am unsure if I establish some kind of "link" between two nodes that have similar music tastes while ensuring that there's no way for each user to be aware of the specific identity of the other.

This is where I'm at currently. Still reading some more on the topic, will update soon.

synctext commented 2 years ago

Goal: be the world-first! Nobody has academically pure decentralised federated learning deployed. So without any cheating, no cloud, no coordinator, no rendezvous server, and no central server of any kind. SearchZero
remember we're after the practical approach (we want to connect this to our privacy-enhancing tech later + trustchain based security; still future work; however, no query forwarding)
first production-usage of federated machine learning
what is the ground-truth? (supervised or unsupervised)
- use the signal that people click on the 2nd item consistently, thus clicklog
- You have zero error if everybody clicks the first suggested item
- start from zero almost. online learning. Unseen data is key to handle correctly in-time.
Are we ready to formulate the gradient decent model using clicklog, live system, and training data as input?
Sprint: MvP. first gradient decent with zero starting data, 10-line manual text file and query in cmdline prompt. no networking yet.
EDIT: note we deployed an operational clicklog system and removed it 7 years ago with a privateUserEventLog and click_position. It never worked nicely enough to keep.

synctext commented 2 years ago

cold start problem: there is no click-log data and no user search queries.
Can we have a few lines bootstrap dataset, not MvP style. Toy example only :warning:
Synthetic user 4 issues these queries: "smooth music", "Seatraffic", "lively music" {we need a fake ranking}
sort of a self-play architecture
- one conducts relevance ranking
- another machine learning algorithm issues queries and selects the "best match", gets recorded into clicklog.

user	user search query	click position	Track name	Magnet
1	smooth jazz	3	Marvu – Underground Jazz [2010] [EP]	magnet:?xt=urn:btih:77efa41fc2eb77996b2fd2d2e7ea65b8db7f915c&dn=%5BFT005%5D+-+Marvu+-+Underground+Jazz+-+2010+-+Fantomton+-+mp3&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
2	smooth jazz	1	Is World – Turning [2010] [EP]	magnet:?xt=urn:btih:593643871726261298e4806a64f6a3be4ecaa3b3&dn=Is+World-+Turning+EP+MP3+vO&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
2	stars collide	1	...And Stars Collide – ...And Stars Collide [2008] [EP]	magnet:?xt=urn:btih:077293393f826c6bc1ccf8ff5377e325a88fbf77&dn=And_Stars_Collide-And_Stars_Collide-(EP)-2008-FNT&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
2	jazz	8	Sadsic – [sun] [2009] [EP]	magnet:?xt=urn:btih:ad04b4cab2d25322b55f941ad64d18f1ae28138e&dn=Sadsic+-+%5Bsun%5D&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
3	jazz	11	Decks – Yr Sucha Deck [2011] [EP]	magnet:?xt=urn:btih:9d90d01ae957e1ca4aaab7339d380e6d3750bfaf&dn=Decks+-+Yr+Sucha+Deck+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
3	relaxing jazz	8	Seatraffic – Seatraffic [2011] [EP]	magnet:?xt=urn:btih:c7aa866a3ab0c45535392bd4e8cf0564590e7139&dn=Seatraffic+(MP3+320)&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
3	sleepy jazz	1	Awake in Sleep – Awake in Sleep [2011] [EP]	magnet:?xt=urn:btih:9a5547367153e156c6a57344b829d0411fc5a073&dn=Awake+in+Sleep+-+Awake+in+Sleep+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710
3	lively music	5	Howth – Belly of the Beast [2011] [Single]	magnet:?xt=urn:btih:64eb3dacb580fb99a51ad8e11a6963d3544f0efd&dn=Howth+-+Belly+of+the+Beast+(2011)+320&tr=udp%3A%2F%2Ftracker.pandacd.io%3A2710

awrgold commented 2 years ago

Manual 10-item dataset

awrgold commented 2 years ago

Gossip Learning Distributed Data.pdf

devos50 commented 2 years ago

Possibly helpful for this line of work: Anime Recommendation Engine notebooks + a blog post.

synctext commented 2 years ago

thnx, nice related work! Confirms that this is totally unexplored space, especially for web3. their focus: centralised context.

awrgold commented 2 years ago

Alright, so here's my update. I've put on a helmet as I expect some (virtual) items to be thrown at me.

By far my biggest hurdle right now is data structures in Kotlin. A dataset containing not just information regarding songs information (title, artist, etc) but also things like play count and particularly clicklog data requires a N-dimensional data structure of various data types. I ideally want to use this for experiments and examples, as I am not looking to write complex classes and loops that take in various individual data structures and "append" them together. At least not yet.

Therefore, I need a n-dimensional array of columns where each column is (potentially) a different data type, and each row contains multiple data types, e.g. a DataFrame as we're used to working with in Python. This will help me with small examples to get the ball rolling on this project.

There are numerous options for this in Kotlin/Java, such as krangl, Kotlin DataFrame, and Tablesaw. Okay, so I can potentially use these, at least for now.

However, being unfamiliar with Kotlin I'm moving slowly to the point where I'm writing code in Python first (I know objects are about to be thrown at me) so that I can at least visualize what I need to write in Kotlin. Problem is I'm struggling with lambda expressions that I'm used to in Python, so I'm doing a lot of learning on the fly how to do things in Kotlin.

Right now, here is what I have:

I have a "network" of N nodes, all of which contain random subsets of a "global" dataset. They are able to "communicate" with each other by gossiping with nodes they're already familiar with (ignoring all network topology, I'm pretending the network layer exists at the moment).

I introduce new nodes to be bootstrapped into the network by a random node (for now, all existing nodes are capable of bootstrapping a new node), where the bootstrap process provides the new node the clicklog belonging to the bootstrapping node. This will allow the new node to search for new items from a (very small) dataset.

What I am working on is a basic search algorithm where a node first searches its own local dataset/clicklog to rank results, and then broadcasts the search to the entire network, returning all (yes, all) results that contain a string match for the search query. All results received are appended into a temporary data structure that will then be used for basic ranking. Once the results are ranked, and a user "chooses" (an arbitrary function right now) a result, the clicklog is updated.

I am currently working on what "updating" means. For now, I'm leaning towards appending this data to the existing clicklog, incrementing the clicklog for each particular item (if necessary). As the network grows large, I am concerned that storing all search results will grow too large, but that's not a problem I'm focusing on now.

Some things I'm thinking about:

Messaging (broadcasting/multicasting/gossiping) within the network. Done in epochs? How to define these epochs, if so?
What information is contained in gossip messages?
Sharing clicklog data with other nodes regularly. Good idea?
Reading papers and taking notes, particularly on distributed ML concepts.

In particular, you have an inherent trade-off: either a node stores as much information it learns about the network to perform as much computation locally (more data storage, faster learning/search time) or it relies on message-passing to traverse the network and build a model based on averaging (or something else) over time, in which case nodes communicate with each other more frequently, (more time, less space overhead).

Anyways, this is mostly where I'm at and I'm feeling okay, besides the numerous holes in my office wall from banging my head due to learning a new language on the fly. I guess there's a reason why people prefer python for ML...

synctext commented 2 years ago

Epic milestone! A rough first working(??) prototype for further refinement and learnings. What is the ML like? Just exchange 1 clicklog message of 5KByte every second with random other peer?

awrgold commented 2 years ago

There is no ML, right now I'm just focusing on returning results with exact string matches, sorted by the number of clicks.

As for propagating the clicklog, I will focus on that later. Having serious data structure issues. Kotlin is not a language built for robust data science, and I'm facing the possibility of having to design my own system which I really don't want to do.

synctext commented 2 years ago

Current status: toy example in Notebook is up and running Please remove all this complexity for next meeting. As simple as possible is a scientific goal. 1-dimensional, clean and simple. 5 fields, no album, no genre, no play_count, no click_count, no artistID, no artist.

Most simple unsupervised machine learning, where each node simple has a duplicate of the complete clicklog. Let next master students and experienced engineers worry about the perfect and real-time gossip layer. Out of scope for now. Just assume everything is everywhere, all at once. However, it grows continuously in real-time. Single table, in which every node knows its own "user ID". This toy example data then contains signal beyond string matching, like best matches for user query soft elevator music. Another bunch of master students can worry about adversarial machine learning. A whole industry exists to trick search engines into higher rankings using fraud, deception, and fake content.

Next step to work towards and scientific contribution for a master thesis figure: bootstrapping your clicklog from 0 entries towards 254 entries (embarrassing low number of items --- no need to go bigger than free database of other master student)

user	user search query	click position	Track name	Magnet
1	smooth jazz	3	Marvu – Underground Jazz [2010] [EP]	magnet:?xt=urn:77efa41fc2eb77996b2fd2d2e7ea65&dn=%5BFT005%5D+-+Marvu+-+Underground+Jazz

My belief is that our fields (hyperscaling distributed systems; and ML) will merge; https://www.gwern.net/Scaling-hypothesis
Cardinal question and my usual rant: do not use complexity, do not use complex libraries.
Simple things that have limited performance
- do we need more than: user, item, query, click
- why a N-dimensional data store, is a few tables not sufficient?
- See data structure from a colorful dataset out of Sweden: https://web.archive.org/web/20100214173000/http://thepiratebay.org/torrent/3783572/db_dump_and_query_log_from_piratebay.org__summer_of_2006

Tribler code removed had this simple approach and was deployed (failed due to technical debt in Tribler itself): https://github.com/Tribler/tribler/pull/1227/files#diff-1f54650a282a51132d456befe211ee30ffa50fe7382401bd4107a5ab9116bc74

CREATE TABLE MyPreference (
  torrent_id     integer PRIMARY KEY NOT NULL,
  destination_path text NOT NULL,
  progress       numeric,
  creation_time  integer NOT NULL,
  -- V2: Patch for BuddyCast 4
  click_position INTEGER DEFAULT -1,
  reranking_strategy INTEGER DEFAULT -1
);

CREATE TABLE UserEventLog (
  timestamp      numeric,
  type           integer,
  message        text
);

Please put your code in a public repo. Unique scientific contribution is self-organising intelligence. With a leaderless bunch of smartphone and simplistic gradient decent we can already claim self-* AI. Thus scientific core of thesis and novel. Academic proof-of-principle. As said before, there are hard requirements for your master thesis. It has run on Android. Please use Python for bootstrapping only.

Andrew idea: Use Gibbs sampling within this master thesis to obtain the required scientific depth and wall-of-math (sarcasm)

btw clicklog data is notoriously hard to get. Controversial AOL dataset. data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.. ~20M web queries collected from ~650k users over three months. Problem with research dataset of ClickLog. Front page article the New York Times reporters track down a specific AOL user within the released ClickLog. manager is fired over ClickLog. No datasets for science. As scientists we need to deploy privacy-first ML and study it in-situ. Slows science down, this thesis produces first Web3-based search engine which respects privacy and avoid fake decentralisation. US ISPs sell ClickLogs for $5/month/user

awrgold commented 2 years ago

https://github.com/awrgold/distributed-learn-to-rank-v1/blob/main/Building%20in%20Python.ipynb

devos50 commented 2 years ago

Since I recently also got interested in this topic and worked with decentralized Federated Learning during my recent research visit, let me give my two cents on this topic.

One difficulty of the intersection between ML and decentralized systems (and maybe IR) is that it encompasses a lot. It’s hard to focus, but at the same time, there is almost no work on getting this up and running in a decentralized network. Probably because there are many unsolved questions, e.g., Sybil attack, message propagation, bandwidth costs, model poisoning, etc. So, any contribution in this area is already novel.

In particular, you have an inherent trade-off: either a node stores as much information it learns about the network to perform as much computation locally (more data storage, faster learning/search time) or it relies on message-passing to traverse the network and build a model based on averaging (or something else) over time, in which case nodes communicate with each other more frequently, (more time, less space overhead).

Absolutely, there's a trade-off here, probably between multiple properties (privacy, computation requirements, and bandwidth?). Interesting stuff for future research.

Not sure how good Kotlin is for data science/ML stuff. I have good experience with Python and Pytorch, the latter is pretty straightforward to use. I did work with Deeplearning4J in the past but I found it very complicated to get up and running. To get some initial results, and for quicker model prototyping, I highly recommend Python.

There's also a difference between recommendation and learn-to-rank I think. At least to me, learn-to-rank seems to be a subset of recommendation methods.

As a generic approach, I recommend first focusing on getting some (simple!) model up and running, fully centralized and with a single node. Then assess the performance of this model and compare it with the reported baseline in the paper that introduces it. Then start to worry about data distribution/network topologies. For every model that I used in a recent paper, I first wrote some scripts to train and tune the parameters. This gave me a decent baseline that I could use to evaluate correctness when decentralizing the training.

Anyway, interested in learning more about this!

awrgold commented 2 years ago

@devos50 Exactly my thoughts as well. The tradeoff you described is very real, and which side I focus on is something I'm considering heavily right now.

Additionally, I'm looking specifically at item-item collaborative filtering as it eliminates the need for user matrices to be passed along as messages, meaning it has a bit more inherent privacy involved (of course messages can be encrypted/decrypted but malicious users can still infer information about users if they're valid recipients of messages).

When it comes to the distributed networking aspect, I've taken all of the blockchain and distributed system courses offered in the master's program but I'm still a data scientist by training so my specialty is more on the algorithms and statistics. As such I am doing a lot of underlying assumptions regarding the networking layer, at least for now.

With respect to L2R you are correct in that it's somewhat of a subset of recommendation, and recommender systems can be both supervised and unsupervised (or semi-supervised) in nature, which does indeed put them in a class of their own. A model that I'm chewing on right now performs a ranking based purely on predicted relevance based on the search term, in descending order. That alone can be a robust model, however there's a major gulf to be bridged as you mentioned: the amount of data stored locally versus globally.

At this point I'm assuming a universal global knowledge of the network topology as well as a global clicklog, but later iterations will likely have fragmented datasets and will rely on averaging the model via message passing (e.g. this model here).

awrgold commented 2 years ago

Progress update:

Most recent code is on Github
Network of arbitrary number of nodes contains:
- Global dataset of songs
- Global clicklog
Corpus of search terms to be randomly chosen during simulation (currently contains only terms found in the dataset, will expand later to include terms that result in no matches)
Nodes can search the entire network for a randomly selected search term
- Of the results, a random result is selected from the list
Ordering of search results is random
- Click position is randomly assigned for now
- No updating of ordering post-click is implemented yet
Global clicklog is updated post-search simulation (not in real time yet)

Thoughts:

You mentioned "This toy example data then contains signal beyond string matching, like best matches for user query soft elevator music" has not been implemented yet. I'm still thinking about how best to create an item-item matrix for collaborative filtering.
- That being said, the dataset of 254 entries you mentioned only has one observed feature in which meaningful information can be extracted: Title. This means that searching for "elevator music" will return no meaningful matches unless a track explicitly has "elevator" or "music" or both in the title, without one or both of the following features:
  - An NLP component (no thanks)
  - A filtering/recommendation engine in which an arbitrary term returns a list of songs, a user clicks on a track (regardless of its relevance) and the search query term now becomes associated with the clicked result (e.g. expanded clicklog).

I'm working on the second point, but need to understand the problem more. This last week was hectic as it was the last week of the sports season before the summer so I was quite preoccupied. This is now my primary focus: expanding the ranking mechanism based on both simple logic as well as established literature.

synctext commented 2 years ago

excellent momentum!!! Keep this up and you can defend your master thesis before X-mas.
Please keep the startup product launch interference to a minimum.
the core of the master thesis algorithm will be a relevance ranking function
- Learn-to-rank is therefore a relevance function
- this R() returns [0,1] and parameters are: keyword-to-search,
- [Pearson correlation coefficient]()
thesis assumptions: there is perfect gossip, clicklog will bootstrap, there are lots of users, no attackers, no spam.
Applying collaborative filtering techniques to movie search for better ranking and browsing
First solution for the thesis: 1) zero-server, 2) federated ML, and 3) solved keyword search problem
- DECIDE: use collaborative filtering + clicklog exchange or machine learning model exchange + averaging
- Robust privacy because you can shield yourself from almost the entire network, only work with tasteful people. Only interact with nodes with a high similarity.
- LEARNING TO RANK FOR COLLABORATIVE FILTERING
- I do not understand! we can not use a Item-item matrix, or traditional user-item matrix. We require something like search_term-item matrix. So above algorithm is not about keyword search ranking, but about recommender ranking. ??? Its just a form of semantic clustering which is decades old.
like Bitcoin, Bittorrent, and Wikipedia; your system only works in reality, not in theory :-)
- your thesis then gets three things: classic collaborative filtering, keyword search learn-to-rank, and zero-server decentralisation.
- title for master thesis possiblity: "local-first unsupervised machine learning"
- TODO for next meeting: get the first 2 operational inside a Jupyter notebook
- unsupervised learning of new keywords, suggest list of items, user selects 1 item, add into clicklog, and learn
- what the user clicks on: we use a deterministic random choice {a term use in literature: click modelling}
- Corpus of search terms to be randomly chosen during simulation, limited search term corpus to 10 or 100 and see convergence of same songs for a search.
- 1 graph for next meeting is promised!

awrgold commented 2 years ago

Some of the major papers I'm looking at (to be updated later):

When it comes to a keyword search for a specific term that has:

never been previously searched for
has no matches in the clicklog or dataset

the first thing to do is return a (sub)set of songs, and whichever gets clicked on becomes the #1 result associated with that term. Call it a "random walk?" Over time this term will (ostensibly) become populated with more and more results.

As each item gets clicked, the search terms that led to the specific item being clicked become more and more "similar".

Collaborative filtering based on each user's clicklog then leads to learning to rank from previously unseen search terms.

synctext commented 2 years ago

Please keep going on your own path. {note keeping} Prior running ML recommendation code from Delft that now has some overlap with your learn-to-rank work. code {more note keeping, privacy in gossip:} No worries if the gossip leaks private info in this generation of learn-to-rank. We have crypto party tricks ready for the next master student to jump on. Privacy for keyword searches in encrypted domain: "Forward and Backward Private Searchable Encryption from Constrained Cryptographic Primitives". Famous last words: just make it work; we will fix security and privacy later!

synctext commented 1 year ago

currently learning by prototype coding (artist, album title, etc.)
Please do a 1-page problem description + design write-up. Should be in master thesis writing level.
minimize deviations from thesis direction [repeating]
like Bitcoin, Bittorrent, and Wikipedia; your system only works in reality, not in theory :-)
- your thesis then gets three things: classic collaborative filtering, keyword search learn-to-rank, and zero-server decentralisation. (==hyperscale to 1 billion devices, learn how google-alternative could work)
- title for master thesis possiblity: "local-first unsupervised machine learning"
- TODO for next meeting: get the first 2 operational inside a Jupyter notebook
- unsupervised learning of new keywords, suggest list of items, user selects 1 item, add into clicklog, and learn
- what the user clicks on: we use a deterministic random choice {a term use in literature: click modelling}
- Corpus of search terms to be randomly chosen during simulation, limited search term corpus to 10 or 100 and see convergence of same songs for a search.
- 1 graph for next meeting is promised!

synctext commented 1 year ago

somewhat related work from last month. AI people trying silly key exchange.. double layered crypto.. NttpFL: Privacy-Preserving oriented No Trusted Third Party Federated Learning System based on Blockchain, https://ieeexplore.ieee.org/document/9802707

awrgold commented 1 year ago

Been ill the last week or so, just started recovering yesterday. Right now I have this:

A simulated network with global knowledge - all users know all users, and each user has 100% of the song database available to them at all times. The traditional version of collaborative filtering uses user ratings of some kind. There are many ways to do this, whether binary (like/dislike), ternary (0/1/2), or a rating system (say 1-5 stars). I am using a ternary rating system at the moment:

0 = never searched for 1 = appeared in a search 2 = clicked on in a search

and from here I can create a comprehensive rating system for each possible result. However, when somebody searches for a term, I'm still looking for exact string matches that are sorted by the highest relevance rankings from this aforementioned rating system. This isn't ideal, and is the point of focus for me right now to figure out a better way to rank beyond a binary "is string match? -> descending rank results"

awrgold commented 1 year ago

Alright, so right now I'm reaching a bit of a complexity barrier: if I have a global dataset with ~250 songs with individual/global clicklogs, if I'm using matrix factorization to determine the similarity between songs that appear in the search such that I can rank them by similarity, the problem becomes factorial in complexity. I can of course constrain this to like 5 or 10 search results to reduce complexity of course, but then I'm thinking I might need to do something like TF-IDF or BERT or something like that to go beyond just string matching of search terms to song attributes, that way it doesn't return ALL the songs that have common words like "the" in the title. Otherwise I can just manually ignore common "stop words" for now, which will definitely save me time but is not "robust."

synctext commented 1 year ago

cool! so you have your 2==clicked on in search. Please document in master thesis format (1-2pages) your data structure and algorithm. Are you heading towards a minimal clicklog of [search-term, user, torrent] data structure? Please only use that for your ML. Also, avoid any string matching for what user clicks on. Use a 'magic' user preference function. Seeing beyond string matching is what makes it so awesome.

synctext commented 1 year ago

btw you should keep doing just a clicklog. You now added also a not-clicked-log. That is just a distraction. Performance tuning like that is for later master thesis student.

awrgold commented 1 year ago

The challenge is this, though: using the clicklog dataset with 250 songs, I need to determine some form of similarity between each item. Item-item collaborative filtering requires some form of scoring/rating function per user; that is, I need to know what each user thinks of each song in order to assign is a score value. Once I have this for each song pair for each node, I have a massive permutation of songs.

What I was using string matching for was to essentially prune the dataset down dramatically, e.g. if they search for a term and a song/artist/album/genre does not contain this term, it is not considered for ranking. This makes the problem significantly more tractable. Without it, I need to somehow return a pruned list of results, or every search will be creating truly significant computational overhead.

Tribler / tribler

literature survey + master thesis: G-Rank learn-to-rank #5313

Poster title: "Zero taxation DAO technology"