Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

literature survey + master thesis: G-Rank learn-to-rank #5313

Closed synctext closed 1 year ago

synctext commented 4 years ago

Direction changed, txt will be updated soon.

Old stuff:

awrgold commented 1 year ago

image

So this is interesting: I had the thought that we would want every node in the simulation to be bootstrapped the exact same way, so that adversaries are not given preferential or different treatment from the rest. I now initialize the network as follows:

Create a network with N nodes, each containing roughly 10% of the total dataset of song items. They have no clicklog at first. Then they perform 10 queries to add some items to their clicklogs, randomly selecting a result each time.

Then, the bootstrap phase begins:

As such, Node 0 cannot be bootstrapped at first: it only has its local library and a single clicklog item. Node 1 must be bootstrapped by Node 0, and only gets 1 clicklog item in exchange. Node 1's clicklog is now 2x the size of Node 0's clicklog. Node 2 is bootstrapped by either Node 0 or Node 1, and so on and so on...

We still therefore have an imbalance in distribution, but what we see is that in the early stages of the network performance varies widely, and then it rapidly converges as more and more gossip occurs.
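The bootstrap procedure above can be sketched roughly as follows (all names and data structures are illustrative, not the actual simulator code; each node starts with ~10% of the dataset, warms up with random clicks, and is then bootstrapped by a randomly chosen earlier node, handing one clicklog item back in exchange):

```python
import random

def bootstrap_network(num_nodes, dataset, library_frac=0.1, warmup_queries=10):
    """Sketch of the bootstrap phase described above (illustrative names)."""
    nodes = []
    for i in range(num_nodes):
        # Each node holds ~10% of the song dataset and performs warm-up
        # queries, randomly clicking one result per query.
        library = random.sample(dataset, max(1, int(len(dataset) * library_frac)))
        clicklog = [random.choice(library) for _ in range(warmup_queries)]
        nodes.append({"id": i, "library": library, "clicklog": clicklog})

    for i in range(1, num_nodes):            # Node 0 cannot be bootstrapped
        peer = nodes[random.randrange(i)]    # pick an already-bootstrapped node
        nodes[i]["clicklog"].extend(peer["clicklog"])                  # receive clicklog
        peer["clicklog"].append(random.choice(nodes[i]["clicklog"]))  # 1 item back
    return nodes
```

The sequential structure reproduces the imbalance described above: later nodes start with strictly larger clicklogs than Node 0.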

There is no gossip round anymore; nodes always gossip post-query. The reason that graph's X-axis isn't 0-100 is that I stopped the simulation early: the calculation of performance metrics was growing exponentially with each evaluation round, so I need to optimize my code (it was on track to be a 40-hour simulation).

awrgold commented 1 year ago

image

Might have an issue... I'm 37% of the way through a single full-scale simulation and it's already taken 26 hours...

The simulation speed slows down exponentially, and I need to figure out why. I understand that the nodes are sharing more and more of their clicklogs, but these operations shouldn't take that long. I will have to dive into the simulation, or just rent some server time and run all 8 simulations concurrently; at this rate it'll take 3 days (if it doesn't slow down any more) or longer just to finish these sims.

devos50 commented 1 year ago

It might be a memory issue. I suggest then using our dedicated infrastructure, for example the DAS6. These machines usually have much more memory than end-user hardware. You might want to contact @xoriole since he can set up an account for you, but I wouldn't expect much until after the holiday season 👍

awrgold commented 1 year ago

It's definitely a memory issue. I have some credits lying around somewhere if I want to spin up an instance, but I'm also seeing some interesting preliminary results, so I want to see if I can't fix it first.

synctext commented 1 year ago

X-mas day gift: slowdown issue! (memory or something else)

Well, this is then a forced moment to clean up your code. Using DAS6 would not fix your bug at this late stage. Remember me pushing for straight 45° graphs? This is why: simple, understandable, and debuggable.

My other advice: start confirming which parts of the gossip work; no exponential growth. Keep it simple. You're an AI expert, but our natural habitat is gossip load-balancing debugging. Simplify the clicklog sharing: send the entire clicklog to 1 random node per round. Fixed? Then plot a node discovery graph, for each node in the system: how many incoming messages and how many unique discovered other nodes. Usually you see patterns with hundreds of lines (hundreds of nodes).
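The bookkeeping for such a node-discovery graph could look like this (a minimal sketch with hypothetical names, not the simulator's actual code):

```python
from collections import defaultdict

class DiscoveryStats:
    """Per node: incoming gossip messages and unique peers discovered."""

    def __init__(self):
        self.incoming = defaultdict(int)    # node id -> messages received
        self.known = defaultdict(set)       # node id -> unique discovered peers

    def record(self, sender, receiver):
        # Call once for every gossip message delivered in the simulation.
        self.incoming[receiver] += 1
        self.known[receiver].add(sender)

    def snapshot(self):
        # One (messages, unique peers) point per node per round; plotting
        # these over time gives one line per node, hundreds of lines total.
        return {n: (self.incoming[n], len(self.known[n])) for n in self.incoming}
```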

awrgold commented 1 year ago

As an extra Christmas gift my PC decided to forcibly update windows and restarted as my simulation was approaching 70% finished, so that is great.

awrgold commented 1 year ago

image

Some interesting results: when the normalization constant F=1.0 (i.e. nodes do not cluster themselves based on similarity to other nodes in their clicklog), we see that all nodes converge to a very similar score.

I don't know what the behavior is like over 10000 queries since my PC restarted, but the above is what I had from a smaller simulation.

image

Then, if F=0.0 (where nodes do not consider the clicks of any nodes they're aware of unless those nodes have performed similar query-click patterns), we see less convergence, which honestly is the opposite of what I would have expected...

awrgold commented 1 year ago

I'm not sure what you think of the color scheme above. I'm using blue to indicate nodes that have not been infected by clicklog entries from malicious nodes, red to indicate spam/sybil nodes, and green to indicate non-malicious nodes that have entries from malicious nodes in their local clicklogs.

I wonder if this is the most clear way to visualize this metric...

awrgold commented 1 year ago

Alright: beyond a few optimizations, I had a ton of fun learning about string interning in Python. I was doing a whole lot of string comparisons when checking whether clicklog items contained the query string, and I was doing it inefficiently. Hopefully this leads to a speedup; I had to refactor a lot of code.

synctext commented 1 year ago

Good to hear! Please focus on thesis-quality text for our next meeting tomorrow. Explain the (security) Sybil/pollution experiments.

awrgold commented 1 year ago

Of course. The problem is that I'm basically just waiting for results to discuss at this point, so this was the major bottleneck for me.

awrgold commented 1 year ago

@synctext How does this sound as a new first paragraph of Problem Description:

"Security within the domain of decentralized machine learning remains an unsolved problem. Decentralized networks impose numerous additional constraints that traditional machine learning models need not be concerned with. Trustless, anonymous networks are rife with malicious users, and the task of identity verification in such networks also remains unsolved (TODO: SOURCES). Adding a further layer of complexity, many p2p networks are built upon open-source software, affording any would-be adversary direct insight into potential attack vectors. As such, machine learning models engineered for public p2p networks require exceptional attention to detail across all facets of their design. These constraints disqualify supervised models from the outset, as they violate the trustless nature of p2p networks: either the engineers of such models must be trusted to train and validate them, or the network participants must provide training data themselves, thereby introducing a critical vulnerability into the training process. Creating a search engine for a p2p domain that requires no training, yet converges towards optimal performance as if an error rate were being minimized in a supervised model, would constitute a major development in p2p applications."

awrgold commented 1 year ago

It might be a memory issue. I suggest then using our dedicated infrastructure, for example, the DAS6. These machines usually have much more memory than end-user hardware. You might want to contact @xoriole since he can setup an account for you, but I wouldn't expect much until after the holiday season šŸ‘

FWIW I learned a lot about Python object creation and destruction in the past week or so. Also, string interning is kind of a big deal when you're dealing with huge numbers of string comparisons. Who'da thunk??

I have a method where I find matches in clicklog entries by comparing strings, and I was checking substring containment (Python's in operator) instead of string1 == string2. Even better would be a thorough refactoring and careful code review to use string1 is string2, but given my 'distributed simulation' I figured that's a step too far.

Sped things up considerably, by a factor of about 30-50x. Comparisons are effectively O(1) now.
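For reference: Python strings have no .contains() method; substring checks use the in operator, which scans character by character, while equality on interned strings can short-circuit to a pointer comparison. A small illustration with sys.intern:

```python
import sys

# Two equal strings built at runtime are normally distinct objects,
# so `==` must compare them character by character.
a = "punk " + "".join("classical")
b = "".join(["punk ", "classical"])
assert a == b

# After interning, equal strings share one canonical object, so the
# identity check `is` (a constant-time pointer comparison) suffices.
a, b = sys.intern(a), sys.intern(b)
assert a is b
```

This is why replacing repeated substring scans with interned equality checks can yield the kind of 30-50x speedup reported above when the same query terms are compared millions of times.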

awrgold commented 1 year ago

image

Interesting preliminary push-pull results. Small sim where nodes "pull" clicklog updates from 1st or 2nd degree neighbors, but spam nodes "push" to whatever nodes they're aware of.

I was thinking of having 50% of the nodes accept "pushed" gossip while still pulling, and the other 50% pull only. That could illustrate the difference, since it does seem that most nodes get infected eventually.
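One round of this push/pull scheme could be sketched as follows (hypothetical structures: honest nodes pull a clicklog update from one known neighbour, spam nodes push to everyone they are aware of):

```python
import random

def gossip_round(honest_nodes, spam_nodes):
    # Honest nodes pull one clicklog update from a random known neighbour
    # (1st- or 2nd-degree, depending on how "known" is populated).
    for node in honest_nodes:
        if node["known"]:
            peer = random.choice(node["known"])
            node["clicklog"].extend(peer["clicklog"])
    # Spam nodes push their poisoned entries to every node they know of.
    for spam in spam_nodes:
        for victim in spam["known"]:
            victim["clicklog"].extend(spam["clicklog"])
```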

awrgold commented 1 year ago

There is something I can't quite wrap my head around:

image

The performance plateaus pre-attack, and then continues on a downward trajectory post-attack. This happens in every adversarial simulation, to some degree.

I am not taking adversarial nodes' query rankings into consideration when measuring a node's ranking performance, so introducing attackers does not directly influence the score - only the gossip of poisoned clicklog entries over time.

The only thing I can think of is that adversarial nodes will gossip their infected clicklogs which will eventually propagate throughout the network, and over time those poisoned clicklog entries will affect ranking scores. As you can see, post-attack there is also a plateau that occurs until a node gets lucky enough to receive an update via gossip that positively influences their performance.

awrgold commented 1 year ago

@synctext I know it's 11:59 on the doomsday clock, but I've been having some very intriguing results with the push-pull architecture such that I feel like it's actually worth integrating as a core function of G-Rank, at least within the context of the simulations.

Nodes that are accepting of pushed gossip messages are converging faster towards optimality pre-attack, and then very quickly start performing much worse, whereas those that are pull-only nodes are converging at a slower rate but are largely unaffected by malicious gossip.

A major reason why I'm thinking this is, admittedly, that I'm dealing with some serious slowdowns in simulations that I cannot explain. It's actually not a memory issue: I've been doing quite a bit of profiling and have done a lot of optimizing. The simulator just has a really hard time calculating similarities for a large number of nodes, which of course become aware of each other quickly during spam attacks. I'm still trying to figure this out.

A preliminary side experiment I've been running is the 50% push, 50% pull experiment with a few modifications:

I've also introduced some statistical noise into the rankings to avoid the plateau seen above: there's a 50% chance per query for 2 randomly selected items in the list to swap places in the ranking. This does seem to help a bit with improving ranking over time.
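That noise step might look like this (an illustrative sketch, assuming the ranking is a plain list):

```python
import random

def add_ranking_noise(ranking, p_swap=0.5):
    """With probability p_swap, swap two randomly chosen entries of the
    ranked result list; otherwise return it unchanged."""
    ranking = list(ranking)  # don't mutate the caller's list
    if len(ranking) >= 2 and random.random() < p_swap:
        i, j = random.sample(range(len(ranking)), 2)
        ranking[i], ranking[j] = ranking[j], ranking[i]
    return ranking
```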

Push-pull simulations are really fast. I can do 100k queries in the same time as it takes the other non-push/pull simulations to do 5k queries.

Extensive profiling has not shown me why this is happening. %prun shows, of course, that there's one specific function call taking up 99% of the time: the == comparison between the current query term and past query terms in the clicklog. I've even tried speeding this up by encoding the query terms as integer values, but even that does not help. Something is going on, and since I'm about to disappear for 5 days, I'm going to run all of my experiments under a push-pull gossip architecture, with the various attacks for comparison.

I'll have my laptop on my ski trip but I won't be working that much during the trip, just so I can check in on the simulations and maybe do some writing in the car.

Let me know what you think?

awrgold commented 1 year ago

image

The results from a targeted attack were pretty interesting though

synctext commented 1 year ago

quick remarks:

awrgold commented 1 year ago

Edited for a few necessary clarifications:

1) 10s ATTACK: You say "10s attack" but I have no idea what you mean by this, I must have misheard you. 10 seconds to kill the network? The simulation is in discrete time steps. Can you elaborate?

2) PUSH vs. PULL: Right now the number of experiment permutations is 8: four experiments with two parameter settings (F=0 and F=1). If we add Push vs. Pull across all experiments, we're at 16 experiments.

I would assume, then, that the Push vs. Pull experiment is not a comparison across all attack types; instead, we choose one attack (probably Targeted Sybil), perform a pull-based experiment, and compare the results against the regular Targeted Sybil attack, which uses a push scheme.

Currently, the writing is as follows:

With a pull architecture, peers are more autonomous and decide individually the speed of incoming information, if they trust another peer, or may randomly sample from discovered peers. Malicious nodes in this experiment try to push two messages when gossiping. With the pull architecture, only one message per gossip phase is accepted.

As such, the fact that only malicious nodes push 2 messages whereas benign nodes push only 1 is somewhat confusing to me: it adds extra dimensions to the experiment and also means that malicious nodes have modified the source code to change the gossip scheme, which in itself is a pretty major attack vector. I feel like only one of the following should be true:

Does this make sense?

3) You say:

attacker passively join (at 25%), then act bad (75% in)

Do we want to change this for all attacks? Right now they're joining at 25% and begin attacking immediately.

4) EPIC ATTACK: We used to have the "Persistent Sybil" attack, which represented of course a persistent threat. You say:

Epic attack scenario. nice network. Suddenly 3x as many Sybil appear all at once (e.g. 75% Sybils).

Do we change this Persistent Sybil attack scenario into an Epic Attack scenario? Or is "Epic Attack" a new kind of attack that we're also introducing?

synctext commented 1 year ago

10s ATTACK: You say "10s attack" but I have no idea what you mean by this, I must have misheard you. 10 seconds to kill the network?

It also shows in the latest figure that after 10 'rounds' of attack the whole network is polluted and essentially destroyed: https://github.com/Tribler/tribler/issues/5313#issuecomment-1380576305. Sidenote: by making 1 round exactly 1 second, your results are easier to understand and interpret. Everything happens very fast: the idea of peer-to-peer machine learning is viable, but 1 attacker can destroy everything in 10 seconds. Great conclusion.

but if we add Push vs. Pull across all experiments, we're now at 16 experiments.

Always pick the smart, secure option: pull. Just 1 experiment subsection can elaborate on the push architecture and its consequences.

instead, we choose one attack (probably Targeted Sybil) and perform a Pull-based experiment, and then compare the results against the regular Targeted Sybil attack, which is under a Push scheme.

that sounds like a good storyline

As such, the fact that only malicious nodes push 2 messages whereas benign nodes push only 1 is somewhat confusing to me, as it adds extra dimensions to the experiment and also means that malicious nodes have undermined the source code to modify the gossip scheme,

In any distributed system, each node can deviate from the protocol in a byzantine manner. So this experiment (could) explore the rate control that is done in push versus pull. With a push architecture an attacker can send messages twice as fast. It is very understandable that a deliberately simple simulator only supports fixed message speeds. Feel free to explain that within your text and 'hack' it by letting honest nodes send empty messages with 50% probability: "For reasons of keeping our code simple, reliable, and correct we use a single global message speed for both attackers and honest peers. Attackers use each message to attack, but honest peers obtain a lower effective messaging speed. With 50% probability they send an empty message."
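The single-global-message-speed hack quoted above could be implemented along these lines (illustrative only, not the actual simulator code):

```python
import random

def next_message(peer, p_empty=0.5):
    """One global message slot per peer per round: attackers use every
    slot to attack, while honest peers send an empty message with
    probability p_empty, halving their effective messaging speed."""
    if peer["malicious"]:
        return peer["payload"]   # every slot is used to attack
    if random.random() < p_empty:
        return None              # empty message
    return peer["payload"]
```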

A specific experiment is performed where malicious nodes push 2 messages and benign nodes push only 1, but push vs. pull only occurs in a single experiment and all other experiments ignore the pull architecture and gossip as normal.

Yes! Except that pull should be the default. Push has bad security: you can easily flood networks with 1 GBit per second of spam.

Do we want to change this for all attacks? Right now they're joining at 25% and begin attacking immediately.

All figures. My intuition says that your results will be easier to interpret and understand by avoiding two things happening at once. But again, it's your thesis to write. I'm just paid to help and advise :-)

Do we change this Persistent Sybil attack scenario into an Epic Attack scenario?

Yes, that sounds like a more interesting experiment. 100 honest, 300 attackers or so.

awrgold commented 1 year ago

1) 👍
2) 👍
3) 👍
4) I understand that we can have byzantine processes, especially within the context of p2p networks; I just assumed we were focusing purely on a specific type of threat (e.g. sybil attackers), and that discussing the potential for byzantine faults and other sorts of attacks was outside the scope of the paper. Your explanation makes sense though, I'll integrate that.
5) I'll make pull the default across all simulations, then?
6) I'll have them join at 25% and attack at 75%. This is just different from what we had been doing, so it felt like a last-minute major change, but honestly it's not that big of a deal.
7) 👍

awrgold commented 1 year ago

Reminder, though, that purely pull-based gossip means nodes will never become aware of the network beyond the nodes they already know, so we'd need some kind of propagation method. I think this is why I was doing the 50% push / 50% pull simulation: it allowed nodes to become more aware of the network, at the risk of being more susceptible to attack.

Just so that we're 100% clear: we're using pull-based gossip as the default setting for ALL simulations (except the push vs. pull comparison experiment) which means that I'll need to re-write the gossip section of the thesis. Correct? @synctext

Also, I realized that with pull-based gossip, attackers will perform their attack but then must lie in wait until someone requests an update from them, which may dramatically slow them down. If we have attackers push, then all adversarial simulations involve a push architecture. How do we handle this?

synctext commented 1 year ago

Reviewing latest thesis .pdf

awrgold commented 1 year ago

Found another problem with my data from the experiments; I will have to run them again because values were being truncated for some reason. Will upload graphics ASAP.

Update: yet another bug found. Only nodes that received gossip REQUESTS were updating during gossip, not the recipients of the gossip itself, so nodes could not discover newly bootstrapped nodes and attacks never happened. Fixed, running again.

awrgold commented 1 year ago

Ah, I forgot a question that I think is somewhat important: We are doing bootstrap at 25%, attack at 75% - however, this leaves little time to see the longer-term effects of the attack. What if we did bootstrap at 25%, attack at 50%?

awrgold commented 1 year ago

image

Baseline simulation does plateau relatively quickly, but some nodes start to move downwards over time. Initial guess is that this is either:

  1. due to the random noise inserted into the rankings, or
  2. due to luck by randomly sampling from a node with better rankings for a specific term

Not that I will prioritize this, but the baseline sim is super fast so I can do a sim 10x as long (or increase the number of request messages) to see what the culprit is.

More updates coming after fixing the issue mentioned here: https://github.com/Tribler/tribler/issues/5313#issuecomment-1397096339

awrgold commented 1 year ago

image

PUSH: Targeted Sybil Attack with Push gossip scheme. Within 10 rounds of the attack, the entire network converges and plateaus.

I'm trying to think of a better way to visualize the infected nodes alongside the sybil nodes. I don't want to offset them just so the green infected nodes are visible behind the sybil nodes, but I also don't want the sybil nodes obscured by the infected nodes. I guess we can just note in the figure caption that sybil nodes overlap the infected nodes.

synctext commented 1 year ago

Literature Review and comments

awrgold commented 1 year ago

Ars Technica: Massive Yandex code leak reveals Russian search engine's ranking factors. https://arstechnica.com/information-technology/2023/01/massive-yandex-code-leak-reveals-russian-search-engines-ranking-factors/

Material for future students?

synctext commented 1 year ago

🚀 GRADUATED 🚀

arXiv: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network
TU Delft repo: of a decentralised search engine to De-Google The Internet using donated smartphones
CODE: https://github.com/awrgold/G-Rank (Jupyter Notebook file)

synctext commented 1 year ago

The above master thesis is also available for 5 BSc students as a final thesis project in Q4.

Create WEB3 search engine from scratch - De-Google The Internet

Numerous initiatives have tried to compete with the success of Google; none succeeded. Using the latest scientific insights, you will design a fully decentralised alternative to the Google search engine.

The Delft Blockchain Lab (DBL) is TU Delft's initiative for research, education, and training in blockchain technology and trust in the internet. Our research heavily focusses on improving the efficiency of blockchains, self-sovereign identities, and blockchain-powered marketplaces. Our key blockchain project is Tribler, a peer-to-peer file-sharing application where users can share and download digital material without requiring a central operator. The main goal of Tribler is to provide a decentralized alternative to YouTube using BitTorrent. Efforts by several generations of students have resulted in a mobile implementation of Tribler, which we plan on expanding and releasing. As background, see the dozens of articles on TorrentFreak.com about Tribler.

Personalised mobile media search, based on your watch history, is a key requirement for a mobile version of Tribler. It enables users to discover new, interesting content and helps media creators reach the right audience for their content. A search engine for videos, however, is a non-trivial problem. Learn-to-rank algorithms present a wealth of available media in an optimal ordering for the user to click. Decentralisation of such search engine algorithms has proven to make a difficult problem even harder. Ranked search engine results require the full list of available videos to be available locally. However, this requirement becomes unfeasible on mobile phones with millions of data elements, given the limited storage capacity and battery constraints of mobile phones. Simply put, TikTok has too many videos to search locally. Off-loading the computations to other users raises privacy concerns, since a user likely does not want to reveal their watch history to other (untrusted) users. The use of privacy-preserving, distributed machine learning in Tribler should provide a solution for media search on mobile phones.

Literature and background

First, read the Wikipedia entry on decentralised search engines. Your key literature for this task is the Delft paper which currently defines the state-of-the-art: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network. A more introductory blog post on learn-to-rank and dataset details. Early work from 2005 by Delft provides a simple and realistic experimental approach to the problem of media discovery and recommendation, you are required to understand the basic algorithm of semantic clustering (e.g. taste buddies). A paper from 2012 proposes a model where mobile phones use gossip learning to compute a linear model, without revealing local models and without storing the full data set. Another classic attempt from 2003 onwards is the decentralised YaCy search engine with a web crawler, complex hashing, and reverse word index. Finally, your search engine will be implemented inside the Internet-deployed alternative to Big Tech platforms called the SuperApp. This Web3 open source software is written in Kotlin. This Web3 infrastructure is as decentralised as Bitcoin and Bittorrent. Instead of videos, your project is focused on the simpler case of music. Required background reading on the Spotify alternative: Fairness and Freedom for Artists: Towards a Robot Economy for the Music Industry. To keep this project viable, you can ignore security issues such as the Sybil attack and also privacy issues resulting from sharing your ClickLog with strangers on The Internet. This is left as future research, your focus is on an initial proof-of-principle.

Problem Description

PageRank is the defining centralised algorithm of Google. Understand existing algorithms and architectures for decentralised search engines. Understand the state-of-the-art algorithm within this area: G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network. Contribute to the first hands-on proof-of-principle implementation of G-Rank.

Five research questions

See the initial placeholder implementation for keyword search already present. The following 5 sub-questions will each be assigned to a single student:

Dataset engineering. How to design and implement a user model which issues queries and selects one entry from the presented search engine results. Desired outcome: one search that your model issues will be for the unique words "Punk classical Jazz". This deliberately strange query must select one of the matching musical pieces marked with all three of these tags. Required scientific literature: "A user browsing model to predict search engine click data from past observations". For the experimental part, use the existing scrape facility to duplicate the Creative Commons music from Pandacd. Critical input for the learn-to-rank algorithms is the popularity of each song or artist. Enhance your model with an existing dataset of 1.4 million artists along with their popularity estimates.

Centroids-based clustering. Design and implement semantic clustering of the dataset. Based on the metadata, tags, and popularity you enhance the dataset and user model with Euclidean, Minkowski, Manhattan, or other distance measuring mechanisms. Required background reading. Based on this work you will generate representative taste profiles (e.g. ClickLogs)

Decentralised taste exchange. How to design and implement an overlay network to disseminate taste. Your task is to create a detailed design of the ClickLog exchanger within the G-Rank algorithm. As a starting point within the literature, read "Random Walks on the Click Graph".

Accurate Learn-to-Rank. How to design and implement unsupervised learn-to-rank heuristics with decent response time, at the cost of minor precision and recall. Your results need to appear within 2 seconds, allowing a reasonable amount of computation. Background literature: "Real time search on the web: Queries, topics, and economic value".

Real-time learn-to-rank. How to design and implement unsupervised learn-to-rank heuristics with fast response time, at the cost of significant precision and recall. Your results need to appear within 100 ms, so very little computation can be performed and pre-calculated indexing techniques must be used. Background literature: "Real time search on the web: Queries, topics, and economic value".

Together these 5 research questions lead to a complete design of a fully distributed search engine.
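For the centroids-based clustering sub-question, the distance measures listed (Euclidean, Manhattan, Minkowski) are all instances of one family; a minimal sketch:

```python
def minkowski(u, v, p=2):
    """Minkowski distance between two taste vectors (e.g. tag/popularity
    features): p=1 gives Manhattan distance, p=2 gives Euclidean."""
    assert len(u) == len(v), "taste vectors must have equal dimension"
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)
```

Clustering ClickLog-derived taste profiles then reduces to choosing p and a centroid update rule.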

Note: this is a challenging assignment that requires thorough understanding of specific scientific literature and ability to engineer algorithms. Recommended for Honor Students only.

synctext commented 1 year ago

state of the art by Google https://ai.googleblog.com/2023/06/retrieval-augmented-visual-language-pre.html

'''A naĆÆve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and theĀ top-kĀ retrieved memory values by concatenating all their tokens together and feeding them into aĀ transformer encoder-decoderĀ pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens timesĀ kĀ forĀ self-attention. Therefore, we propose to use theĀ Perceiver architectureĀ to encode and compress knowledge items." https://www.deepmind.com/publications/perceiver-general-perception-with-iterative-attention