ethersphere / swarm

Swarm: Censorship resistant storage and communication infrastructure for a truly sovereign digital society
https://swarm.ethereum.org/
GNU Lesser General Public License v3.0

localstore/db how to insert items in GC index #1031

Open adamschmideg opened 5 years ago

adamschmideg commented 5 years ago

Related issues:

  1. Garbage Collection and Syncing and Proof of Burn - https://github.com/ethersphere/go-ethereum/issues/1017 ; https://hackmd.io/t-OQFK3mTsGfrpLCqDrdlw
  2. Spam protection of Swarm through modifying GC rules - https://github.com/ethersphere/go-ethereum/issues/757
  3. Implement DB store API - https://github.com/ethersphere/go-ethereum/issues/990 - https://hackmd.io/ffBjSu8RTyKikvRO7bYrzA?both#operations

anteva: It seems to me that this is the main entrypoint issue for the new GC effort.

Whoever works on this and has the most knowledge on this topic: please add the following sections to this issue:

Without such a history it will be very difficult to understand any of the decisions made in this effort a few months from now.

cc @zelig @nagydani @homotopycolimit @janos

cobordism commented 5 years ago

Honestly, I have no idea what "add interval to accesstimestamp" means.

@janos and I were discussing the following idea the other day:

For each chunk in the database we store 2 extra hashes, something like nexthash and prevhash - effectively giving us a doubly linked list.

We also maintain a list of insertion points [quantile] -> [hash].

When a chunk is added to the database and we want to add it at, for example, the 80% mark, we look up the hash of the chunk currently sitting at that insertion point, add our new chunk, and update the insertion point hash as well as the prevhash and nexthash of the neighbouring chunks.

When a chunk is served in response to a retrieval request, it can be inserted at the top of the queue and the [head] pointer will point to it. The prevhash and nexthash of the surrounding chunks are updated to reflect that the chunk has been removed from its previous location.

For bootstrapping, all quantiles (except [head] of course) will map to the end of the chunk queue. This would happen until the total number of chunks has reached 50% of the total storage size. At that point the 50% marker can be set, while the others (60%, 70%... whatever) continue pointing to the last chunk.

GC happens from the tail of the queue, proceeding to prevhash every time.

One main benefit of this idea is that we can maintain multiple insertion points, and we don't have to update every chunk's entry when the queue changes, because position-in-the-queue is implied by the links rather than stored on each chunk.
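
For concreteness, here is a minimal in-memory sketch of the idea as I read it (all names are hypothetical; in the real store the prev/next links and the insertion-point table would be persisted in the database):

```go
package gcqueue

type hash string

type node struct {
	prev, next hash // "" means no neighbour, i.e. head or tail of the queue
}

type queue struct {
	nodes     map[hash]*node
	head      hash
	tail      hash
	insertion map[int]hash // e.g. 80 -> hash currently sitting at the 80% mark
}

// insertAt adds h just above the chunk currently marking the given quantile;
// only the two neighbouring entries and the insertion-point record change.
func (q *queue) insertAt(quantile int, h hash) {
	at, ok := q.insertion[quantile]
	if !ok || at == "" {
		at = q.tail // bootstrapping: quantiles point at the end of the queue
	}
	n := &node{}
	if at == "" { // queue is empty
		q.head, q.tail = h, h
	} else {
		cur := q.nodes[at]
		n.prev, n.next = cur.prev, at
		if cur.prev != "" {
			q.nodes[cur.prev].next = h
		} else {
			q.head = h
		}
		cur.prev = h
	}
	q.nodes[h] = n
	q.insertion[quantile] = h // the new chunk now marks this quantile
}

// touch moves h to the head of the queue after a retrieval. (If h was marking
// a quantile, the insertion-point table would also need adjusting; elided.)
func (q *queue) touch(h hash) {
	if q.head == h {
		return
	}
	n := q.nodes[h]
	// unlink from the current position
	if n.prev != "" {
		q.nodes[n.prev].next = n.next
	}
	if n.next != "" {
		q.nodes[n.next].prev = n.prev
	} else {
		q.tail = n.prev
	}
	// relink at the head
	n.prev, n.next = "", q.head
	q.nodes[q.head].prev = h
	q.head = h
}
```

GC would then walk from the tail via prevhash, as described above.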

zelig commented 5 years ago

@homotopycolimit
the trouble with this idea was that it necessitates a global lock on chunk insertion and thus imposes a performance penalty. Maybe @janos can explain in more detail.

"add interval to accesstimestamp" is meant to refer to @janos solution that at each retrieval access an interval is added to the last accesstimestamp, thereby boosting its rank (add more protection against GC). The actual interval added can depend on proximity, current accesstimestamp and mode of access (sync/upload/retrieval for peer/retrieval for self) With this method exponential aging is implemented. parameters can be learnt by fitting a model to expectations on GC order given real world usecases. The problem is i am not sure we can formulate expectations of the "right" GC order.

cobordism commented 5 years ago

Ah ok. What we discussed the other day was strictly the quantile insertion, and not the timestamp ranking.

cobordism commented 5 years ago

the trouble with this idea was that it necessitates a global lock on chunk insertion and thus imposes a performance penalty.

why global?

janos commented 5 years ago

why global?

@homotopycolimit relations in a doubly linked list require a lock to ensure that no other parallel goroutine is changing the list. If we want to insert a hash at some percentile, we would need to lock the two hashes between which the new one is inserted, but also their neighbours, as some other chunk may need to be inserted among them at the same time, and so on, which in the end results in a global lock.

cobordism commented 5 years ago

it still feels like a local lock to me :)

but I trust your verdict

janos commented 5 years ago

Descriptions of different garbage collection implementations

Here is a summary of garbage collection suggestions based on meetings with @zelig, @homotopycolimit and @nagydani, and on the https://hackmd.io/t-OQFK3mTsGfrpLCqDrdlw document.

Note: Chunks are added to the garbage collection index only after they have been synced.

Suggestion 1: Quantiles with accessTimestamp index

GC index encoding would be accessTimestamp|hash->nil. This is the same index as the one used by the current algorithm, where every time a chunk is accessed it is put on the top of the queue. The addition is to keep track of which keys are at specific quantiles. Every new chunk will be added to the queue at some quantile depending on the origin of the chunk (uploaded, synced, retrieved), and with every new access the chunk will be moved to the top of the queue. Adding a chunk at a quantile requires identifying the key at that quantile, incrementing that key's accessTimestamp by one and using the result as the accessTimestamp for the newly added chunk. That accessTimestamp would therefore not be the real access time, but a synthetic timestamp that places the chunk at the desired quantile. In order to keep exact track of quantiles, a global lock is required.
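
A rough sketch of that insertion step, assuming hypothetical helpers timestampAtQuantile and putGCIndex (this is not the localstore code, just an illustration of the synthetic-timestamp trick and the global lock):

```go
package suggestion1

import "sync"

var gcLock sync.Mutex // global lock around quantile lookups and index writes

// insertAtQuantile gives a new chunk the accessTimestamp of the chunk
// currently sitting at the target quantile, incremented by one, so that it
// lands immediately above it in the accessTimestamp|hash index.
func insertAtQuantile(quantile float64, chunkHash []byte) error {
	gcLock.Lock()
	defer gcLock.Unlock()

	ts, err := timestampAtQuantile(quantile) // hypothetical quantile lookup
	if err != nil {
		return err
	}
	// The stored timestamp is synthetic: it only encodes queue position.
	return putGCIndex(ts+1, chunkHash) // hypothetical index write
}

// Hypothetical helpers, assumed for the sketch.
func timestampAtQuantile(q float64) (uint64, error)        { return 0, nil }
func putGCIndex(accessTimestamp uint64, hash []byte) error { return nil }
```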

Pros

Simple algorithm. Fast garbage collection.

Cons

Requires a global lock. An additional chunk index hash->origin for tracking chunk origins is required.

Suggestion 2: Quantiles in linked list

GC index structure would be hash->leftHash|rightHash, where the key is only the chunk hash and the value contains the hashes of its left and right neighbours in a doubly linked list. Additional fields would be required to keep track of at least the head and tail of the list, and of the various quantile position hashes. Every new chunk will be added to the list at some quantile depending on the origin of the chunk (uploaded, synced, retrieved), and with every new access the chunk will be moved to the start. Inserting a chunk into the list requires identifying the hash at a specific quantile and setting the appropriate index values for that hash, the next one and the newly added hash, so that they form an unbroken doubly linked list. Garbage collection would start from the end of the list, deleting the last chunk and identifying the next chunk to delete from its value (its remaining neighbour).
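
To illustrate why iteration is costly here, a sketch of the GC walk from the tail under the hash->leftHash|rightHash layout (get and del stand in for leveldb Get/Delete; 32-byte hashes and an all-zero hash marking a missing neighbour are assumptions of the sketch):

```go
package suggestion2

import "bytes"

var noHash = make([]byte, 32) // all-zero hash marks a missing neighbour (assumption)

// neighbours splits a GC index value leftHash|rightHash into its two parts.
func neighbours(value []byte) (left, right []byte) {
	return value[:32], value[32:64]
}

// collect deletes up to n chunks starting from the tail, following leftHash
// pointers towards the head. Every step needs its own Get, which is why this
// is more expensive than iterating an already sorted index.
func collect(tail []byte, n int, get func([]byte) ([]byte, error), del func([]byte) error) error {
	cur := tail
	for i := 0; i < n && !bytes.Equal(cur, noHash); i++ {
		value, err := get(cur)
		if err != nil {
			return err
		}
		left, _ := neighbours(value)
		if err := del(cur); err != nil {
			return err
		}
		cur = left
	}
	return nil
}
```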

Pros

Very simple algorithm to understand and test.

Cons

Any operation on the doubly linked list, and the tracking of quantiles, requires a global lock. Garbage collection iteration is slow and expensive as it requires using ldbstore get instead of the ldbstore iterator. Tracking quantiles requires additional fields. Adding a new quantile, or any recalibration, is an expensive operation as it requires walking through the whole list with ldbstore get. An additional chunk index hash->origin for tracking chunk origins is required.

Suggestion 3: Ranking function

GC index encoding would be rank+accessTimestamp|hash->nil, where rank is calculated by a function of parameters popularity, proximity and chunk origin, and is added as an integer to the accessTimestamp. Garbage collection would just remove chunks from the tail of the queue, as chunks will be ordered with the least important at the end.

The rank effectively increases accessTimestamp by some value calculated from these parameters. The increment is greater as popularity rises, for more proximate chunks, and depending on chunk origin. Chunks with a greater rank will be placed higher in the queue and garbage collected later. The value returned by the rank function would place the chunk at the same point in the future as it would place a chunk accessed for the first time at the same proximity and with the same origin. The problem is that the ranking function can be derived only from guesses and empirical data, based on the chunk dynamics that we predict.

Every parameter may have a linear or non-linear relation to the rank. With a linear relation we risk that some chunks are placed so high in the queue that they are never garbage collected, even when not accessed for a long time. So a non-linear dependence, for example a logarithmic one, would be better for the popularity parameter. Every parameter needs one or more constants that define its relation to accessTimestamp. The first-order linear constant is in the same unit as accessTimestamp, nanoseconds. If more proximate chunks need to be prioritized in the queue, the simplest way would be to add more time to the result of the ranking function. For example, the constant for proximity could be 10 minutes (a very arbitrary guess), multiplied by the proximity level: if the proximity value rises for closer chunks, the ranking function multiplies proximity by the 10 minute constant. For popularity, it would multiply the logarithm of the chunk's access count by its constant, which might be 1h, for example. A chunk origin enumeration would prioritize ranking by adding a different fixed constant per enum value. The rank function can be as simple as a linear function of independent parameters, rank(pop, prox, orig) = c1*pop + c2*prox + c3*orig, or a non-linear one, rank(pop, prox, orig) = c1*log(pop) + c2*prox + c3*orig, up to a very complex function of non-orthogonal parameters, rank(pop, prox, orig) = f1(pop, prox) + f2(prox, orig) + f3(pop, orig).
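
As an illustration only, one possible shape of such a rank function in code, using the guessed constants from the text (10 minutes per proximity level, 1h per unit of log-popularity) and made-up origin offsets; none of these values are calibrated:

```go
package suggestion3

import (
	"math"
	"time"
)

type origin int

const (
	originSynced origin = iota
	originUploaded
	originRetrieved
)

var (
	popularityConst = int64(time.Hour)        // c1, ns per unit of log(popularity)
	proximityConst  = int64(10 * time.Minute) // c2, ns per proximity level
	originConst     = map[origin]int64{       // c3*orig, fixed offset per origin (made up)
		originSynced:    0,
		originUploaded:  int64(30 * time.Minute),
		originRetrieved: int64(time.Hour),
	}
)

// rank returns the number of nanoseconds added to accessTimestamp, so that
// the GC index key becomes (accessTimestamp + rank)|hash.
func rank(popularity uint64, proximity uint8, orig origin) int64 {
	pop := int64(float64(popularityConst) * math.Log1p(float64(popularity)))
	prox := proximityConst * int64(proximity)
	return pop + prox + originConst[orig]
}
```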

Pros

Chunks in the garbage collection index are completely independent and can be inserted or moved in the queue without a global lock. Iteration over the garbage collection index is the least expensive way of iterating over leveldb key/value pairs, as keys are already sorted at the database level.

Cons

Constants and the formula for the ranking function can be defined only from empirical data and our assumptions about chunk prioritization over time, which may be very subjective, making consensus on it very difficult or impossible to achieve. At least two additional indexes need to be added: one for popularity counters (hash->popularity) for every chunk and one for the chunk origin (hash->origin). These indexes are needed for rank calculation.

Suggestion 4: Self adjusting quantiles or ranking function

This would be an alternative to the previous suggestions where quantiles or ranking function constants would be calibrated at runtime. This would be a very advanced approach, but requires discussion and formalization.

cobordism commented 5 years ago

great writeup @janos. but can you explain to me what this type of notation means; accessTimestamp|hash->nil ?

janos commented 5 years ago

Thanks @homotopycolimit. Yes, I missed the explanation of this notation, which represents the encoding of key/value pairs in the database. Key and value are separated by ->, and the pipe | represents concatenation of two variables encoded as byte slices. I also used the + sign, which is the addition of two integers with the result encoded as a byte slice. The nil represents an empty value, no data stored.
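
To make it concrete, one way such a key could be laid out, assuming an 8-byte big-endian accessTimestamp and a 32-byte chunk hash (big-endian so that leveldb's lexicographic key order matches timestamp order); the real localstore encoding may differ:

```go
package gcindex

import "encoding/binary"

// encodeGCKey builds accessTimestamp|hash as a single byte slice; the value
// stored under it is nil (no data).
func encodeGCKey(accessTimestamp uint64, hash []byte) []byte {
	key := make([]byte, 8+len(hash))
	binary.BigEndian.PutUint64(key[:8], accessTimestamp)
	copy(key[8:], hash)
	return key
}

// decodeGCKey splits a key back into its accessTimestamp and hash parts.
func decodeGCKey(key []byte) (accessTimestamp uint64, hash []byte) {
	return binary.BigEndian.Uint64(key[:8]), key[8:]
}
```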

zelig commented 5 years ago

Nice job @janos. One thing: I don't think that popularity needs to be tracked. The point of this approach is that your accessTimestamp already carries all the information about the combination of the relevant factors. One access would just boost the time depending on the chunk's current timestamp.

One approach could be: when accessing a chunk c for retrieval we update its accesstimestamp by projecting it ahead 1/k of the way towards some maximum future time, now + max_future.

at'(c) = at(c) + (now + max_future - at(c)) / k

The max_future parameter calibrates the absolute duration after the last access for which an accessed chunk is considered more valuable than a newly arrived chunk.

k calibrates the relative decay of an extra instance of access

we define at(c) := now for new chunks

If distance is meant to be important, new chunks could behave as if they had already been accessed n times, where n = 2^po, right at insertion time.
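
A small sketch of how this could look, writing the rule above directly in code; the constants and the closed form for n = 2^po boosts are my own additions, not a decided design:

```go
package aging

import (
	"math"
	"time"
)

const (
	maxFuture = int64(10_000_000_000_000_000) // 10^16 ns, roughly 4 months
	k         = 10
)

// boost projects an access timestamp 1/k of the way towards now + maxFuture:
// at'(c) = at(c) + (now + maxFuture - at(c)) / k
func boost(at, now int64) int64 {
	return at + (now+maxFuture-at)/k
}

// initialTimestamp returns at(c) for a new chunk. With at(c) := now and the
// chunk treated as if accessed n = 2^po times, applying boost n times at the
// same moment has the closed form now + maxFuture*(1 - (1 - 1/k)^n).
func initialTimestamp(po uint8) int64 {
	now := time.Now().UnixNano()
	n := math.Pow(2, float64(po))
	return now + int64(float64(maxFuture)*(1-math.Pow(1-1.0/k, n)))
}
```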

WDYT?

janos commented 5 years ago

@zelig I like the formula for accessTimestamp and also handling distance only at insertion. In the same way as distance is included at insertion, the origin could be as well, so we would not need to add two more indexes. 👍 We would need to figure out what values we want for the calibration constants.

zelig commented 5 years ago

How about max_future := 10^16 nanoseconds, which is roughly 4 months? And let k := 10.
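
(For reference: 10^16 ns = 10^7 s ≈ 116 days, so a bit under 4 months.)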

janos commented 5 years ago

Those are good, sane starting values.

nagydani commented 5 years ago

Suggestion 5: Approximating linked list quantiles with timestamps

The access timestamps at the quantiles described in Suggestion 1 can be reasonably well approximated by looking at a random sample of chunks and their access timestamps. Thus, newly inserted chunks can have their timestamps set accordingly, putting them very near the desired quantile. Moreover, unlike in Suggestion 1, these timestamps need not be tracked and updated with every change in the database, just regularly enough that they remain reasonable. Thus, we approximate the behavior expected from Suggestion 2 at a much lower cost.
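
A sketch of the estimation step, with sampleTimestamps as a hypothetical helper returning the access timestamps of n randomly chosen chunks (e.g. via the random field mentioned under Cons below):

```go
package suggestion5

import (
	"errors"
	"sort"
)

// estimateQuantileTimestamp approximates the accessTimestamp currently at the
// given quantile (0..1) of the GC queue from a random sample of chunk access
// timestamps. sampleTimestamps is a hypothetical helper.
func estimateQuantileTimestamp(quantile float64, n int, sampleTimestamps func(n int) ([]uint64, error)) (uint64, error) {
	ts, err := sampleTimestamps(n)
	if err != nil {
		return 0, err
	}
	if len(ts) == 0 {
		return 0, errors.New("empty sample")
	}
	sort.Slice(ts, func(i, j int) bool { return ts[i] < ts[j] })
	return ts[int(quantile*float64(len(ts)-1))], nil
}
```

The estimate would only need to be refreshed periodically, not kept exact on every insertion.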

Pros

The behavior is close to that of Suggestion 2. The costs are similar to those of Suggestion 1.

Cons

Random sampling of the chunk database might be tricky. It might require adding a random field to the database and querying the database using that field, to ensure a fair sampling.

nagydani commented 5 years ago

Further comments (resulting from discussion with @zelig and @homotopycolimit ):

Insertion into the database happens from four sources:

  1. local uploads
  2. local retrievals (data consumed by the node operator)
  3. syncing from peers
  4. retrieval requests from other nodes

The latter two happen on any node, the former two only on gateways and end-user nodes. There are two ways of looking at this issue: the overall health of the network (i.e. chunks being retrievable) and the local interests of the node operator. From the first perspective, the desirable behavior is that chunks close to the node's address (esp. those within the "radius of responsibility") are not GC'd. From the latter perspective, a desirable strategy would be one close to a Nash equilibrium, i.e. there is no obvious way for a node to profit more if everyone else follows the same strategy.

Neither huge uploads (think of an attacker pushing random stuff into Swarm) nor huge downloads (think of a "Netflix binge") should flush the entire local store of a node. The nice thing about Suggestion 2 is that it reserves a certain fraction of the store for the most profitable chunks, which are thus guaranteed protection.

Synced chunks and especially uploads should not be GC'd before they are synced to nodes closer to their destination.

The average (but not necessarily the actual) order number of a chunk in an ordering by access time is the same as its order number in an ordering by access rate, assuming that chunks are accessed as a Poisson process (i.e. memoryless, random events). Without knowing anything about a chunk, we can estimate its future rate of access by its distance from the node's address. If a chunk's address falls within the node's most proximate neighborhood, it should be inserted in the middle of the GC queue: assuming (for a moment) no caching, all other chunks are also there, and the cached chunks only got ahead of them because of their additional popularity in comparison to the most proximate chunks. Each bin further away from the most proximate neighborhood should halve the distance from the end of the GC queue.
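
One way to read that last rule (my interpretation, not a spec): a chunk in the most proximate neighbourhood enters at 50% of the queue length from the GC end, and each bin further away from that neighbourhood halves the distance from the end:

```go
package insertion

import "math"

// quantileFromEnd returns the insertion point, as a fraction of the queue
// length measured from the GC (tail) end, for a chunk with proximity order po
// when the node's most proximate neighbourhood starts at depth.
func quantileFromEnd(po, depth uint8) float64 {
	if po >= depth {
		return 0.5 // most proximate neighbourhood: middle of the queue
	}
	binsAway := float64(depth - po)
	return 0.5 * math.Pow(0.5, binsAway)
}
```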

cobordism commented 5 years ago

I just want to note here that we can do the quantile insertion with the doubly linked list without a global lock. Whenever you want to insert or remove a chunk, you must lock the two neighbours only, not more. So the question becomes: is it easier to get a lock on two or three chunks, or is it easier to do random sampling to estimate insertion points?

(Example: to remove a chunk, lock the chunk and its two neighbours, update the 'next-chunk' link in the previous chunk and the 'previous-chunk' link in the next chunk, and maybe update the quantile pointer.)

zelig commented 5 years ago

@homotopycolimit I am not so sure. In order to know which neighbours you want to lock, you must make a db lookup and lock them with the same operations, which is essentially a global lock plus an extra lookup.

Dani's idea of approximating quantiles is a good one.

@janos let's decide on an implementation and start working on this.

cobordism commented 5 years ago

please explain to me how locking down 3 (or 5) chunks to be edited simultaneously requires a global lock of all chunks? I don't get it.

zelig commented 5 years ago

Just imagine you need to prevent simultaneous modification of the same item. If you lock hash h_i, you look up its two neighbouring hashes h_{i-1} and h_{i+1}, then lock those hashes. This interleaving pattern causes a deadlock if adjacent items are modified concurrently. So you end up having to use a lock to prevent interleaving, i.e. a global one.

cobordism commented 5 years ago

I don't see the deadlock. If I want to move chunk 800 to position 400 I have to get a lock on 399 and 400 for the insertion, 799 and 801 for the removal and 800 itself for a total of 5 locked chunks.

To prevent a deadlock we need to be able to implement a function: get-a-lock-on-all-5-chunks-or-release-and-try-again. This should be possible without a global lock, no? A global lock is only needed if we are changing state globally, but we are clearly only changing things locally.

what am I missing?

cobordism commented 5 years ago

For the record, I'm not wedded to the idea of the linked list, I just want to understand why it can't work before we go for a random sampling technique that approximates with some effort what the linked list gives us easily.

I am certainly not an engineer, but here is my naive idea: race conditions (when requesting locks) happen when neighbouring chunks are to undergo simultaneous edits, say both chunk i and chunk i+1 have been requested and thus should be moved up the queue. Assuming that the size of the db is large compared to the number of chunks accessed per second, this is a rare occurrence. So when we detect that we cannot get a lock on all the chunks we want, we release all of them, wait a random number of milliseconds between 0 and 100, and then try again. Chances are that one of the goroutines will get the needed locks before the other one tries again, and the deadlock situation untangles. After all, this is how proof-of-work chooses blocks from competing miners - by delaying approximately 10 minutes for either to try again. (OK, it's a weak analogy, sorry, but do you understand what I'm trying to convey?)

What is the likelihood of a permanent deadlock under these conditions?

...or maybe, instead of the random delay, the global lock is the fallback method that would only ever be invoked in this situation.
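
For what it's worth, a minimal sketch of the release-and-retry idea with per-chunk try-locks (hypothetical types, not localstore code):

```go
package chunklock

import (
	"math/rand"
	"time"
)

// tryLock is a non-blocking lock implemented as a buffered channel of size 1.
type tryLock chan struct{}

func newTryLock() tryLock { return make(tryLock, 1) }

func (l tryLock) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l tryLock) release() { <-l }

// lockAll acquires every lock, or releases whatever it got and backs off for
// a random 0-100ms delay before trying again, as suggested above.
func lockAll(locks ...tryLock) {
	for {
		acquired := make([]tryLock, 0, len(locks))
		ok := true
		for _, l := range locks {
			if l.tryAcquire() {
				acquired = append(acquired, l)
				continue
			}
			ok = false
			break
		}
		if ok {
			return
		}
		for _, l := range acquired {
			l.release()
		}
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
	}
}
```

Alternatively, always acquiring the per-chunk locks in a fixed global order (e.g. sorted by hash) with blocking locks would avoid deadlock without any retries.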

cobordism commented 5 years ago

Every chunk that is encountered is added to the top of the queue. Garbage collection happens at the bottom of the queue.

We make no distinction between chunks that are encountered due to syncing, due to a retrieval request originating elsewhere, or due to local data consumption on the node. This creates multiple problems.

One example: if a person running a node decides to binge watch a whole season of some TV show, then every one of those chunks will be added to the top of the queue over a relatively short period of time. This pushes all other data to the bottom of the queue and ultimately to the garbage. The node in question is then no longer able to serve chunks that it is notionally responsible for (those to which it is most proximate); it is unhealthy as far as the network is concerned. The node has also cached a lot of chunks that are unlikely to be profitably requested from it by any other node, so it is unhealthy from an economic perspective too.

Another example: if I upload 100GB to my local node, even if I manage to wait for every chunk to sync before I garbage collect, I will lose all the previously encountered chunks in my most proximate bin and I will be holding chunks from that 100GB collection that are so far from my node ID that I am unlikely to ever be able to profitably sell them. Thus I end up with an unhealthy node just as in the previous example.

The first thing a new GC queue will need is the basic infrastructure to handle chunks differently depending on context - to insert chunks at different points in a GC queue, or to maintain multiple queues. Only after such basic infrastructure is in place can we fine-tune how we deal with the different cases, such as chunks encountered

  1. during syncing (remote upload)
  2. During upload of content to the local node (local upload)
  3. due to a retrieval request originating at another node (remote download)
  4. due to a local retrieval request (local download)

and possibly

  1. due to a request from a Swarm light client [and how to deal with this will depend on the implementation of the light protocol]
  2. retrieval requests served due to operation of a swarm gateway.

Each of these might require different operations of GC.

The ideas discussed above do not delve much into where in the chunk queue each chunk should be inserted, but mostly focus on how to maintain a chunk queue with multiple insertion points in the first place. The two main threads of the discussion so far are

  1. whether to maintain a full, explicit, secondary ordering of chunks (perhaps implemented as a doubly linked list in the db) so that we can remove and insert chunks at specific points in the list, OR whether we can approximate that behaviour using random sampling to get good-enough behaviour at much lower computational cost, OR whether we can do without the linked list entirely and instead award each chunk a 'score' (one that is updated whenever the chunk is accessed), such that the lowest scoring chunks are the ones to be deleted.

  2. whether to maintain a separate GC queue for data downloaded or uploaded by the local node - allowing a user to 'pin' data locally, or to cache whatever data they want to use locally (and thus only pay for download once) - or even to maintain many GC queues ...

Note: the question of the underlying queue - how to manage insertions, deletions and reorderings - is separate from the discussion of how we should handle the different scenarios. However, they are not independent, as the discussion about a second GC queue for local data shows.


cobordism commented 5 years ago

Let's move this discussion over to https://swarmresear.ch/t/garbage-collection/16/2