junchen108 opened this issue 6 years ago
Found some bugs in the ingress server implementation. One of these bugs is likely the cause of this issue. @lbarman @italod @stevens-leblond @aledcuevas
Bug 1: IG uses a slice (a dynamic array in Go) to store connections, e.g. [conn1, conn2, conn3, conn4, ...]
To dispatch incoming data, a for loop iterates from the first element until the matching connection is found.
Since this slice grows over time and new connections are always appended at the end, write operations take longer and longer to complete.
I am replacing the slice with a map; the performance is being tested.
Bug 2: The ID that IG assigns to every new connection is not unique.
// generateRandomID samples 32 random bits and returns them as a decimal string
func generateRandomID() string {
var n uint32
binary.Read(rand.Reader, binary.LittleEndian, &n)
return strconv.Itoa(int(n))
}
The generated random ID is the decimal string of an integer, for example "3503164205", "1433617593", etc. However, IG uses only the first 4 bytes of this string to identify connections.
// simplified and modified code
id := generateRandomID()
...
mc.ID_bytes = ID_bytes[0:4]
...
ID := incomingData[0:4]
...
if bytes.Equal(mc.ID_bytes, ID) {...}
Basically there are only 9 * 10 * 10 * 10 = 9000 possible IDs, and duplication occurs very often.
This means that over time more and more connections are dispatched to the wrong destination, and the correct connection never receives its data.
Bug 3: The write operation v.conn.Write(data) can return a broken pipe error. This error is not handled by the current code.
Bug 4: IG stores all connections and never removes inactive ones from its data structure. I am not sure whether we need to keep all connections, but I don't have a way to identify closed ones.
The probability that Bug 2 is the root cause of this "performance drop" issue is very high.
Thanks a lot @junchen108 for the analysis. I will try to confirm.
Each connection has its own randomID generated by sampling 32 random bits (one uint32).
Basically there are only 9 * 10 * 10 * 10 = 9000 possible IDs, and the duplication occurs very often.
Well, perhaps there's another bug here, but 4 bytes is 2^8 * 2^8 * 2^8 * 2^8 = 2^32, not 9000. The bytes are taken from ID_bytes[0:4], not id[0:4] (where your analysis would be correct).
BTW it seems that 32 bits is also perhaps too low: https://preshing.com/20110504/hash-collision-probabilities/; say we have 10k channels, the probability of a collision is about 1/100, which is not negligible. We should log things there to be sure.
@lbarman Hi, thanks for looking at this.
The randomID is indeed generated by sampling 32 random bits:
func generateRandomID() string {
var n uint32
binary.Read(rand.Reader, binary.LittleEndian, &n)
return strconv.Itoa(int(n))
}
But this function returns the string of this random int (strconv.Itoa(int(n))), for example "3503164205", "1433617593", etc.
Then
id := generateRandomID() // string of an int
mc := new(MultiplexedConnection)
mc.conn = conn
mc.ID = id
ID_bytes := []byte(id) // this line returns UTF-8 encoded version of the string
mc.ID_bytes = ID_bytes[0:4] // UTF-8 encoded version of the first 4 characters
For instance, if the random id is "3503164205", what the connection really gets is 0x33 0x35 0x30 0x33, which is the UTF-8 encoding of '3' '5' '0' '3'. Hence, I think there are only 9000 possibilities (1-9, 0-9, 0-9, 0-9).
It will work if we don't do strconv.Itoa(int(n)).
Right, nice catch! I'll let you make the change since you're already working on the first point you mentioned.
ok!
415548f should fix the ID collision bug.
There is one more thing that I want to do. The egress server (EG) of the relay stores all the connections of the whole PriFi network in a map, keyed by the ID that each connection receives from its client.
This is not ideal: if two clients happen to generate the same ID, a collision occurs again.
To avoid this problem, EG should track connections in its own way.
After having PriFi run for 45 to 60 minutes, we usually experience a performance drop on the client side.
Restarting PriFi on the client fixes the problem.
One of the potential roots of the issue is the ingress server. The following logs (captured after a long period of execution) show:
- Unusually long processing times between connections (2 seconds, 4 seconds, sometimes even longer)
- A lot of connection resets