junchen108 opened this issue 6 years ago
Found some bugs in the ingress server implementation. One of these bugs is likely the cause of this issue. @lbarman @italod @stevens-leblond @aledcuevas
Bug 1: IG uses a slice (a dynamic array in Go) to store connections, e.g. [conn1, conn2, conn3, conn4, ...]
To dispatch incoming data, a for loop iterates from the first element until the matching connection is found.
Since this slice grows over time and new connections are always appended at the end, write operations take longer and longer to complete.
I am replacing the slice with a map; the performance is being tested.
Bug 2: The ID that IG assigns to every new connection is not unique.
// generateRandomID samples 32 random bits and returns them as a decimal string
func generateRandomID() string {
var n uint32
binary.Read(rand.Reader, binary.LittleEndian, &n)
return strconv.Itoa(int(n))
}
The generated random ID is the decimal string of an integer, for example "3503164205", "1433617593", etc. However, IG uses only the first 4 bytes of this string to identify connections.
// simplified and modified code
id := generateRandomID()
...
mc.ID_bytes = ID_bytes[0:4]
...
ID := incomingData[0:4]
...
if bytes.Equal(mc.ID_bytes, ID) {...}
Basically there are only 9 * 10 * 10 * 10 = 9000 possible IDs, and duplication occurs very often.
This means that over time more and more connections are dispatched to the wrong destination, and the correct connection never receives its data.
Bug 3: The write operation v.conn.Write(data) can return a broken pipe error. This error is not handled by the current code.
Bug 4: IG stores all connections and never removes inactive ones from its data structure. I am not sure whether we need to keep all connections, but I don't have a way to identify closed ones.
The probability that Bug 2 is the root cause of this "performance drop" issue is very high.
Thanks a lot @junchen108 for the analysis. I will try to confirm.
Each connection has its own randomID generated by sampling 32 random bits (one uint32).
Basically there are only 9 * 10 * 10 * 10 = 9000 possible IDs, and the duplication occurs very often.
Well, perhaps there's another bug here, but 4 bytes is 2^8 * 2^8 * 2^8 * 2^8 = 2^32, not 9000. The bytes are taken from ID_bytes[0:4], not id[0:4] (where your analysis would be correct).
BTW it seems that 32 bits is also perhaps too low: https://preshing.com/20110504/hash-collision-probabilities/; say we have 10k channels, the probability of a collision is about 1/100, which is not negligible. We should log things there to be sure.
@lbarman Hi, thanks for looking at this.
The randomID is indeed generated by sampling 32 random bits:
func generateRandomID() string {
var n uint32
binary.Read(rand.Reader, binary.LittleEndian, &n)
return strconv.Itoa(int(n))
}
But this function returns the string of this random int (strconv.Itoa(int(n))), for example "3503164205", "1433617593", etc.
Then
id := generateRandomID() // string of an int
mc := new(MultiplexedConnection)
mc.conn = conn
mc.ID = id
ID_bytes := []byte(id) // this line returns UTF-8 encoded version of the string
mc.ID_bytes = ID_bytes[0:4] // UTF-8 encoded version of the first 4 characters
For instance, if the random id is "3503164205", what the connection really gets is 0x33 0x35 0x30 0x33, which is the UTF-8 encoding of '3' '5' '0' '3'. Hence, I think there are only 9000 possibilities (1-9, 0-9, 0-9, 0-9).
It will work if we don't do strconv.Itoa(int(n)).
Right, nice catch! I'll let you make the change since you're already working on the first point you mentioned.
ok!
415548f should fix the ID collision bug.
There is one more thing that I want to do. The egress server (EG) of the relay stores all the connections of the whole PriFi network in a map, keyed by the ID that each connection receives from its client.
This is not ideal: if two clients happen to generate the same ID, a collision occurs again.
To avoid this problem, EG should track connections in its own way.
After having PriFi run for 45 to 60 minutes, we usually experience a performance drop on the client side.
Restarting PriFi on the client fixes the problem.
One of the potential roots of the issue is the ingress server. The following logs (captured after a long period of execution) show:
- Unusually long processing times between connections (2 seconds, 4 seconds, sometimes even longer)
- A lot of connection resets