Open felipedau opened 7 years ago
Due to the recent work I did for splitting long elements (#89), I felt like I should explain the current problem regarding the way unMessage serializes its packets. I think it will be easier to demonstrate it with the Python interpreter.
Setup:
>>> from math import ceil
>>>
>>> from twisted.internet import reactor
>>>
>>> from unmessage.contact import Contact
>>> from unmessage.elements import Element, PartialElement
>>> from unmessage.peer import a2b, Peer
>>>
>>> alice = Peer('alice', reactor)
>>> bob = Peer('bob', reactor)
>>> out_request = bob._create_request(Contact(alice.identity, alice.identity_keys.pub))
>>> in_request = alice._process_request(str(out_request.packet))
>>> conv_a = in_request.conversation
The plaintext in unMessage consists of elements. These are objects which contain information about the actions involved in a conversation, such as sending a message (MessageElement), performing authentication (AuthenticationElement), etc. In this example I am going to use the base class (Element), but it could be any other:
>>> e = Element('example')
>>> e.serialize()
'{"content": "example"}'
Partial elements are objects representing an element possibly split into multiple parts. These objects can create element packets (when sending an element) or be created from element packets (when receiving an element).
The sender uses from_element, passing the element that will be sent. When max_len is omitted, a partial of a single part is created:
>>> partial = PartialElement.from_element(e)
>>> packets = partial.to_packets()
>>> packets
[ElementPacket(type_='elmt', id_='ynE=', part_num=0, part_total=1, payload='{"content": "example"}')]
>>> print str(packets[0])
elmt
ynE=
0
1
{"content": "example"}
With the ID, the receiver is able to know which element the part belongs to. With the number of the part, the receiver is able to group them in the right order. With the total, the receiver is able to identify when the partial element is complete and can finally become an element. With the type, the receiver is able to deserialize to the correct element class.
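To illustrate how a receiver could use those four fields, here is a minimal sketch. It is not unMessage's PartialElement: Packet and Reassembler are hypothetical stand-ins I made up for this example, written for Python 3, and they only mirror the fields shown in the packet above.

```python
import json
from collections import namedtuple

# Hypothetical stand-in mirroring the fields of the packet shown above.
Packet = namedtuple('Packet', 'type_ id_ part_num part_total payload')


class Reassembler(object):
    """Collect parts per element ID until every part has arrived."""

    def __init__(self):
        self.partials = {}  # id_ -> {part_num: payload}

    def add(self, packet):
        parts = self.partials.setdefault(packet.id_, {})
        parts[packet.part_num] = packet.payload
        if len(parts) < packet.part_total:
            return None  # the partial element is still incomplete
        del self.partials[packet.id_]
        # Join the payloads in part order, whatever order they arrived in.
        serialized = ''.join(parts[n] for n in range(packet.part_total))
        # The type tells the caller which element class to build.
        return packet.type_, json.loads(serialized)
```

Feeding it parts out of order returns None until the last missing part arrives, and then returns the type together with the deserialized content.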
When max_len is passed, the element is split into parts that fit that length:
>>> partial = PartialElement.from_element(e, max_len=10)
>>> packets = partial.to_packets()
>>> for packet in packets:
...     print str(packet)
...     print
...
elmt
Ce4=
0
4
{"cont
elmt
Ce4=
1
4
ent":
elmt
Ce4=
2
4
"examp
elmt
Ce4=
3
4
le"}
>>> packets[0]
ElementPacket(type_='elmt', id_='Ce4=', part_num=0, part_total=4, payload='{"cont')
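The splitting itself can be sketched as plain string slicing. This is only an illustration for Python 3, not the code in PartialElement: split_payload is a hypothetical helper, and it takes the payload length directly (6 characters in the output above) because how max_len maps to that length is not shown here.

```python
def split_payload(serialized, payload_len):
    """Split a serialized element into chunks of at most payload_len chars."""
    if payload_len <= 0:
        return [serialized]  # no limit: a single part
    return [serialized[i:i + payload_len]
            for i in range(0, len(serialized), payload_len)] or ['']


chunks = split_payload('{"content": "example"}', 6)
for part_num, chunk in enumerate(chunks):
    # part number, total number of parts, payload of this part
    print(part_num, len(chunks), repr(chunk))
```

Joining the chunks in order gives back the serialized element, which is what the receiver relies on.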
So far, no problems. Although simple, Element, PartialElement and ElementPacket seem to work well. With the element packet ready to be sent, it is encrypted with pyaxo's AxolotlConversation:
>>> plaintext = str(packets[0])
>>> len(plaintext)
20
>>> print plaintext
elmt
Ce4=
0
4
{"cont
>>> ciphertext = conv_a.axolotl.encrypt(plaintext)
>>> len(ciphertext)
140
Although pyaxo's overhead could be decreased, it is not the problem, because it is just an additional 120 bytes for any plaintext. The problem arises when the element packets are encrypted and become a RegularPacket. As we did not have the packet format completely defined, the easiest way to serialize the regular packets was to encode all of their parts in base64 and separate them with line breaks:
>>> encrypted_packet = conv_a._encrypt(packets[0])
>>> encrypted_packet
RegularPacket(iv='zIyS6FQP0I0=', iv_hash='/NJNYwRPSosgGO7ZXvttQwsPooAdljYySOZ7mywGMJk=', payload_hash='pgiptBOOcwLfggq65dEjLtLHDXTlr8dLy0bipgqmy3g=', handshake_key='', payload='KPKkymiouJHHZN1a93PNjm6Iibn6RNgpdRd2JQ5SPOQfs37XlLWuCG2LKLcLbi2uwfC9Mf6/ZOzzb4utcNPASid9BfMH+mbDFp9J/Ld/BQK8LSObFn5tRi01gEUu4ZvuiBb3bpFBg5LsEO+DIJKyvvjzFbpWIMvyl0G2rptj8/nJtlzX8IavtN+h6wQ=')
>>> len(str(encrypted_packet))
292
>>> print str(encrypted_packet)
zIyS6FQP0I0=
/NJNYwRPSosgGO7ZXvttQwsPooAdljYySOZ7mywGMJk=
pgiptBOOcwLfggq65dEjLtLHDXTlr8dLy0bipgqmy3g=
KPKkymiouJHHZN1a93PNjm6Iibn6RNgpdRd2JQ5SPOQfs37XlLWuCG2LKLcLbi2uwfC9Mf6/ZOzzb4utcNPASid9BfMH+mbDFp9J/Ld/BQK8LSObFn5tRi01gEUu4ZvuiBb3bpFBg5LsEO+DIJKyvvjzFbpWIMvyl0G2rptj8/nJtlzX8IavtN+h6wQ=
>>> byte_lens = [len(a2b(encrypted_packet.iv)), len(a2b(encrypted_packet.iv_hash)), len(a2b(encrypted_packet.payload_hash)), len(a2b(encrypted_packet.handshake_key)), len(a2b(encrypted_packet.payload))]
>>> byte_lens
[8, 32, 32, 0, 140]
>>> sum(byte_lens)
212
The size of the regular packet in bytes is 212. Due to this "serialization", the final string that is sent becomes much longer:
>>> line_breaks = 4
>>> base64_lens = [len(encrypted_packet.iv), len(encrypted_packet.iv_hash), len(encrypted_packet.payload_hash), len(encrypted_packet.handshake_key), len(encrypted_packet.payload), line_breaks]
>>> base64_lens
[12, 44, 44, 0, 188, 4]
>>> sum(base64_lens)
292
Growing by "just" 80 bytes does not seem like such a big deal. The problem is that the biggest portion of this overhead is the payload's, which is variable:
>>> ceil(len(ciphertext) / 3.) * 4
188.0
Any payload of any size will be about one third bigger (34% for this 140-byte payload, because base64 pads to a multiple of 4 characters), and it becomes a real problem when dealing with long packets (e.g., file transfer).
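To see how this overhead scales, the sketch below compares the raw and base64 lengths for a few payload sizes using only the standard library (Python 3). The helper b64_len is a name I made up for this example; the sizes other than 140 are arbitrary.

```python
import base64
import os


def b64_len(n):
    """Length of the padded base64 encoding of n bytes."""
    return 4 * ((n + 2) // 3)


for n in (140, 1024, 1024 * 1024):
    encoded = base64.b64encode(os.urandom(n))
    # The formula agrees with what the encoder actually produces.
    assert len(encoded) == b64_len(n)
    print(n, len(encoded), len(encoded) - n)
```

The extra bytes grow linearly with the payload, so a file transfer pays the same roughly one third penalty on every packet.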
What we have to do is define exactly what we want in this packet format and then transmit only bytes and slice each part accordingly. I think the packet format we currently have is alright, but we need to read more about other formats and see if we need to add/remove something. One thing we definitely need is to add a version number (#53). Finally, we define the fixed size that packets should have and pad them (#58).
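As a starting point for that discussion, here is a rough sketch of a versioned, fixed-size packet (Python 3). Everything in it is an assumption: the 1-byte version, the 2-byte length field, the zero padding, the 512-byte size and the names pad and unpad are all hypothetical, and none of it is decided in #53 or #58.

```python
import struct

VERSION = 1          # hypothetical version number
PACKET_SIZE = 512    # hypothetical fixed packet size, in bytes
HEADER = struct.Struct('>BH')  # version (1 byte), body length (2 bytes)


def pad(body):
    """Prefix version and length, then zero-pad to PACKET_SIZE bytes."""
    room = PACKET_SIZE - HEADER.size
    if len(body) > room:
        raise ValueError('body does not fit in one packet')
    return HEADER.pack(VERSION, len(body)) + body.ljust(room, b'\0')


def unpad(packet):
    """Recover the version and the original body from a padded packet."""
    if len(packet) != PACKET_SIZE:
        raise ValueError('unexpected packet size')
    version, length = HEADER.unpack_from(packet)
    start = HEADER.size
    return version, packet[start:start + length]
```

If the goal is to hide the real length from an observer, the length field would have to end up inside the encrypted part of the packet.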
Notes:
If we move forward with #87, we can remove the handshake key from the packet. It is empty once the conversation has been established but when Alice is first replying to accept the request, it is 72 bytes that we have to expect before the payload.
After we decide about the size (or sizes) of packets, it would be interesting if we set the maximum length of strings based on the manager of the netstring receiver.
As suggested by @david415, once packets have a fixed size, more than one size could be used, for example, one for short elements and one for long elements.
The packets are currently "serialized" by encoding each part to base64 and joining the parts with line breaks. After receiving a packet, its parts are "deserialized" with packet.splitlines(). Once the sizes are fixed, the parts should instead be concatenated as raw bytes and then deserialized with slice operations, as the size of each part would be known.
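A minimal sketch of the slice-based approach, for Python 3, could look like this. It assumes the handshake key has already been removed from the packet (as discussed for #87) and uses the byte lengths measured earlier (8 for the IV, 32 for each hash). RawPacket, serialize and deserialize are hypothetical names for this example, not unMessage's API.

```python
import os
from collections import namedtuple

# Hypothetical raw-bytes counterpart of the regular packet, without the
# handshake key (assumption: it is removed as discussed for #87).
RawPacket = namedtuple('RawPacket', 'iv iv_hash payload_hash payload')

IV_LEN = 8
HASH_LEN = 32
HEADER_LEN = IV_LEN + 2 * HASH_LEN


def serialize(packet):
    """Concatenate the raw parts; no base64 and no separators."""
    lengths = (len(packet.iv), len(packet.iv_hash), len(packet.payload_hash))
    if lengths != (IV_LEN, HASH_LEN, HASH_LEN):
        raise ValueError('unexpected field length')
    return packet.iv + packet.iv_hash + packet.payload_hash + packet.payload


def deserialize(data):
    """Slice the parts back out using their known sizes."""
    if len(data) < HEADER_LEN:
        raise ValueError('packet too short')
    iv = data[:IV_LEN]
    iv_hash = data[IV_LEN:IV_LEN + HASH_LEN]
    payload_hash = data[IV_LEN + HASH_LEN:HEADER_LEN]
    payload = data[HEADER_LEN:]  # the payload is whatever remains
    return RawPacket(iv, iv_hash, payload_hash, payload)


# Random bytes stand in for the real IV, hashes and 140-byte ciphertext.
packet = RawPacket(os.urandom(8), os.urandom(32), os.urandom(32),
                   os.urandom(140))
data = serialize(packet)
print(len(data))  # 212
```

With the same field sizes as the example above, the serialized packet is 212 bytes instead of the 292 characters of the base64 version.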