MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

v03 should base64 encoding include line feeds? line feeds in body should be OK. #234

Closed: petersilva closed this issue 2 years ago

petersilva commented 5 years ago

Implemented v03 posting in sarrac (the C implementation), and it did base64 encoding where a line feed is inserted after every 76 encoded bytes. During initial interop tests between C and Python, I noticed that the Python implementation failed when the line feeds were included. This happens:

2019-07-21 09:19:10,865 [ERROR] sr_consumer/consume malformed message {'frame_method': (60, 71), 'frame_args': b'\x00<\x00G\x00\x00\x00\x00\x00\x00\x00\x01\x00\x16xs_tsource_cpost_watch\x1ev03.post.home.peter.src.sarrac\x00\x00\x00\x00', 'properties': {'content_type': 'text/plain', 'content_encoding': 'utf-8', 'delivery_mode': 2}, '_pending_chunks': [], 'body_received': 478, 'body_size': 478, 'ready': True, 'body': '{\n\t"pubTime" : "20190721T131909.689472999",\n\t"baseUrl" : "sftp://sarra_test@localhost",\n\t"relPath" : "/home/peter/src/sarrac/uthash.h",\n\t"from_cluster" : "localhost",\n\t"to_clusters" : "localhost",\n\t"size" : "58550",\n\t"atime" : "20190721T031946.770120617",\n\t"mode" : "644",\n\t"mtime" : "20190613T002434.125290492",\n\t"integrity" : {  "method" : "sha512", "value" : "LjAHu101pfeUyaWTa+gt6hilhxtWiMKtFnwzOrpiH99uN3ryv+huYCgPpZS2OaRs1WpFqfoGNg1Zf6\nT9DeFzOw=="  } ,\n\t"toto" : "pig"\n\t}\n', 'channel': <amqp.channel.Channel object at 0x7f70dded8780>, 'delivery_info': {'delivery_tag': 1, 'redelivered': False, 'exchange': 'xs_tsource_cpost_watch', 'routing_key': 'v03.post.home.peter.src.sarrac', 'message_count': 0}, 'isRetry': False}
2019-07-21 09:19:10,865 [DEBUG] Exception details: 
Traceback (most recent call last):
  File "/home/peter/src/sarracenia/sarra/sr_consumer.py", line 170, in consume
    self.msg.from_amqplib(self.raw_msg)
  File "/home/peter/src/sarracenia/sarra/sr_message.py", line 257, in from_amqplib
    self.headers = json.loads(msg.body)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 11 column 129 (char 441)

For now, the C version was changed to insert fake line feeds (literally the two characters \n) into the string instead of real line feeds, so the existing Python code works. This requires some investigation. Very likely real line feeds should be used.
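A minimal sketch of the failure, assuming a made-up message body (the `value` key and its contents are hypothetical, only the shape mirrors the log above). A real line feed inside a JSON string makes `json.loads` raise, while the two-character escape parses and yields a real line feed after decoding:

```python
import json

# Hypothetical body: a JSON string value that contains a real line feed,
# similar in shape to the "integrity" value in the log above.
raw_body = '{"value": "AAAA\nBBBB"}'

try:
    json.loads(raw_body)
    raw_error = None
except json.JSONDecodeError as err:
    raw_error = err.msg  # the parser rejects control characters in strings

print("raw line feed:", raw_error)

# Same content, but the line feed is written as backslash + n.
escaped_body = '{"value": "AAAA\\nBBBB"}'
decoded = json.loads(escaped_body)
print("escaped line feed:", repr(decoded["value"]))
```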

Similarly, the C implementation has an option that pretty-prints the JSON (with line feeds after each header), and that crashes the Python implementation as well.

petersilva commented 5 years ago

To test this, you need the v03 branch of sarrac. Then look in sr_util.c for the hex2base64 routine, where the correct code for line feeds is kept in comments. The easiest way to find it is to search for 76 in the code.

ghost commented 4 years ago

https://github.com/MetPX/sarrac/issues/18

ghost commented 4 years ago

https://stackoverflow.com/questions/2392766/multiline-strings-in-json

petersilva commented 4 years ago

Great reference! Since JSON doesn't allow strings to be split across newlines, it looks like the Python implementation is doing the right thing. Unfortunately, that means we're going to end up with a single line the size of the base64 encoding of the largest embedded file, plus all the backslashes and n's that represent the line feeds. Hmm... likely something will break. I guess the question is answered, though.

petersilva commented 4 years ago

turns out we were wrong...

from @josusky :+1:

Any non-alphabet characters in base64 encoded data are to be ignored. That means it is possible to add new line but it is not desired. RFC 4648 basically says that implementations MUST NOT add line feeds unless other (higher level) specification requires it. As we have higher level specification (JSON) that forbids new lines, I consider this topic closed :-) (For more information about base64 encoding see https://tools.ietf.org/html/rfc4648)

Jan is right. It isn't base64 itself that wants line feeds every so often; many libraries that claim to implement it insert them because of MIME. I was tricked by the Python library doing the encoding for me. Looking more closely at the relevant standards, here is the normative IETF RFC for base64: https://tools.ietf.org/html/rfc4648#page-3


3.1.  Line Feeds in Encoded Data

   MIME [4] is often used as a reference for base 64 encoding.  However,
   MIME does not define "base 64" per se, but rather a "base 64 Content-
   Transfer-Encoding" for use within MIME.  As such, MIME enforces a
   limit on line length of base 64-encoded data to 76 characters.  MIME
   inherits the encoding from Privacy Enhanced Mail (PEM) [3], stating
   that it is "virtually identical"; however, PEM uses a line length of
   64 characters.  The MIME and PEM limits are both due to limits within
   SMTP.

   Implementations MUST NOT add line feeds to base-encoded data unless
   the specification referring to this document explicitly directs base
   encoders to add line feeds after a specific number of characters.

The natural Python library choice instead implements RFC 2045:

base64.encode(input, output) Encode the contents of the binary input file and write the resulting base64 encoded data to the output file. input and output must be file objects. input will be read until input.read() returns an empty bytes object. encode() inserts a newline character (b'\n') after every 76 bytes of the output, as well as ensuring that the output always ends with a newline, as per RFC 2045 (MIME).

There is another routine, b64encode, that does the right thing: it does not insert line feeds.

We need to switch to that.
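A small sketch of the difference, using arbitrary sample bytes (the `payload` value is made up for illustration). `base64.encode` wraps its output MIME-style, `base64.b64encode` returns one unbroken string, and the default `base64.b64decode` discards the line feeds unless `validate=True` is passed:

```python
import base64
import binascii
import io

payload = bytes(range(100))  # arbitrary sample data, long enough to wrap

# File-object interface quoted above: wraps output as per RFC 2045 (MIME).
out = io.BytesIO()
base64.encode(io.BytesIO(payload), out)
mime_style = out.getvalue()

# b64encode adds no line feeds at all.
plain = base64.b64encode(payload)

print("line feeds, base64.encode:", mime_style.count(b"\n"))
print("line feeds, b64encode:    ", plain.count(b"\n"))

# By default the decoder discards non-alphabet characters such as "\n".
lenient = base64.b64decode(mime_style)

# With validate=True the same input is rejected.
try:
    base64.b64decode(mime_style, validate=True)
    strict_rejected = False
except binascii.Error:
    strict_rejected = True

print("lenient decode matches:", lenient == payload)
print("strict decode rejected:", strict_rejected)
```

Since only `b64encode` output is safe to embed in a JSON string without escaping, producers would use it, while consumers can stay lenient when decoding.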

petersilva commented 4 years ago

So I just reviewed the source code, and it is using b64encode/b64decode, which is the newer interface and thus does not embed line feeds in the content. Confirmed by inspecting sample data from the WMO_sketch stream: the base64-encoded strings have no embedded fake line feeds, just a stream of bytes.

petersilva commented 4 years ago

For utf-8 encoding, we just use the Python json module. Strings in JSON cannot contain raw newlines, so the dumps routine replaces line feeds and carriage returns with the escape sequences \n and \r respectively. json.loads reverses the transformation, so it should work. Is this the right, standard thing to do?

some reading materials:

I think the python library handling is fine, and should be adopted, but views may differ.
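As a rough illustration of that round trip (the `content` key and the sample text are made up for this sketch), `json.dumps` escapes line feeds and carriage returns, and `json.loads` restores them:

```python
import json

# Hypothetical utf-8 content with a line feed and a carriage return in it.
text = "first line\nsecond line\r\n"

# dumps writes the control characters as two-character escape sequences,
# so the encoded message contains no raw line feeds or carriage returns.
encoded = json.dumps({"content": text})
print(encoded)

# loads reverses the transformation.
decoded = json.loads(encoded)["content"]
print(decoded == text)
```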

petersilva commented 4 years ago

I found the problem that I remembered. While the content header is properly encoded and decoded (using base64.b64encode()), the integrity checksum value uses codecs.encode( ... ,'base64'), which does things MIME-style. So the checksum encoding/decoding does need to be updated.
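A sketch of why the checksum is affected, assuming a sha512 digest of made-up input (the actual call sites in sarracenia are not shown here). A sha512 digest encodes to 88 base64 characters, which is longer than 76, so the `'base64'` codec splits it across lines while `base64.b64encode` does not:

```python
import base64
import codecs
import hashlib

# Hypothetical input; any sha512 digest is 64 bytes long.
digest = hashlib.sha512(b"example file content").digest()

# MIME-style: a line feed after 76 characters, plus a trailing one.
mime_style = codecs.encode(digest, "base64")

# RFC 4648 style: a single unbroken string.
plain = base64.b64encode(digest)

print(mime_style)
print(plain)
print(len(plain))
```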

petersilva commented 4 years ago

in both C and python versions.

petersilva commented 2 years ago

answer: No ... no linefeeds in checksum body.