MetPX / wmo_mesh

Minimal sample to demonstrate a mesh network with a pub/sub message-passing protocol (MQTT in this case).

Option to include compression of inlined messages. #9

Closed: petersilva closed this issue 4 years ago

petersilva commented 5 years ago

In Issue #3, David Podeur raised the issue of compression. To me that is just another encoding. The current implementation has only "utf-8" and "base64"; we would need to add something like "gzip-base64"... or do we just switch to using that all the time? The thing is, I only see this being useful for small messages, and in small messages the size of the JSON message itself is likely so significant that compressing the payload doesn't change much in terms of overall bytes on the wire...

Tom Kralidis suggested doing some tests. One would need to pick some candidate algorithms and try them out on some realistic data. Is compression something that belongs at this level? Or will the data representation teams not want compression here?
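A quick way to run such a test, as a minimal sketch using only the Python standard library (the file name below is a placeholder for whatever realistic product is used):

```python
# Hedged sketch: compare stdlib compressors on a sample product.
# "sample.iwxxm" is a placeholder; substitute any realistic payload.
import bz2
import gzip
import lzma
import time

with open("sample.iwxxm", "rb") as f:
    data = f.read()

for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data)} -> {len(out)} bytes in {elapsed*1000:.1f} ms")
```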

petersilva commented 5 years ago

From Yves Pelletier, in the ET-DRC or MD (not sure of the current name of the team):

General purpose compression algorithms aren't great for most gridded meteorological data, and I would extend that assessment to most meteorological data in binary formats. WMO binary formats come with a requirement for compactness, and usually have in-built compression with appropriate algorithms for non-text data. Compression over the network (CON) would be beneficial mostly for text-based formats: XML, JSON, whatnot. Whether that is worth the effort, or may even be counterproductive, depends on the ratio of the volume of uncompressed data versus compact or compressed binary data that would be a waste of CPU to try to compress further.

Ideally, CON should be a choice that could be configured according to the type of data at hand, or made on the fly, or just-in-time. If that is not possible and CON must always be turned on, then I would argue for choosing a ruthlessly fast algorithm, even at the cost of not achieving a very high compression ratio. A very fast compression algorithm would still be reasonably effective with text formats, while just acting as a low-cost bit-scrambler on already dense data.

josusky commented 4 years ago

In the aviation world, which is very keen on XML-based formats (IWXXM), the standard is to use gzip compression (of the FTBP in AMHS messages). It is mentioned, for example, in https://www.icao.int/airnavigation/METP/Panel%20Documents/Guidelines%20for%20the%20implementation%20of%20OPMET%20data%20exchange%20using%20IWXXM.pdf

I think that gzip matches Yves' requirements of "ruthlessly fast", "effective with text (XML)", and "low-cost bit-scrambler".

josusky commented 4 years ago

Quoting my own e-mail sent to ET-CTS:

We just need to standardize how to indicate it. One proposal was to use something like "gzip-base64". I am starting to feel more relaxed, so I would suggest using simply "gzip", as the base64 encoding on top of it can be considered implicit. Thus we would have three possible values for the "encoding" of the "content" object:

  • utf-8 // suitable for TAC messages (basically no data conversion at all)
  • base64 // suitable for BUFR or other binary data that are hard to compress
  • gzip // suitable for almost everything, particularly XML (IWXXM, CAP)
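
For illustration, a minimal sketch of how a publisher might build the "content" object for each of the three values. The function name is hypothetical; the "encoding"/"value" field names follow this thread:

```python
import base64
import gzip
import json

def make_content(payload: bytes, encoding: str) -> dict:
    """Build the "content" object under the three proposed encodings."""
    if encoding == "utf-8":
        # plain text, e.g. a TAC message; payload must be valid UTF-8
        value = payload.decode("utf-8")
    elif encoding == "base64":
        # binary data such as BUFR, no compression
        value = base64.b64encode(payload).decode("ascii")
    elif encoding == "gzip":
        # gzip first, then base64 (the base64 step is implicit in the name)
        value = base64.b64encode(gzip.compress(payload)).decode("ascii")
    else:
        raise ValueError(f"unknown encoding: {encoding}")
    return {"encoding": encoding, "value": value}

print(json.dumps(make_content(b"<iwxxm/>", "gzip")))
```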
petersilva commented 4 years ago

that all sounds good to me. no issue at all.

petersilva commented 4 years ago

Question: do we make base64 the default, so that no "encoding" tag is needed in that case? (Then the German example from last week would have been OK without such a tag.)

petersilva commented 4 years ago

"content" : "base64encodedstring" or if there is more to it, then "content" : { "encoding": "gzip" , value: "base64encodedgzippeddata" } is that a desirable thing?

petersilva commented 4 years ago

another weirdness I did, not sure if it is OK: since one cannot have literal line feeds and carriage returns in JSON strings, when sending "utf-8" I replace them with \n and \r respectively.... hmm...
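
(This is the escaping a standard JSON serializer performs automatically; a minimal check, assuming Python's standard json module:)

```python
import json

# JSON string escaping turns real control characters into \r and \n
print(json.dumps("TAF AMD\r\nKLMK ..."))
# prints: "TAF AMD\r\nKLMK ..."
```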

petersilva commented 4 years ago

Sample American IWXXM file (available in the WMO_Sketch feed): original XML, gzip compressed, then base64 encoded.

blacklab% ls -al LT*
-rw-r--r-- 1 peter peter 9793 Feb  1 11:21 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm
-rw-r--r-- 1 peter peter 2309 Feb  1 11:21 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.gz
-rw-r--r-- 1 peter peter 3121 Feb  1 13:05 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.gz.base64
blacklab%

derivation (on a Linux server):


 wget http://hpfx.collab.science.gc.ca/~pas037/WMO_Sketch/20200201T16/GTS/KLMK/LT/LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm

gzip LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm

base64 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.gz >LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.gz.base64
petersilva commented 4 years ago

sample of zip of the same file is larger:

-rw-r--r-- 1 peter peter 2501 Feb  1 14:14 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.zip

josusky commented 4 years ago

> another weirdness I did, not sure if it is OK: since one cannot have literal line feeds and carriage returns in JSON strings, when sending "utf-8" I replace them with \n and \r respectively.... hmm...

That is not a weirdness, that is the standard solution: https://www.json.org/json-en.html :-)

josusky commented 4 years ago

> Question: do we make base64 the default, so that no "encoding" tag is needed in that case? (Then the German example from last week would have been OK without such a tag.)

I am of two minds here. Making base64 the default would make certain messages (typically SYNOPs in BUFR) a bit smaller, and my code is already adapted to such data. On the other hand, it makes interpreting messages a little more complex. Flip a coin and let me know the result :-)

josusky commented 4 years ago

> sample of zip of the same file is larger: -rw-r--r-- 1 peter peter 2501 Feb 1 14:14 LTUS43_KLMK_011621_AAA_cd8d8f0897d63cafde2bbe7bf76706b8.iwxxm.zip

My observations yield the same result. In fact, about 2 kB are needed for the "dead weight" of a typical IWXXM report (when compressed), so a multi-report IWXXM file of 38 kB can be compressed to 3 kB. But back to the point: gzip is one of the best compression methods for text and XML, and it is so widely supported that there is hardly any point against it.

petersilva commented 4 years ago

Philosophically, the key point is that there are some specialized compression methods, like Efficient XML Interchange ( https://en.wikipedia.org/wiki/Efficient_XML_Interchange ), that have some traction, but they rely on knowledge of the format. The result is far simpler if we use methods that do not require an understanding of the format of the file being transferred. With Efficient XML, one needs access to the original data schema, and that informs the compression. I think such format-aware compression methods are unwise, and they will lead to an explosion of methods to support (proportional to the diversity of file types being transferred). A general method, and gzip is a prime candidate, is far preferable, as it will minimize the size and complexity of the resulting implementations. Keeping the precise method explicit in the message is a means of future-proofing, in case gzip falls out of favour in a decade.

petersilva commented 4 years ago

On the coin-flip question, we can explore a bit more: gzip has a ten-byte header built in. That header has a recognizable transformation to base64, so we could ask implementations to identify gzipped data, as well as base64, without any encoding field. That gets even more implicit. My gut says that is a bridge too far, but I don't know why.
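
For concreteness: the gzip magic bytes 1f 8b, plus the usual deflate-method byte 08, encode to the base64 prefix "H4sI", so such sniffing would be a short check. A sketch of the implicit detection being described (the function name is hypothetical, and this is the scheme the thread ultimately decides against):

```python
import base64

def sniff_encoding(value: str) -> str:
    """Guess the encoding of a "content" value without an explicit field.

    Illustrative only. Caveat: utf-8 text that happens to consist solely of
    base64-alphabet characters would be misclassified, which is exactly the
    ambiguity discussed later in this thread.
    """
    if value.startswith("H4sI"):  # base64 of the gzip magic bytes 1f 8b 08
        return "gzip"
    try:
        base64.b64decode(value, validate=True)
        return "base64"
    except Exception:
        return "utf-8"
```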

petersilva commented 4 years ago

to summarize:

  • there are three possible values for the "encoding" of the "content" object: "utf-8", "base64", and "gzip";
  • "gzip" means gzip-compressed and then base64-encoded (the base64 step is implicit in the name);
  • still open: whether "base64" should be the default, so that the "encoding" tag could be omitted.

petersilva commented 4 years ago

Another sort of question. Outside the WMO world, file suffixes are often significant. If someone has a file name that ends in .gz, I guess it just gets base64 encoded, but on reception we need to check the filename? We should not unwrap the gzip even though it is there... OK... This is why encoding should not be implicit... it leads to ambiguity. hmm...


blacklab% ls -l XO*
-rw-rw-r-- 1 peter peter 3134 Jan 22  2019 XOUS56_KWBC_230100_f6aa587c99bf2aa2a484d916a668542d.cap
-rw-r--r-- 1 peter peter 4235 Feb  6 09:10 XOUS56_KWBC_230100_f6aa587c99bf2aa2a484d916a668542d.cap.base64
-rw-r--r-- 1 peter peter 5723 Feb  6 09:10 XOUS56_KWBC_230100_f6aa587c99bf2aa2a484d916a668542d.cap.base64.base64
blacklab% 

I think I'm just being cute/stupid. If you have an input file that is actually in the default format, then removing that format on reception will corrupt the file for upper-layer purposes. So maybe an implicit format is just bad.

josusky commented 4 years ago

I am getting a bit lost; English my mother language is not. :-) So let me try to summarize the compression issue in simple words, and then tell me if I got it right or wrong:

We have introduced (or are about to introduce) compression of the "content" field in the JSON that forms the payload of messages/notifications exchanged in the WMO mesh (a prototype of a WIS 2.0 component that aims to replace the GTS). The purpose of the compression is to keep those messages/notifications reasonably small (and it is mostly needed because station observations and forecasts in IWXXM are bigger by an order of magnitude than ... er ... is needed or practical, or than the old TAC format was). Thus, if some center decides to publish an IWXXM bulletin (or any other bloated data) by putting it into the "content" field, it is highly desirable to compress it (and to make it clear that it is compressed). But the receiving center may decide to store it as a file and forward the MQTT/AMQP message without the "content" field. In that case the stored file will be plain XML (and have the name originally indicated by the publisher). If the URL from the forwarded notification is later used by yet another center to request the (XML) file, it could again be compressed for the purpose of the transfer (HTTP has a mechanism for negotiating a transfer encoding that includes compression). In other words, the encoding/compression of the "content" field is an implementation detail of a data transfer between two adjacent nodes and does not need to be preserved.

On the other hand, if some center publishes a notification about a file called example.abc.gz, this information should be forwarded without changing/decompressing it to example.abc. If someone decides to embed this file into the "content" field (because it happens to be reasonably small), then it shall be base64 encoded for the purpose of the transfer (a requirement of our JSON-based protocol). A center that does not like messages/notifications with a "content" field may decide to store the file, and it shall perform only base64 decoding, not gunzipping. Of course, some center may decide to inline/embed example.abc.gz and gzip the "content" field. That will not really reduce the size, but it won't break anything as long as the "encoding" is set to "gzip": the receiving center will apply base64 decoding and gunzipping, and the result will be a valid example.abc.gz file (binary identical to the original, which could be verified using the checksum ("integrity") field).

If my summary is right, then it means that the file name ("relPath") shall not be changed/adjusted when the file is put into the "content" field, not even if the "encoding" of the "content" is "gzip". This also means that the only encoding that could possibly be default/implicit is base64 ("utf-8" would be technically eligible too, but "base64" is already used that way). In fact, I am ready to sacrifice support for a default/implicit encoding, for the sake of clarity. Of course, a center may decide to process a file and republish the new file(s), most probably under a different name (highly recommended); and the processing could possibly be a compression/decompression, but that is unrelated to the compression of inlined/embedded messages.

petersilva commented 4 years ago

You have summarized it exactly. At most, only base64 can be implicit; if gzip is used in addition, then there must be an "encoding". I am now thinking that clarity is better, and that an implicit encoding will confuse someone, some day, so always having an explicit encoding is much simpler and clearer.

For the other point, about changing relPath: my first thought is that if the relPath changes on receipt, then when it is re-advertised it is a different product, and it will have a new life in the mesh. It will not be related to the previously circulated one.

The other thing is... if, say, a file format is base64 encoded already, should implementations notice that the source is already base64 and send it as "utf-8" (thus unmodified), or should they re-encode into double base64, with the size penalty from double encoding? My stack goes over the content of the buffer, and if it contains valid utf-8 it sends it as such (preferred), but a stack that prefers base64 will re-encode and decode at each hop. I guess it doesn't matter, as both will work. What will not work (is ambiguous) is for implementations to look at the relPath to figure out that it is base64 already, and then optionally not decode the base64.
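
A sketch of that sender-side preference (the function name is hypothetical):

```python
import base64

def choose_encoding(payload: bytes) -> tuple:
    """Prefer "utf-8" when the bytes decode cleanly; otherwise "base64"."""
    try:
        return "utf-8", payload.decode("utf-8")
    except UnicodeDecodeError:
        return "base64", base64.b64encode(payload).decode("ascii")
```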

The rule: implementations shall not look at relPath for hints about encoding. Encoding information comes exclusively from the "encoding" value in the "content" header. Does that make sense?
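
Expressed as code, the rule might look like this on the receiving side. This is a sketch with a hypothetical function name, assuming the "content"/"encoding"/"value"/"relPath" field names used in this thread; relPath names the output file and is never consulted for decoding:

```python
import base64
import gzip
import os

def store_inline_content(msg: dict, root: str) -> str:
    """Decode an inlined "content" object and write out the original bytes.

    The decoder is chosen solely from content["encoding"]; relPath is used
    only as the destination file name, never as an encoding hint.
    """
    content = msg["content"]
    encoding = content["encoding"]  # always explicit, per the discussion
    value = content["value"]

    if encoding == "utf-8":
        data = value.encode("utf-8")
    elif encoding == "base64":
        data = base64.b64decode(value)
    elif encoding == "gzip":        # gzip implies base64 underneath
        data = gzip.decompress(base64.b64decode(value))
    else:
        raise ValueError(f"unknown encoding: {encoding}")

    path = os.path.join(root, msg["relPath"])
    dirpath = os.path.dirname(path)
    if dirpath:
        os.makedirs(dirpath, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```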

petersilva commented 4 years ago

new summary:

  • the "encoding" of the "content" object is always explicit; there is no default/implicit encoding;
  • implementations shall not look at relPath for hints about encoding;
  • the encoding of the "content" field is a hop-by-hop transfer detail: relPath is never changed when a file is inlined, and a stored file gets back its original bytes and name.

josusky commented 4 years ago

Agreed.