No-arg macro-shape parameters could be unrepresentable in some cases

popematt commented 1 month ago

I've come across an edge case. Suppose I define the following macros:

(macro pi () 3.14159)
(macro dessert_buffet (pi::$pies*) [$pies])

Then, in Ion text, I could invoke it like this:

(:dessert_buffet (: () () ))

Which should produce

[3.14159, 3.14159]

However, this is not possible to represent in binary Ion (if the expression groups have a byte-size length prefix rather than a value-count length prefix).

There are several possible solutions:

Forbid macro-shaped parameters that are constants
Forbid */+ for macro-shaped parameters that are constants because ? and ! can work as-is.
Say that when the macro-shaped parameter is a constant, the byte length prefix becomes a count prefix.

Until a decision is made on this matter, we should use "Forbid macro-shaped parameters that are constants". This allows us to defer the decision to Ion 1.2 if we don't get to it now.

toddjonker commented 1 month ago

This feels like a defect in the binary spec. I advise against proposals 1 and 2 since they violate orthogonality expectations in ways that affect the user surface.

popematt commented 1 month ago

I don't have a strong preference for any of them. I think it's going to require weighing the costs and benefits of each. Here's why I think it's not simple to choose one that is clearly the best solution:

Proposal 1 (and 2) may violate orthogonality expectations.
Proposal 3 introduces a potential DoS vector (see below). I've mentioned a mitigation, but is it actually sufficient? Does the mitigation violate the orthogonality expectations?
What problem actually requires no-arg-macro-shaped parameters as a solution? The only use case I've been able to come up with is a hacky attempt at creating "tagless" boolean parameters (see #331). The use case of repeating a constant value can be solved in an efficient and sensible way in both text and binary Ion by using repeat.
Would no-arg-macro-shaped parameters be confusing to users?

And less importantly:

At the end of the day, Ion is intended to help humans (customers). Do we want to allow a questionably useful edge case that introduces (human-)unreadable syntax in text Ion? E.g.: (:my_macro (: () () () ) ) or (:other_macro ( ( () ) ) )
Proposal 3 would add more complexity to the reader and writer implementations for a questionably useful edge case whereas proposals 1 and 2 just requires an additional check in the macro compiler.
Proposal 2 adds the most complexity from a user perspective (zero-arg-macro-shapes can be used in some cases but not all)

Proposal 3 DoS vector Suppose I define the following macros which have addresses 0 and 1 respectively.

(macro one () 1)
(macro many (one::ones*) ones)

Then, I can construct a malicious input using those macros:

 ┌ Macro address 1
 │               Length Prefix representing a _count_ of expressions
 │             ┌ FlexUInt 9223372036854775807 (Long.MAX_VALUE)
┌┤ ┌───────────┴────────────┐
01 00 FF FF FF FF FF FF FF FF

If a system attempts to transcode this to Ion text, it will transform to the following:

(:many (: () () () () () () () /* ...9223372036854775800 more times */ ) )

The transformation has O(128ⁿ) space complexity, and a malicious actor could use this to repeatedly exhaust the memory of a remote system.

When the macro-shape has at least one parameter, then an attacker's payload must include at least one byte for every macro-shaped invocation; thus any expansion during transcoding is dominated by the linear factor.

The repeat macro/special form doesn't suffer from this because it has a short representation in both text and binary and so transcoding does not result in any expansion.

There is a related DoS vector when evaluating macros that use repeat, but that can be mitigated by having a limit on the number of expressions that can be produced by a macro expansion. This could be a user-configurable limit, and it would apply regardless of whether the data was text or binary. Furthermore, when reading/evaluating macros, if you really need to process a very large expansion, it is possible to craft your application logic in such a way that it looks at an expression, takes some action, and then drops that expression out of memory.

It may be possible to use the same or a similar mitigation for transcoding use cases. However, then we end up with asymmetry where some data can be represented in Ion binary, but no implementations of Ion 1.1 can encode it as Ion text (despite it being theoretically possible). Unlike reading, any transcoding use case will almost certainly result in the transcoded data being sent (in full) to some other destination (most likely a network connection, file, or user display), for which there isn't an equivalent workaround for handling really large expansions as there is with reading data.

popematt commented 1 week ago

Another potential solution is to change the binary encoding so that a no-arg-macro-shaped argument would no longer have a zero-sized representation.

Suppose we have these macros:

(macro na () "na ")
(macro batman (na::intro*) (make_string intro "batman!")

Example 1

(:batman () () () () () () () ()) 
   => "na na na na na na na na batman!"

In binary, this could be encoded as

01 02 11 EC EC EC EC EC EC EC EC
│  │  │  └──────────┬──────────┘
│  │  │             └ eight bytes indicating 8 invocations of `na`  
│  │  └ FlexUInt indicating expression group length is 8 bytes
│  └ Argument Encoding Bitmap indicating the first parameter is an expression group
└ Macro address 1 (batman)

In this example, I've chosen to use 0xEC, which is the opcode for a single-byte NOP, but it could be anything since the value itself is meaningless. This is very quick to read because we don't actually need to read any of the bytes in the expression group—we only need to know the length.

Example 2

If we're willing to introduce a little bit more complexity, we could encode each no-arg macro-shaped arg with a single bit to be more compact.

(:batman () () () () () ()) 
   => "na na na na na na batman!"

In binary, this could be encoded as:

0000 0001  0000 0010  0000 0011  0011 1111
└───┬───┘  └───┬───┘  └───┬───┘  └───┬───┘
    │          │          │          └ the 6 least significant bits are set, indicating 6 invocations of `na`.
    │          │          └ FlexUInt indicating expression group length is 1 byte
    │          └ Argument Encoding Bitmap indicating the first parameter is an expression group
    └ Macro address 1 (batman)

By requiring the least significant bits to be set, it's still simple to read. The bytes in the expression group can be read as if they are a FixedUInt of size N (N is the the length of the expression group in bytes). Then, the number of expressions in the group is N * 8 - numLeadingZeros().

amazon-ion / ion-docs

No-arg macro-shape parameters could be unrepresentable in some cases #332