This is just because of the polynomial-time operation in `BitLengthSet.elementwise_sum_cartesian_product`, right? I don't think this is an optimization task; I think it's a theoretical task to find out if there's another algorithm you can use to implement this method with a better time complexity.
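For context, the operation reduces to roughly the following shape (a simplified sketch, not the verbatim pydsdl implementation): it enumerates the full Cartesian product of the input sets and sums each combination, so the work grows with the product of the set cardinalities, regardless of how few distinct values survive deduplication.

```python
# Simplified sketch of the operation, not the verbatim pydsdl implementation.
import itertools
import typing

def elementwise_sum_cartesian_product(sets: typing.Iterable[typing.Set[int]]) -> typing.Set[int]:
    # One sum per element of the Cartesian product: the cost is the product
    # of the cardinalities of the inputs.
    return {sum(combo) for combo in itertools.product(*sets)}
```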
This also isn't very much of an edge case. Take this little example:
```python
@_in_n_out
def _unittest_simulated_variant() -> None:
    _parse_definition(
        _define('ns/A.1.0.uavcan', dedent(r'''
            ns.OptionalString.1.0[<=3] b
            @assert _offset_.min == 8
        ''')),
        [
            _define('ns/OptionalString.1.0.uavcan',
                    dedent(r'''
                    @union
                    ns.Empty.1.0 empty
                    ns.String.1.0 string
                    ''')),
            _define('ns/Empty.1.0.uavcan', ''),
            _define('ns/String.1.0.uavcan', 'uint8[<=256] value'),
        ])
```
Profiling the above yields:

```
84290 function calls (80783 primitive calls) in 0.566 seconds
```

With `ns.OptionalString.1.0[<=4] b`:

```
105249 function calls (101482 primitive calls) in 36.676 seconds
```

With `ns.OptionalString.1.0[<=5] b`:

```
127776 function calls (123749 primitive calls) in 1948.168 seconds
```
This is on my 2.4 GHz Intel i9 Mac. Such are the wonders of polynomial time!
Are these counts mostly of `BitLengthSet.elementwise_sum_cartesian_product`? I suspect there are optimization possibilities to explore aside from reviewing the algorithm itself; I did not spend time on it.
> Are these counts mostly of `BitLengthSet.elementwise_sum_cartesian_product`?

Sorta / not really / doesn't matter. The execution time is what I'm showing here to demonstrate the polynomial expansion of the complexity. I could share the .prof file with you, but I'm telling you that digging through profiler output is a waste of time at this point.
> I suspect there are optimization possibilities to explore aside from reviewing the algorithm itself

I doubt it. The time complexity defined on line _bit_length_set.py#L179 and then executed on line _bit_length_set.py#L180 is polynomial and will not be made significantly better by optimizing the machine code. My suspicion is that you need some sort of lazy-evaluation mechanism built into `BitLengthSet` to make the nominal case acceptable, and that eliminating the worst case will require API changes to this class.
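To illustrate the lazy-evaluation idea, here is a minimal sketch under a simplified model where a bit length set is just a set of non-negative integers; the class and method names are hypothetical, not PyDSDL's actual API:

```python
# Minimal sketch of lazy evaluation for a bit length set; hypothetical API.
import itertools
import typing

class LazyBitLengthSet:
    def __init__(self, values: typing.Iterable[int] = ()) -> None:
        self._concrete: typing.Optional[typing.Set[int]] = set(values)
        self._operands: typing.List['LazyBitLengthSet'] = []

    def elementwise_sum(self, other: 'LazyBitLengthSet') -> 'LazyBitLengthSet':
        # O(1): remember the operands instead of expanding the product eagerly.
        out = LazyBitLengthSet()
        out._concrete = None
        out._operands = [self, other]
        return out

    def expand(self) -> typing.Set[int]:
        # The expensive expansion is paid only here, and only by callers that
        # genuinely need every concrete value.
        if self._concrete is None:
            a, b = (op.expand() for op in self._operands)
            self._concrete = {x + y for x, y in itertools.product(a, b)}
            self._operands = []
        return self._concrete

    def min(self) -> int:
        # min() distributes over the elementwise sum, so a query like
        # `@assert _offset_.min == 8` never has to force an expansion.
        if self._concrete is not None:
            return min(self._concrete)
        return sum(op.min() for op in self._operands)
```

The point of the sketch is only that cheap queries such as the minimum need not trigger the expensive expansion; whether this fits the existing `BitLengthSet` API is exactly the open question here.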
> demonstrate the polynomial expansion of the complexity.

Well, of course; that is the documented behavior. My point was that for limited n there might be practical low-cost optimizations that are cheaper to pursue, although lower-gain, than reviewing the algorithm.
So, long-term we have the following options:

- Rewrite the offset computation in OpenCL and use that if OpenCL drivers are available.
- Resurrect dsdl_parser.rs and redefine PyDSDL as a wrapper over the native Rust library.
- Redefine the bit length set in the Specification (this is a bad idea, but I am listing it here for completeness).
Here's another interesting example I just encountered.
If `A.1.0` is defined as:

```
float16 x
float16 y
@sealed
```

then the following is processed instantly (a few milliseconds):

```
A.1.0[<64] x
@assert _offset_ % 8 == {0}
```

But if `A.1.0` is not `@sealed`, the evaluation of the assertion check directive takes 11 minutes. This is because delimited (non-sealed) types behave like arrays of bytes (`uint8[<=4][<64]`), which results in a combinatorial explosion.
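A toy illustration of the effect (plain Python, unrelated to PyDSDL's actual code): the raw Cartesian product over the possible element lengths grows exponentially with array capacity, even though only a few hundred distinct total lengths would survive deduplication, which hints at the simplifying factor discussed below.

```python
# Toy illustration, not PyDSDL code: suppose each element serializes to one of
# 5 possible lengths (like uint8[<=4]). An array of up to 63 such elements has
# 5**63 raw length combinations, yet very few distinct total lengths.
import itertools

element = set(range(5))            # per-element serialized length options
raw_combinations = len(element) ** 63
merged = {0}
for _ in range(63):                # deduplicating at every step keeps this cheap
    merged = {a + b for a, b in itertools.product(merged, element)}
print(f'{raw_combinations:.3e}')   # ~1.084e+44 combinations if expanded naively
print(len(merged))                 # only 253 distinct total lengths
```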
This is also going to be a problem for large types like image frames or point clouds. My setup runs out of memory trying to evaluate the bit length set of the following definition unless it is `@sealed`:

```
uint16 PIXELS_PER_ROW = 3840
uint16 ROWS_PER_IMAGE = 2748
uint32 PIXELS_PER_IMAGE = PIXELS_PER_ROW * ROWS_PER_IMAGE
uavcan.time.SynchronizedTimestamp.1.0 timestamp
void8
uint8[PIXELS_PER_IMAGE * 3] pixels
```
I think we need to fix this in the specification. If we can't implement this trivially in Python, then something about our technology is fundamentally too complex. There must be a simplifying factor we can apply that either reduces the computational complexity or introduces a hard ceiling on it. I'd only want to talk about using the GPU or other optimizations in the context of making something that takes seconds take only milliseconds.
As discussed on the call: it might be possible to implement the required behaviors efficiently without changing the definition of the bit length set. We should either modify the existing implementation accordingly or at least prove that such an implementation is in principle possible. If not, we will need to go back to the Spec.
Right now we need help with defining an algorithm that yields identical output but does so more efficiently. One way to approach this is to add heuristics that detect and avoid unnecessary computation; one such heuristic is sketched below.
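For instance (a hypothetical heuristic, not current PyDSDL code): bit length sets encountered in practice are frequently contiguous, evenly strided ranges (e.g., byte-aligned sets), and the elementwise sum of two such ranges with a common stride is again such a range, computable in O(1) without any expansion:

```python
# Hypothetical heuristic, not PyDSDL's API: represent the common special case
# {start + stride*k for k in 0..count-1} compactly and sum such sets in O(1).
from dataclasses import dataclass

@dataclass(frozen=True)
class StridedRange:
    start: int   # smallest element of the set
    count: int   # number of elements
    stride: int  # distance between consecutive elements, e.g. 8 if byte-aligned

    def elementwise_sum(self, other: 'StridedRange') -> 'StridedRange':
        # {a + b} over two ranges with equal stride is again a complete range:
        # start_a + start_b + stride*(i + j), where i + j covers 0..(m + n - 2).
        assert self.stride == other.stride
        return StridedRange(self.start + other.start,
                            self.count + other.count - 1,
                            self.stride)
```

The full Cartesian product would then be needed only for the residual cases (e.g., unions with irregular options), which is where a heuristic dispatch of this kind could pay off.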
While doing something completely unrelated, I decided to quickly check PyPy and found that it's actually almost twice as slow:

```
$ time python test.py > /dev/null
194 data types in 2.6 seconds
python test.py > /dev/null  5,37s user 0,03s system 99% cpu 5,406 total

$ time pypy3 test.py > /dev/null
194 data types in 4.4 seconds
pypy3 test.py > /dev/null  9,61s user 0,70s system 99% cpu 10,341 total
```
Edit: this is not surprising because the JIT compiler is not expected to be beneficial in short-lived processes.
Replacing the native combinatorial functions (specifically, the multicombinations and the Cartesian product) with their NumPy alternatives might be a way to speed things up cheaply. I don't want to make NumPy a hard dependency, so we could have two versions of `BitLengthSet`, where the NumPy-enabled one is chosen at import time if NumPy is available. See https://numpy.org/doc/stable/reference/generated/numpy.meshgrid.html.
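A rough sketch of both halves of that idea (the module names and the dispatch are hypothetical; the broadcasting trick is just one way to vectorize the pairwise reduction):

```python
# Sketch only: a NumPy-backed variant of the elementwise sum over a sequence of
# sets, reduced pairwise via a broadcasted outer sum plus deduplication.
import typing
import numpy as np

def elementwise_sum_cartesian_product(sets: typing.Sequence[typing.Set[int]]) -> typing.Set[int]:
    acc = np.array(sorted(sets[0]), dtype=np.int64)
    for s in sets[1:]:
        b = np.array(sorted(s), dtype=np.int64)
        acc = np.unique(acc[:, np.newaxis] + b[np.newaxis, :])  # outer sum + dedup
    return set(acc.tolist())
```

The import-time selection could then be as simple as (module names hypothetical):

```python
try:
    from ._bit_length_set_numpy import BitLengthSet   # requires NumPy
except ImportError:
    from ._bit_length_set_native import BitLengthSet  # pure-Python fallback
```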
> Edit: this is not surprising because the JIT compiler is not expected to be beneficial in short-lived processes.

PyPy (v7.3) is slower even for not-so-short-lived processes:

- `pypy3 -m nunavut --target-language c --enable-serialization-asserts public_regulated_data_types/uavcan` -- 1 minute 39 seconds
- `python3.9 -m nunavut --target-language c --enable-serialization-asserts public_regulated_data_types/uavcan` -- 57 seconds
A benign-looking definition that references `uavcan.register.Value.0.1` is parsed very quickly (some milliseconds), but any attempt to determine its bit length set (e.g., using `@assert _offset_ <...>`) or to generate code from it takes a very, very long time (possibly hours).

The referenced type has a very complex layout (thousands of bit length options), so some delay is expected. However, a delay of more than several seconds is highly undesirable, and anything over ~20 seconds is unacceptable. What makes things worse is that most of the computation is performed in the native context, so the user can't interrupt it with SIGINT/CTRL+C. The bit length set computation logic should be reviewed some day; it should be very optimizable.
The problem does not affect any of the standard definitions.