binast / binjs-ref

Reference implementation for the JavaScript Binary AST format
https://binast.github.io/binjs-ref/binjs/index.html

Assertion failure: assert next_code_length >= code_length when code_length overflows #440

Open arai-a opened 5 years ago

arai-a commented 5 years ago

Steps to reproduce:

  1. Run the following Python script and redirect the output to test.js:

    for i in range(1, 23):
        for j in range(0, int(1.7 ** i)):
            print('{};'.format(i))

(the result is a 1.1 MB JS file)

  2. Encode the JS file generated in step 1 with fbssdc.

Actual result:

crash at https://github.com/binast/binjs-ref/blob/5cdddbcef1f1b185a4936ac996180aa2cce54571/fbssdc/model.py#L472

AssertionError: (2097151, 21), 20, 21
Yoric commented 5 years ago

Cc @dominiccooney

So, we have too many numbers in the table of numbers?

arai-a commented 5 years ago

It tries to assign a 21-bit length code to symbols 1 and 2.

We'll need to define a fallback algorithm for this kind of case. (IMO this doesn't happen with typical real-world input, so just falling back to assigning the same code length to all items should be fine.)
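The suggested fallback can be sketched as follows; `flat_code_lengths` is a hypothetical helper name for illustration, not part of fbssdc:

```python
import math

def flat_code_lengths(symbols):
    # Hypothetical fallback sketch: give every symbol the same code
    # length, the smallest L with 2**L >= len(symbols). This always
    # satisfies the Kraft inequality, so a valid prefix code exists,
    # at the cost of ignoring symbol frequencies entirely.
    length = max(1, math.ceil(math.log2(len(symbols))))
    return {s: length for s in symbols}
```

Because every code has the same length, no length can overflow the limit as long as the symbol count itself fits (here, at most 2**20 symbols for 20-bit codes).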

dominiccooney commented 5 years ago

That algorithm is a non-normative illustration of assigning codes; don't take it as spec. We should add a note about the code length issue.

We can and should leave code-length-limiting algorithms as an implementation detail of the encoder, because there are slow, accurate algorithms and fast, approximating ones, and depending on the circumstances either could be preferable. We don't want to bake one into the spec, and the decoder doesn't care.

If you need one to crib from, you can dig up papers on length-limited Huffman codes for the accurate algorithm, or zstd has a fast approximation (as do probably many other Huffman compressors). Facebook does something like zstd in our Hack encoder.
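A minimal sketch of the fast-approximation family mentioned here (this is not zstd's actual algorithm, only the same flavor, and `limit_code_lengths` is a hypothetical name): clamp overlong codes to the limit, then restore the Kraft inequality by lengthening other codes.

```python
from fractions import Fraction

def limit_code_lengths(lengths, max_len=20):
    # Hypothetical sketch of a fast length-limiting heuristic: clamp
    # every code length to max_len, then repair the Kraft inequality
    # sum(2**-l) <= 1 by lengthening codes still below the limit.
    # (Raises if there are more than 2**max_len symbols, in which case
    # no prefix code of that depth can exist at all.)
    lengths = [min(l, max_len) for l in lengths]
    kraft = sum(Fraction(1, 2 ** l) for l in lengths)
    while kraft > 1:
        # Lengthening a code of length l shrinks the sum by 2**-(l+1),
        # so picking the longest eligible code takes the smallest step.
        i = max((j for j, l in enumerate(lengths) if l < max_len),
                key=lambda j: lengths[j])
        kraft -= Fraction(1, 2 ** (lengths[i] + 1))
        lengths[i] += 1
    return lengths
```

The result is a valid but generally suboptimal set of lengths; the slow, accurate alternative is the package-merge algorithm for length-limited Huffman codes.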


Yoric commented 5 years ago

That algorithm is a non-normative illustration of assigning codes; don't take it as spec. We should add a note about the code length issue.

Ah, good to know. We're using it as spec all over SpiderMonkey for the time being :)

dominiccooney commented 5 years ago

Ok. A correct encoder should not output lengths > 20 bits, and SpiderMonkey should reject files with 21+ bit length codes.
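The decoder-side rejection described here might look like the following sketch (`check_code_lengths` is a hypothetical helper, not SpiderMonkey's actual code):

```python
MAX_CODE_LENGTH = 20  # the limit discussed in this thread

def check_code_lengths(lengths):
    # Hypothetical decoder-side validation sketch: reject any Huffman
    # table containing a code longer than MAX_CODE_LENGTH bits.
    bad = [l for l in lengths if l > MAX_CODE_LENGTH]
    if bad:
        raise ValueError(
            "invalid file: code length {} exceeds {} bits".format(
                max(bad), MAX_CODE_LENGTH))
```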

BTW, 20 bits was an arbitrary choice and may be something to revisit. Some people may need a million strings... I hope not, but I would not be surprised if there's a file out there that big.

FWIW, Facebook does have files where the optimal code has lengths > 20 bits, and hence we use that heuristic to limit code lengths.


Yoric commented 5 years ago

Oh, got it, 20 is part of the spec, but the algorithm isn't.

So, given that there are real-world files where 20 bits isn't sufficient for an optimal code, do we wish to keep 20 as a hard limit? Or do we accept becoming suboptimal for huge files (and possibly issue a warning telling developers that their file is simply too large)?