clj-commons / clj-yaml

YAML encoding and decoding for Clojure

The incoming YAML document exceeds the limit: 3145728 code points. #94

Closed: kwladyka closed this issue 1 year ago

kwladyka commented 1 year ago

When reading a large YAML file, I get:

The incoming YAML document exceeds the limit: 3145728 code points.

The code_point_limit needs to be overridden, but I didn't find a way to do this with clj-yaml.

How do you read large YAML files?

From Slack #clj-yaml:

There isn't currently an option, but you can call make-loader and then call (.setCodePointLimit ...) on the result. But I don't see where you can then pass that loader. That should probably be added as well, along with an explicit :code-point-limit option.
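A minimal interop sketch of that workaround, assuming SnakeYAML 1.32+ (where LoaderOptions exposes setCodePointLimit). It bypasses clj-yaml's wrappers entirely, so it returns plain Java collections rather than the keywordized Clojure maps clj-yaml normally produces; the function name parse-big-yaml is illustrative.

(ns example
  (:require [clojure.java.io :as io])
  (:import (org.yaml.snakeyaml LoaderOptions Yaml)))

;; Raise SnakeYAML's code point limit directly, then parse with it.
(defn parse-big-yaml [file limit]
  (let [opts (doto (LoaderOptions.)
               (.setCodePointLimit (int limit)))]
    (with-open [r (io/reader file)]
      (.load (Yaml. opts) r))))

;; e.g. allow up to ~10M code points:
;; (parse-big-yaml "a-file.yaml" (* 10 1024 1024))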

PetrGlad commented 1 year ago

I have a large file, loaded as

;; assumes (:require [clj-yaml.core :as yaml] [clojure.java.io :as javaio])
(def all-of-it (yaml/parse-stream (javaio/reader "a-file.yaml") :load-all true))

Curiously, when I evaluate it in a REPL, (map identity all-of-it) raises the exception, but (take 1000000 all-of-it) does not (the take count is larger than the number of documents in the input file).

borkdude commented 1 year ago

@PetrGlad Can you wrap that take in a doall?
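(take is lazy, so nothing is parsed until the sequence is realized; a minimal sketch, assuming the all-of-it binding above:)

;; doall forces realization, so any code point limit error from
;; parsing should surface here rather than being deferred:
(doall (take 1000000 all-of-it))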

PetrGlad commented 1 year ago

Sorry, it looks like reproducing this requires attempting an operation on the sequence first; after that, other operations succeed. Like:

(def all-of-it (yaml/parse-stream (javaio/reader "a-file.yaml") :load-all true))
(doall (map identity all-of-it)) ; <-- FAILS
(doall (map identity all-of-it)) ; <-- OK

It seems it does not matter which operation is tried first. These are evaluated in a REPL, so I think doall should not change the behavior.

borkdude commented 1 year ago

If anyone wants to do a PR, we're open to that. It should be relatively straightforward to add.
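A hypothetical sketch of the shape such a patch could take; the helper name and option wiring below are illustrative, not clj-yaml's actual internals:

;; Hypothetical: thread a :code-point-limit option through to
;; SnakeYAML's LoaderOptions.
(defn make-loader-options
  [& {:keys [code-point-limit]}]
  (let [opts (org.yaml.snakeyaml.LoaderOptions.)]
    (when code-point-limit
      (.setCodePointLimit opts (int code-point-limit)))
    opts))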

PetrGlad commented 1 year ago

Just wanted to note that the actual problem (in my case) is in SnakeYAML; I have already reported it. SnakeYAML has enforced an input size limit, but it actually limits the size of the whole input stream, while it only makes sense to limit the size of each document instead. For example, this makes a difference when the input stream contains many small documents. Making the limit configurable would be a workaround, nonetheless.
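To illustrate the distinction, a sketch using parse-stream as above: each document below is tiny, but if the limit applies to the whole stream, their combined size trips it.

;; ~9 code points per document, 500,000 documents ≈ 4.5M code points,
;; which is over the 3,145,728 default even though no single document
;; is large.
(def many-small-docs
  (apply str (repeat 500000 "---\nk: v\n")))

(comment
  ;; With a stream-wide limit this fails; with a per-document limit
  ;; it would succeed.
  (doall (yaml/parse-stream (java.io.StringReader. many-small-docs)
                            :load-all true)))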

lread commented 1 year ago

Thanks for following up @PetrGlad!

Just wanted to note that the actual problem (in my case) is in SnakeYAML; I have already reported it.

Was it this issue here?

SnakeYAML has enforced an input size limit, but it actually limits the size of the whole input stream, while it only makes sense to limit the size of each document instead. For example, this makes a difference when the input stream contains many small documents. Making the limit configurable would be a workaround, nonetheless.

Is there a separate SnakeYAML issue to address this too?

PetrGlad commented 1 year ago

Yes, that was the change. I sent a message to Google Groups because the other services were locked down due to attacks. They admitted that it is likely a problem, but I do not know whether a ticket was created (here).

lread commented 1 year ago

@PetrGlad, I don't see a SnakeYAML issue created for that either. I think SnakeYAML issues on Bitbucket might still be a bit wonky. We can see them now, but maybe not create new issues yet. A friendly reply/reminder on your thread in the SnakeYAML mailing list would probably be helpful to Andrey.

lread commented 1 year ago

I pinged Andrey and he responded:

It was fixed without a ticket. Feel free to create one - we can check how it works (it should be pre-moderated now).

https://bitbucket.org/snakeyaml/snakeyaml/wiki/Changes

Andrey

lread commented 1 year ago

@PetrGlad, FYI: because Andrey asked me to, and in the spirit of being a good citizen, I went ahead and created a SnakeYAML ticket with a repro.