dbuenzli / uutf

Non-blocking streaming Unicode codec for OCaml
http://erratique.ch/software/uutf
ISC License

Substring handling #6

Closed raphael-proust closed 8 years ago

raphael-proust commented 9 years ago

I nuked the previous PR because I don't know how to use git properly. So here it is again, in one single, clean, well-indented patch.

Closes #4.

dbuenzli commented 9 years ago

Thanks. That looks clean. One question though: do you really need the `Substring` source? I'd prefer to hold that until I have considered #2. (If the answer is no, don't bother, I'll split the patch myself.)

raphael-proust commented 9 years ago

I have a workaround for now:

let decoder c =
  let d = Uutf.decoder ~encoding:`UTF_8 `Manual in
  Uutf.Manual.src d c.content c.pos c.len;
  d

And then I treat Await as End. It's not as clean as Substring, but it is acceptable.

Alternatively, String could carry two optional integer values.

Alternatively, there could be a way to signal to a Manual decoder that there won't be any more calls to Manual.src (and thus to return End instead of Await). Something like Manual.seal or Manual.terminate.

dbuenzli commented 9 years ago

On Thursday, 5 February 2015 at 09:03, Raphaël Proust wrote:

And then I treat Await as End.

This is wrong: you have to terminate the manual source properly, as documented (there could be a truncated character at the end and you would miss a `Malformed).
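For readers following along, here is a minimal sketch of what terminating the manual source could look like for the substring case. It relies on the documented behaviour that calling Manual.src with a length of 0 signals the end of input. The function name decode_sub is hypothetical, and the sketch assumes Uutf 1.0.0 or later, where Manual.src takes Bytes.t and `Uchar carries a Uchar.t (at the time of this thread it took a string).

```ocaml
(* Sketch only: decode_sub is a hypothetical helper, not part of Uutf.
   Assumes Uutf >= 1.0.0 (Manual.src takes Bytes.t). *)
let decode_sub (s : string) (pos : int) (len : int) =
  let d = Uutf.decoder ~encoding:`UTF_8 `Manual in
  (* The decoder only reads from the buffer, so no copy is made here. *)
  let b = Bytes.unsafe_of_string s in
  Uutf.Manual.src d b pos len;
  let rec loop acc =
    match Uutf.decode d with
    | `Uchar u -> loop (`Uchar u :: acc)
    | `Malformed bytes -> loop (`Malformed bytes :: acc)
    | `Await ->
        (* The substring is exhausted: a zero-length range signals the
           end of input instead of treating `Await as `End. *)
        Uutf.Manual.src d b 0 0;
        loop acc
    | `End -> List.rev acc
  in
  loop []
```

With this shape, a substring that cuts a multi-byte character in half yields a trailing `Malformed rather than silently dropping the partial bytes.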

It's not as clean as Substring, but it is acceptable.

Yes, I forgot about that. Actually I'm pretty sure I didn't include offsets in `String because I thought you could use that at the time. Seems good enough to me; I'd like to avoid complicating the API too much. I don't mind more work for clients that are not in the average use case.

Alternatively, there could be a way to signal to a Manual decoder that there won't be any more calls to Manual.src (and thus to return End instead of Await). Something like Manual.seal or Manual.terminate.

More complex w.r.t. API, documentation and implementation.

Daniel

raphael-proust commented 9 years ago

Manual will work for me.

Although…

And then I treat Await as End.

This is wrong: you have to terminate the manual source properly, as documented (there could be a truncated character at the end and you would miss a `Malformed).

It looks (from the source, the documentation is not quite clear on that point) as if calling Manual.src replaces the current source instead of adding on top of it. (After checking some bounds, it executes d.i <- s, which replaces the input string in the decoder record.) How does calling Manual.src let you see the Malformed characters, then?

dbuenzli commented 9 years ago

It looks (from the source, the documentation is not quite clear on that point) as if calling Manual.src replaces the current source instead of adding on top of it.

I don't think the documentation should say something about this. It just tells you it will read from the string you provide. How it does this is none of your business.

How does calling Manual.src let you see the Malformed characters, then?

There's a temporary buffer that gets filled in if the byte sequence of a character overlaps two (or more) `Manually provided buffers, see this comment. If there's not enough data to decode a character the continuation fills this buffer in until it can decode the character.
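The effect described above can be observed from the outside with a small sketch that feeds a decoder several chunks, one Manual.src call per `Await. The helper name decode_chunks is hypothetical, and the sketch assumes Uutf 1.0.0 or later (Manual.src takes Bytes.t). It only exercises the public API and makes no claim about how the internal buffer is implemented.

```ocaml
(* Sketch only: decode_chunks is a hypothetical helper, not part of Uutf.
   Assumes Uutf >= 1.0.0 (Manual.src takes Bytes.t). *)
let decode_chunks (chunks : string list) =
  let d = Uutf.decoder ~encoding:`UTF_8 `Manual in
  (* A zero-length range means end of input, so empty chunks are dropped
     up front to avoid terminating the source by accident. *)
  let chunks = List.filter (fun c -> c <> "") chunks in
  let rec loop pending acc =
    match Uutf.decode d with
    | `Uchar u -> loop pending (`Uchar u :: acc)
    | `Malformed m -> loop pending (`Malformed m :: acc)
    | `End -> List.rev acc
    | `Await ->
        (match pending with
         | c :: rest ->
             (* Provide the next chunk; it replaces the previous range. *)
             Uutf.Manual.src d (Bytes.unsafe_of_string c) 0 (String.length c);
             loop rest acc
         | [] ->
             (* Nothing left: signal the end of input. *)
             Uutf.Manual.src d Bytes.empty 0 0;
             loop [] acc)
  in
  loop chunks []
```

For example, splitting the two-byte UTF-8 encoding of U+00E9 across two chunks still yields a single `Uchar, even though each Manual.src call replaces the previous input range.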

raphael-proust commented 9 years ago

Ok.

I'll remove the Substring source in the PR and use Manual in my code.

dbuenzli commented 8 years ago

Thanks, your patch is in as 1e7da8d796170284808752b