FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
314 stars 136 forks source link

Remove "final" modifier from `CBORParser` #129

Closed jinzha closed 6 years ago

jinzha commented 6 years ago

Hi CBOR developers,

I would use the CBOR lib in our project, but want to extend it to add some additional functionality. It looks like CBORParser defines a bunch of protected members which implies it could be extended, but it is "final" class, so can not be extended.

Is it intended to not be extended? If not, can you please fix this by removing the final modifier? BTW, the CBORGenerator is not final class.

cowtowncoder commented 6 years ago

No, CBORParser is not meant to be extended, although there is no strict need to have it be final. I can remove that easily.

Out of curiosity, do you have specific use for sub-classing?

cowtowncoder commented 6 years ago

I remove final modifier from CBORParser -- I can not recall explicit reason to add it, and no other parser or generator implementation is left final (at an earlier point some were). This is for 2.9 branch and master, so will be included in 2.9.5 release once that is made (and eventually 3.0).

jinzha commented 6 years ago

Hi cowtowncoder,

Thanks for quickly fixing it!

In our case, we would add a method to make CBORParser read from the specified offset of InputStream, the underling InputStream in our project support to read from a specified offset. I tried add below method to support this. Does it make sense to you?

public class CBORParserEx extends CBORParser { public void setOffset(int offset) { final ByteInputStream in = getInputStream(); in.setOffset(offset); _currInputProcessed = offset; _inputPtr = _inputEnd = 0; } }

jinzha commented 6 years ago

HiTatu,

Thanks for quickly fixing it! May I know when will the 2.9.5 be released?

In our case, we would add a method to make CBORParser read from the specified offset of InputStream, the underling InputStream in our project support to read from a specified offset. I tried add below method to support this. Does it make sense to you?

public class CBORParserEx extends CBORParser { public void setOffset(int offset) { final ByteInputStream in = getInputStream(); in.setOffset(offset); _currInputProcessed = offset; _inputPtr = _inputEnd = 0; } }

Regards, Jin

On 3/2/18 12:36 PM, Tatu Saloranta wrote:

I remove |final| modifier from |CBORParser| -- I can not recall explicit reason to add it, and no other parser or generator implementation is left final (at an earlier point some were). This is for |2.9| branch and |master|, so will be included in |2.9.5| release once that is made (and eventually 3.0).

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D369817833&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=_2rUppWNPkdxJ73wkPkJySUCWTQ2RzoEFd5Z1LP_Z24&s=R0K7evEltPIgukHTgvRanLo-FpgHjUDfws-Tq2ZGHCg&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsR7DC1XHvoHItBogGmPVaYJXvcIYks5taMxLgaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=_2rUppWNPkdxJ73wkPkJySUCWTQ2RzoEFd5Z1LP_Z24&s=qJHOyMYTymY333y1pJncEIH-jrdGDN1bAUDN-xp0Zjc&e=.

cowtowncoder commented 6 years ago

On modification: I would personally modify it by custom InputStream instead, and not try to do that on parser. Or, if you prefer, sub-classing CBORFactory and handling it here. The only thing that would not achieve would be _currInputProcessed, and if that is important, the only way would be sub-classing.

As to 2.9.5, likely some time in March, maybe in 2 weeks or so. Usually there's at least 2 months between patch releases at this point.

jinzha commented 6 years ago

Hi Tatu,

Thanks for your quick response!

In our case, there is a custom InputStream which supports to set read offset at any position, but it seems that the CBORParser has its own internal InputBuffer that caches a batch of data read from InputStream, so adjusting the offset of InputStream will not immediately impact on CBORParser, it will read from its InputBuffer until no more data and need to load more from InputStream.

Is it possible to avoid using InputBuffer in CBORParser?

Regards, Jin

On 3/2/18 11:45 PM, Tatu Saloranta wrote:

On modification: I would personally modify it by custom |InputStream| instead, and not try to do that on parser. Or, if you prefer, sub-classing |CBORFactory| and handling it here. The only thing that would not achieve would be |_currInputProcessed|, and if that is important, the only way would be sub-classing.

As to |2.9.5|, likely some time in March, maybe in 2 weeks or so. Usually there's at least 2 months between patch releases at this point.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D369957817&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=vmCRLNuXjGHgq0-DUE4UH5s_WF4sQkqYRxdAv4QJPRI&s=9yT4Dk-Bme9uMAPwV86mB0dRO3MmZPK-mmKesPjqYMs&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsVHSuSD8PFB4SZETJBcRGapthjb0ks5taWkLgaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=vmCRLNuXjGHgq0-DUE4UH5s_WF4sQkqYRxdAv4QJPRI&s=ZfQeYRl4xGh1FJMzBpWBlbK8O7PZHsuwFjVFX7aMrN4&e=.

cowtowncoder commented 6 years ago

Use of input buffer is pretty essential for reasonable performance -- overhead of single byte reads would be significant. So design does not allow skipping of buffering. It is possible to "flush" content buffered but not (yet) read at end of decoding (using method releaseBuffered() on parser).

But I guess I am not sure which aspect of buffering would be problematic for you: location, or something else?

jinzha commented 6 years ago

Hi Tatu,

On 3/5/18 12:16 PM, Tatu Saloranta wrote:

Use of input buffer is pretty essential for reasonable performance -- overhead of single byte reads would be significant. So design does not allow skipping of buffering. It is possible to "flush" content buffered but not (yet) read at end of decoding (using method |releaseBuffered()|on parser).

But I guess I am not sure which aspect of buffering would be problematic for you: location, or something else?

Yes. it is location. In our case, the capability to read value in a customized order is required. So we just go through the InputStream to collects the offsets for each value, then read values in a customized order by setting the offset of InputStream.

Based on my performance testing, using Jackson CBOR for above case is 12% slower than using serialization module implemented by ourselves.

I feel like the way we use Jackson CBOR to read value on specified offset is not suitable for the current Jackson CBOR implementation that uses a internal buffer to cache data. In our case, the process is like set offset1 -> parse value1 -> set offset2 -> parse value2 -> .... When setting offset, actually it resets the InputBufer(see below setOffset() mehtod). During parse value, it will load 8K data to InputBuffer, but parse 1st value only, then reset again, load another 8K data and parse 2nd value, this process might not be that efficient I think. What do you think?

public class CBORParserEx extends CBORParser { public void setOffset(int offset) { final ByteInputStream in = getInputStream(); in.setOffset(offset); _currInputProcessed = offset; _inputPtr = _inputEnd = 0; } }

Regards, Jin

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D370305528&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=Ju-E8nByOjLWwY9HroVs_pYLQL-pz-ZRbry44K8Ff3Q&s=iSpwPX5urwNp7mtVCNvBUewh-F6VZWzvT8wLpgt8gEI&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsY4Ht9wG7DnGS7cQKyNAH2ERv0O3ks5tbLwNgaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=Ju-E8nByOjLWwY9HroVs_pYLQL-pz-ZRbry44K8Ff3Q&s=nPE2Kfn_rGm2ywmwU9iahU1xDhu8WYAMtmPClBkrh9U&e=.

jinzha commented 6 years ago

Hi Tatu,

Thanks for your response as always!

On 3/5/18 12:16 PM, Tatu Saloranta wrote:

Use of input buffer is pretty essential for reasonable performance -- overhead of single byte reads would be significant. So design does not allow skipping of buffering. It is possible to "flush" content buffered but not (yet) read at end of decoding (using method |releaseBuffered()|on parser).

But I guess I am not sure which aspect of buffering would be problematic for you: location, or something else?

Yes. it is location. In our case, the capability to read value in a customized order is required. So we just go through the InputStream to collects the offsets for each value, then read values in a customized order by setting the offset of InputStream.

Based on my performance testing, using Jackson CBOR for above case is 12% slower than using serialization module implemented by ourselves.

I feel like the way we use Jackson CBOR to read value on specified offset is not suitable for the current Jackson CBOR implementation that uses a internal buffer to cache data. In our case, the process is like set offset1 -> parse value1 -> set offset2 -> parse value2 -> .... When setting offset, actually it resets the InputBufer(see below setOffset() mehtod). During parse value, it will load 8K data to InputBuffer, but parse 1st value only, then reset again, load another 8K data and parse 2nd value, this process might not be that efficient I think. What do you think?

public class CBORParserEx extends CBORParser { public void setOffset(int offset) { final ByteInputStream in = getInputStream(); in.setOffset(offset); _currInputProcessed = offset; _inputPtr = _inputEnd = 0; } }

Regards, Jin

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D370305528&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=Ju-E8nByOjLWwY9HroVs_pYLQL-pz-ZRbry44K8Ff3Q&s=iSpwPX5urwNp7mtVCNvBUewh-F6VZWzvT8wLpgt8gEI&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsY4Ht9wG7DnGS7cQKyNAH2ERv0O3ks5tbLwNgaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=Ju-E8nByOjLWwY9HroVs_pYLQL-pz-ZRbry44K8Ff3Q&s=nPE2Kfn_rGm2ywmwU9iahU1xDhu8WYAMtmPClBkrh9U&e=.

cowtowncoder commented 6 years ago

In general intent is not to modify underlying InputStream, but to construct separate CBORParser instances for distinct input documents. This is safer and overhead for parser instances should be modest.

But your use case where random-access is required is quite different from kinds of use cases I have considered. I can see how parser, as is, would have significant overhead.

If you do know the length of document (or document fragment) to parse you can prevent InputStream from exposing more content than is needed: so, for example, if document is known to be 400 bytes long, only return up 400 bytes. That would avoid significant overhead of reading 8000 bytes, discarding 7600.

jinzha commented 6 years ago

Hi Tatu,

Thanks for the analysis!

In our cases, the length of documents might be variable I think, it looks like our random-access case is not good case using the existing Jackson CBOR.

Regards, Jin On 3/7/18 12:39 PM, Tatu Saloranta wrote:

In general intent is not to modify underlying |InputStream|, but to construct separate |CBORParser| instances for distinct input documents. This is safer and overhead for parser instances should be modest.

But your use case where random-access is required is quite different from kinds of use cases I have considered. I can see how parser, as is, would have significant overhead.

If you do know the length of document (or document fragment) to parse you can prevent |InputStream| from exposing more content than is needed: so, for example, if document is known to be 400 bytes long, only return up 400 bytes. That would avoid significant overhead of reading 8000 bytes, discarding 7600.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D371020541&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=vy9JlEKneMWD-9v0G0FoJ685qBDtl1u-z1Pj-_I0CU4&s=2UBfMbMf7L-hg7YQFMEs--47yvoGmG4UTo43GbY2d0o&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsUiBVFN0mBOLWXlAYO8HMHO78TBVks5tb2R8gaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=vy9JlEKneMWD-9v0G0FoJ685qBDtl1u-z1Pj-_I0CU4&s=V7RFZ6dHmn8o7DgmJ4pwkvqRmA2VVUggLpQ7MuTLKTw&e=.

cowtowncoder commented 6 years ago

@jinzha Yes, it sounds like this is the case unfortunately. Optimizing for such access is tricky -- although something like this is actually implemented in Jackson JSON parser for DataInput. It results in -40% performance decrease (which may be more than what would happen with InputStream; perhaps DataInput access pattern is more difficult for JVM to optimize), but does work with no look-ahead/read-ahead (since DataInput does not have means to detect end of input; reader must know how much is still needed). Tests are in https://github.com/FasterXML/jackson-benchmarks.

Similarly, async parsers (for JSON, Smile; perhaps in future for CBOR) use different access pattern as input is fed, not read via blocking stream.

So it would be possible to implement no-caching parser; and make that pluggable via CBORFactory. I don't have time to do that myself, but if anyone wants to try that in future, I can help with integration.

For now your best bet may be to something else that exists, esp. if you have your own decoder.

jinzha commented 6 years ago

Hi Tatu,

Thanks you very much!

It is very nice experience to use Jackson CBOR library. I think you are right, we probably have to keep to use own decoder for now.

Regards Jin

On 3/8/18 2:33 AM, Tatu Saloranta wrote:

@jinzha https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_jinzha&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=8HdcAxPS-dklEv4qsFn2_tCsPJsDq8zqc_Du0Ptdf_k&s=Z4FJXJN0_zo4WhdUImK8tBa9SUHhr49ov0LR-6Db1mg&e= Yes, it sounds like this is the case unfortunately. Optimizing for such access is tricky -- although something like this is actually implemented in Jackson JSON parser for |DataInput|. It results in -40% performance decrease (which may be more than what would happen with |InputStream|; perhaps |DataInput| access pattern is more difficult for JVM to optimize), but does work with no look-ahead/read-ahead (since |DataInput| does not have means to detect end of input; reader must know how much is still needed). Similarly, async parsers (for JSON, Smile; perhaps in future for CBOR) use different access pattern as input is fed, not read via blocking stream.

So it would be possible to implement no-caching parser; and make that pluggable via |CBORFactory|. I don't have time to do that myself, but if anyone wants to try that in future, I can help with integration.

For now your best bet may be to something else that exists, esp. if you have your own decoder.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_FasterXML_jackson-2Ddataformats-2Dbinary_issues_129-23issuecomment-2D371237813&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=8HdcAxPS-dklEv4qsFn2_tCsPJsDq8zqc_Du0Ptdf_k&s=gFleEk-Z7PjcThg1kD38du7vcdaQVRq7oKif32br5Mo&e=, or mute the thread https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AHahsdw4UzXEmQ7IGx6mzLYsx4-2D0ofNzks5tcCgNgaJpZM4SZWDC&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=wb6jsjsd3i3-fQugULtSmsNMCoqysPMHv94nEiI-Jv8&m=8HdcAxPS-dklEv4qsFn2_tCsPJsDq8zqc_Du0Ptdf_k&s=sammE3GII5sWdKcHQY8ejLBbGHPk3td8dE7BAcznWHY&e=.