Closed marcospassos closed 4 years ago
First of all, thank you for contributing this! I did not realize that this path could be used during actual serializations, results not cached (if I had, would have raised the concern).
I have some concerns about implementation, added notes. I think those should be resolvable.
I was also wondering about where to add this (2.11 vs 2.12). I suspect there is value in getting it in earlier (2.11.3 -- altho I really hope to get 2.12.0.rc1 out within 1 month now), and maybe there is low risk for issues. So we can go with 2.11 branch as target.
Not sure how to tie the cache to the AvroFactory
or AvroMapper
. This method is transitively used across several classes, including value-objects not related to any of these instances. An example is AvroSchema
, instantiated directly in most cases.
I've tracked down all classes using these methods, and the good news is that only two classes use it directly: AvroReaderFactory
and AvroWriteContext
.
As this is a severe issue for us, I'm going to work on it. I just need some guidance as it does seem to fit in the current design.
Hi @cowtowncoder, any comments?
What about using a static WeakHashMap
where the class loader and the name cache are the keys and values respectively?
@marcospassos If I remember correctly, JDK offerings for weak/soft reference handling is unfortunately weak. Let me see if I can find a proper place for cache. This is the problem with static utility and other things without context; they can not really have state, including caching.
Ok, yes, this is very tricky. For a moment I thought much/most could be moved within AvroSchema
but unfortunately cursed Avro lib type Schema
is used instead so can not really stick it in there -- fundamentally all we need is a one-time decision of whether to use one full name or another, so if we could wrap Schema
in something, consistently (and hold on to references to that wrapper), no shared caching should be needed.
I hope to look more into this but I concur, this is badly implemented at this point & tricky to uproot. I wish I realized exactly how nasty this is -- now I understand why performance consequences are disastrous: failing to find a class can be VERY VERY expensive, esp. on bigger classpaths.
One thing that would be useful, if it was possible to obtain, would be to see specific path of calls, if profiler could give that information. There are quite a few callpaths to getTypeId(Schema)
and getFullName(Schema)
but I suspect they are not equally problematic.
One other thought: I am beginning to think that the "solution" in #195 is wrong -- anything that has to do class existence lookups seems like a Wrong Solution to a problem. So I am not ruling out a possibility of actually reverting that change.
That would not resolve the issue wrt 2.11 so perhaps static caching is a lesser evil, as long as size is bounded and both key and value are Strings. There is the need to use data struct that works in multi-threaded environment and com.fasterxml.jackson.databind.util.LRUMap
works well.
@cowtowncoder Thank you again for the detailed review and support. I've pushed a commit that makes use of LRUMap
instead of HashMap
. I didn't include tests because I don't see how we could test it, so I left it out.
Please let me know if you need anything else before merging this. We are looking forward to seeing it included as part of the next release.
@marcospassos thanks! I think I'll take this for 2.11, hoping things can be cleared up in future. :)
@marcospassos I did some changes mostly just to isolate cache bit further and use separate cache key type (if optimization is needed going all in :-D ). I realize that testing of this part is difficult, but it probably would be good if you could at least check that performance improvement is as expected (my main concern is that I managed to do something silly to render caching not work).
LGTM. We're going to monitor the IO metrics after the next release. I'll keep you posted.
anything that has to do class existence lookups seems like a Wrong Solution to a problem.
While I am in 100% agreement with this statement, Avro defined this behavior for matching nested classes in their reference implementation. :(
It's a choice between not being compatible with nested classes when using an Apache 1.9 schema, or doing class existence lookups.
... and it looks like Apache also handled this problem by just caching the Schema -> resolved full name as well :/
The "Right" Solution would be to emit a java-class
(AvroSchemaHelper.AVRO_SCHEMA_PROP_CLASS
) schema property on record schemas, and check/use that before doing a class lookup (Or only emitting that for nested classes, and assuming if it doesn't exist and the namespace doesn't exist, then it's a normal name, avoiding a class lookup in all cases); If you're in an ecosystem where you're using Jackson to generate the schemas and deserialize the records, that would avoid the class lookups altogether.
The "Right" Solution would be to emit a
java-class
(AvroSchemaHelper.AVRO_SCHEMA_PROP_CLASS
) schema property on record schemas, and check/use that before doing a class lookup (Or only emitting that for nested classes, and assuming if it doesn't exist and the namespace doesn't exist, then it's a normal name, avoiding a class lookup in all cases); If you're in an ecosystem where you're using Jackson to generate the schemas and deserialize the records, that would avoid the class lookups altogether.
But it would introduce a non-standard attribute that can eventually break if Avro parser doesn't recognize it anymore, or am I missing something?
non-standard attribute
It's a standard attribute used everywhere except record/fixed/enum schemas (maps, arrays, large numbers, etc.), for some reason. The Apache parser wouldn't recognize it at all on records, so it'd just ignore it and fall back to class lookups if you're using Apache to deserialize avro records.
If Jackson emitted the property on every record/enum/fixed schema, you could maintain compatibility with deserializing all Apache 1.8 and 1.9 schemas, while optimizing the usecase where Jackson is being used both for schema generation and deserialization at the cost of bloating the schema size a bit. Every other combination would require fallbacks to class lookups (so we'd still need the cache).
The benefit is that it creates a potential path forward (if we could get Apache to adopt the same behavior) of moving back to a world where the general case doesn't require class existence checks (they would become a special case when deserializing from a Apache 1.9 schema)
In our case, we don't use schema generation, but I understand that it should work in the same way provided that we include the java-class
attribute in the Avro schema, right? If so, indeed, it does seem to be the right path.
That would be the idea in such a hypothetical scenario. You could retroactively generate the property for any 1.8 schemas (or 1.9+ schemas if you're 100% sure there are no nested classes in use) and update the stored schemas for your data to add the property to bypass class existence checks, without breaking compatibility with anything. Which is still a big mess, but it's a workable mess for someone that has stored historical data they can't throw out.
The other option is to look into adding some compatibility feature flags to the AvroMapper
whereby you could disable 1.9+ support to avoid the class existence checks if you're not using 1.9+ schemas.
If I understand the idea correctly, I like it: additional metadata to avoid lookups in best case; but fallbacks for other cases.
We use Jackson in our stream processing pipeline, and we noticed that the serialization process was consuming most of the CPU time. After some profiling, we tracked down the issue to the method which resolves the record's name. In the current implementation, for every record visited, the class loader checks if a class with the record's name exists, causing an IO bottleneck in high loads. By caching the result, we were able to boost our performance by 7000%.