Open michaelfruth opened 4 years ago
Hi Michael,
Thank you for this issue.
I am not sure your description of the root cause is correct. Isn't your suggestion what is already being done here for string enums and here for anyOf with several patterns?
Hello, I noticed a performance problem as soon as the schema contains the following structure:
... "anyOf": [ {"enum": ["aa", "bb", "cc"]}, {"pattern": "pattern1"}, {"pattern": "pattern2"}, {"pattern": "pattern3"}, ... ] ...
The performance can be massively improved by preprocessing the schema. All enum values and patterns should be combined into a single pattern, as shown in the example below:
... "anyOf": [ {"pattern": "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3"} ] ...
Currently, you iteratively append the enum values and regex patterns to a single regex and, in every iteration, compute the intersection between the current pattern and ".*". This is very expensive and results in poor performance (for this specific kind of schema).
I added an example JSON file (anyOf.json) that shows the problem. On my machine, checking the file against itself (command: jsonsubschema anyOf.json anyOf.json) takes about 50-60 seconds to produce the result (LHS :< RHS and RHS :< LHS). With the preprocessing applied, it takes about 0.04 seconds. I also attached a Python script (smaller_anyOf.py) that contains the preprocessing. The script combines the string enum values and all patterns into a single pattern, as shown in the example above. Attachment: AnyOf.zip
When the string enum values are transformed into a regex, special regex characters (e.g. ".", "-", ...) are escaped so that the regex matches exactly the original string.
... "enum": ["ab-c"] ...
will be transformed to ... "pattern": "^ab\\-c$" ...
Be careful, this can currently lead to another problem - see #6.
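For illustration, the preprocessing described above could look roughly like the sketch below. This is not the attached smaller_anyOf.py. The function name merge_string_anyof is hypothetical, and the sketch assumes every anyOf branch is either a pure string enum or a pure pattern and that the schema is used in a string-typed context. Any other anyOf is left unchanged.

```python
import re


def merge_string_anyof(schema):
    """Return a copy of `schema` whose "anyOf" branches are merged into one pattern.

    Hypothetical helper: only handles branches of the form
    {"enum": [<strings>]} or {"pattern": "..."}; anything else is returned as-is.
    """
    branches = schema.get("anyOf")
    if not branches:
        return schema

    alternatives = []
    for branch in branches:
        keys = set(branch)
        if keys == {"enum"} and all(isinstance(v, str) for v in branch["enum"]):
            # Anchor each literal and escape regex metacharacters such as "." or "-".
            alternatives.extend("^" + re.escape(v) + "$" for v in branch["enum"])
        elif keys == {"pattern"}:
            alternatives.append(branch["pattern"])
        else:
            # Unsupported branch (e.g. non-string enum): leave the schema untouched.
            return schema

    merged = dict(schema)
    merged["anyOf"] = [{"pattern": "|".join(alternatives)}]
    return merged
```

Applied to an anyOf with the enum ["aa", "bb", "cc"] followed by pattern branches, this yields a single branch whose pattern starts with ^aa$|^bb$|^cc$ and continues with the original patterns joined by "|".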
Best Regards Michael