evestera / json_typegen

Tools and libraries to create types for Rust, Kotlin, TypeScript and Python from JSON samples
https://typegen.vestera.as
Apache License 2.0
268 stars 26 forks

Integration with schema_analysis crate #28

Closed · QuartzLibrary closed this issue 3 years ago

QuartzLibrary commented 3 years ago

Hello!

First things first: thank you for the great library!

I am the author of schema_analysis, a crate that allows anyone to infer the schema of any self-describing format using Serde. Our projects are somewhat similar, but with different focuses and strengths, which is why I have tried to integrate the two together.

As mentioned here, I would be interested in working out a way to more seamlessly access the internals of json_typegen_shared to link directly to your library instead of the somewhat ugly workaround of forking and adding pub to what I need.

To clarify the context: I currently expose access to the shape, options, and generation modules in a fork of json_typegen_shared. You can see here how I am using them.

If you were interested, I would also be happy to use a codegen function that takes a Shape directly, along the lines of this one. This would also reduce my imports to the shape and options modules.

In any case, thank you again for your open-source work!

Best, Quartz

evestera commented 3 years ago

Hi,

Saw your fork before you made the main repo public, and was wondering what you were planning to do with it.

I was initially quite skeptical of expanding the API of the crate, since I know breaking changes will have to happen, but as long as you understand it's still somewhat internal API, I guess it's better than a fork. And you are not the first to ask about using the json_typegen_shared crate, so I'm already avoiding unnecessary breaking changes anyway.

I've exposed Shape and added a codegen_from_shape you can use, and Options was already exposed, so hopefully that should be all you need to avoid the fork?
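To illustrate the rough idea of codegen from a shape (with toy stand-in types, not the crate's actual Shape, Options, or function signature), it is essentially a fold over the shape tree that emits type definitions as it goes:

```rust
// Toy illustration only: these types are hypothetical stand-ins,
// NOT json_typegen_shared's real Shape or codegen_from_shape API.

#[derive(Debug, Clone)]
enum Shape {
    Bool,
    Integer,
    StringT,
    VecT(Box<Shape>),
    Struct(Vec<(String, Shape)>),
}

/// Return a Rust type expression for `shape`, appending any generated
/// struct definitions to `out`.
fn codegen_from_shape(name: &str, shape: &Shape, out: &mut String) -> String {
    match shape {
        Shape::Bool => "bool".to_string(),
        Shape::Integer => "i64".to_string(),
        Shape::StringT => "String".to_string(),
        Shape::VecT(elem) => {
            let inner = codegen_from_shape(name, elem, out);
            format!("Vec<{}>", inner)
        }
        Shape::Struct(fields) => {
            let mut def = format!("struct {} {{\n", name);
            for (field, field_shape) in fields {
                // Derive a (hypothetical) nested struct name from the field name.
                let mut chars = field.chars();
                let nested_name = match chars.next() {
                    Some(c) => c.to_uppercase().collect::<String>() + chars.as_str(),
                    None => String::new(),
                };
                let ty = codegen_from_shape(&nested_name, field_shape, out);
                def.push_str(&format!("    {}: {},\n", field, ty));
            }
            def.push_str("}\n");
            out.push_str(&def);
            name.to_string()
        }
    }
}

fn main() {
    let shape = Shape::Struct(vec![
        ("id".to_string(), Shape::Integer),
        ("tags".to_string(), Shape::VecT(Box::new(Shape::StringT))),
    ]);
    let mut out = String::new();
    codegen_from_shape("Root", &shape, &mut out);
    println!("{}", out);
}
```

The real function of course has to handle optionals, maps, tuples, name collisions, and so on, but the overall structure is the same recursive walk.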

I've actually been working on replacing the serde_json parsing with doing the inference straight on a stream of tokens from a custom lexer, so I'm actually quite intrigued by your approach of using the serde visitor API. Especially since I have also considered adding support for more input formats, and just using serde is probably a better approach than custom lexers. Have you tried it on any large files yet (e.g. ~1GB of JSON)?

QuartzLibrary commented 3 years ago

> Saw your fork before you made the main repo public, and was wondering what you were planning to do with it.

Yes, I didn't want to bother you if it turned out I had to stop for some reason, plus getting it working from the fork allowed me to be very specific.

> but as long as you understand it's still somewhat internal API, I guess it's better than a fork.

I totally understand, and I don't mind breaking changes, especially if they are semver-compatible. In any case, I can jump in to fix anything that comes up.

> that should be all you need to avoid the fork?

Yes, those changes are sufficient for my use case. I have pushed a branch that relies directly on this repo: https://github.com/QuartzLibrary/schema_analysis/tree/json_typegen. This will also fix an issue with docs.rs not accepting remote git dependencies once it makes it to crates.io.

> I've actually been working on replacing the serde_json parsing with doing the inference straight on a stream of tokens from a custom lexer, so I'm actually quite intrigued by your approach of using the serde visitor API. Especially since I have also considered adding support for more input formats, and just using serde is probably a better approach than custom lexers.

Glad you found it intriguing! From what I have found, Serde works really well for this. Very few formats (toml with dates, for example) provide more information than the simple Serde visitor anyway, so I don't think you could gain much from something custom. The main disadvantage of the approach is that any non-shape analysis is mostly restricted to the leaf nodes (structs and lists expose only a list of field names and a length, respectively), but in most cases that is not really an issue.
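For anyone following along: the heart of this kind of streaming inference, whether it's fed by a serde visitor or a custom lexer, is a merge over the shapes observed so far. A toy sketch of that fold (illustrative types only, not the actual API of either crate):

```rust
// Toy sketch of merging inferred shapes across samples. These variants are
// hypothetical stand-ins, not json_typegen's or schema_analysis's real types.

#[derive(Debug, Clone, PartialEq)]
enum Shape {
    Bottom,                 // no information observed yet
    Integer,
    StringT,
    Optional(Box<Shape>),   // value was sometimes null or missing
    Any,                    // incompatible shapes observed
}

/// Combine two observed shapes into the most specific shape covering both.
fn merge(a: Shape, b: Shape) -> Shape {
    use Shape::*;
    match (a, b) {
        (Bottom, s) | (s, Bottom) => s,
        (Optional(x), Optional(y)) => Optional(Box::new(merge(*x, *y))),
        (Optional(x), y) | (y, Optional(x)) => Optional(Box::new(merge(*x, y))),
        (x, y) if x == y => x,
        _ => Any,
    }
}

fn main() {
    use Shape::*;
    // Same shape twice stays specific; mixing with null widens to Optional;
    // truly incompatible shapes collapse to Any.
    println!("{:?}", merge(Integer, Integer));
    println!("{:?}", merge(Integer, Optional(Box::new(Integer))));
    println!("{:?}", merge(Integer, StringT));
}
```

The interesting part is that this fold only ever needs the current aggregate plus the next observed value, which is exactly why it works as a streaming visitor without materializing the whole document.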

> Have you tried it on any large files yet (e.g. ~1GB of JSON)?

I have tried it with:

It could probably be faster if we skipped the secondary analysis/sampling (which you can't yet disable).

Of course, all numbers are indicative, just a ballpark of how it runs on my laptop. They are also all with the files fully in memory; having to buffer would likely slow things down.

evestera commented 3 years ago

Sorry, it seems I completely forgot to make an actual release once I had made the necessary changes. 😬 I've released 0.7.0 now. I also made another change that may be of interest to you: option parsing is now an optional feature, so if you use default-features = false you should pull in fewer transitive dependencies now.
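In Cargo.toml that would look something like this (the version is the 0.7.0 from this thread):

```toml
[dependencies]
json_typegen_shared = { version = "0.7", default-features = false }
```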

Closing this issue for now, but of course feel free to add further comments here or open another issue if there are other changes that would be useful. 🙂