Closed mromanello closed 3 years ago
Yes, this should be better documented and/or thought out. To answer the question narrowly, `dpi` in particular is optional. passim doesn't use it, although we use it elsewhere in our pipeline. The other annoyance of this schema is that we use Int types to save storage, but Spark's JSON loading reads all integers as Long. For the moment, therefore, I'd only advise including these fields if you're loading from Parquet and not JSON. So, yes, this could be done better.
Ok, and thanks David for the prompt reply!
Having a formal JSON schema (together with some valid examples) added to the repo may help users to prepare their data for passim (and reduce questions like this ;-) ). I can volunteer to contribute one if that's useful -- I'd probably have one anyway for our project.
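To make the idea concrete, here is a minimal sketch of the kind of check such a schema would enable, written with only the Python standard library. It is not passim's actual schema: the thread confirms only the `pages` field, the optional `dpi`, and integer coordinate values, so the names `id`, `width`, and `height` and the helper functions below are assumptions chosen for illustration.

```python
import json

# ASSUMPTION: these field names are illustrative only. The thread confirms
# just `pages`, an optional `dpi`, and integer coordinate values.
REQUIRED_PAGE_FIELDS = {"id": str, "width": int, "height": int}
OPTIONAL_PAGE_FIELDS = {"dpi": int}


def has_type(value, expected):
    """isinstance check that does not accept bool where int is expected."""
    if expected is int and isinstance(value, bool):
        return False
    return isinstance(value, expected)


def check_page(page):
    """Return a list of human-readable problems for one `pages` entry."""
    problems = []
    for name, expected in REQUIRED_PAGE_FIELDS.items():
        if name not in page:
            problems.append(f"missing required field '{name}'")
        elif not has_type(page[name], expected):
            problems.append(f"field '{name}' should be {expected.__name__}")
    # Optional fields are only type-checked when they are present.
    for name, expected in OPTIONAL_PAGE_FIELDS.items():
        if name in page and not has_type(page[name], expected):
            problems.append(f"field '{name}' should be {expected.__name__}")
    return problems


def check_document(line):
    """Validate one JSON-lines record; return (page_index, problem) pairs."""
    doc = json.loads(line)
    return [
        (index, problem)
        for index, page in enumerate(doc.get("pages", []))
        for problem in check_page(page)
    ]


ok = '{"id": "doc1", "pages": [{"id": "p1", "width": 2000, "height": 3000}]}'
bad = '{"id": "doc1", "pages": [{"id": "p1", "width": "2000", "height": 3000, "dpi": 300}]}'
print(check_document(ok))
print(check_document(bad))
```

A record that omits `dpi` passes, while one that stores a width as a string is reported. A real JSON Schema file would express the same constraints declaratively, but JSON itself does not distinguish integer widths, so it could not settle the Int versus Long question on its own.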
Back to your comment about integers: does this affect all integer fields? I will leave out `dpi` but still have integers for coordinate values. Since I'll be loading from JSON, will data simply be passed through, or will this affect passim's computation?
Hello --
I have a question concerning the JSON schema of passim input documents, especially the `pages` field. Are all fields required, or is it possible to leave out some of them (e.g. `dpi`)?