dasmiq / passim

Detect and align similar passages

optional vs required fields in input JSON passim ? #5

mromanello closed 3 years ago

mromanello commented 5 years ago

Hello --

I have a question concerning the JSON schema of passim input documents, especially the pages field.

case class Coords(x: Int, y: Int, w: Int, h: Int, b: Int)
case class Region(start: Int, length: Int, coords: Coords)
case class Page(id: String, seq: Int, width: Int, height: Int, dpi: Int, regions: Array[Region])

Are all fields required, or is it possible to leave out some of them (e.g. dpi)?

dasmiq commented 5 years ago

Yes, this should be better documented and/or thought out. To answer the question narrowly, dpi in particular is optional. passim doesn't use it, although we use it elsewhere in our pipeline. The other annoyance of this schema is that we use Int types to save storage, but Spark's JSON loading reads all integers as Long. For the moment, therefore, I'd only advise including these fields if you're loading from Parquet and not JSON. So, yes, this could be done better.
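To make the Int/Long mismatch concrete, here is a minimal sketch in plain Scala, with no Spark dependency. The case classes are copied from the question. The `Narrow` object and its methods are hypothetical helpers written for illustration and are not part of passim. It assumes the coordinate values arrive as Long, as they would after Spark's JSON loading, and narrows them to Int with a range check.

```scala
// Case classes as quoted in the question above.
case class Coords(x: Int, y: Int, w: Int, h: Int, b: Int)
case class Region(start: Int, length: Int, coords: Coords)
case class Page(id: String, seq: Int, width: Int, height: Int, dpi: Int, regions: Array[Region])

// Hypothetical helper (not part of passim): converts Long values,
// such as those produced when integers are read from JSON, to Int.
object Narrow {
  // Math.toIntExact throws ArithmeticException if the value does not fit in an Int,
  // instead of silently wrapping around like Long.toInt would.
  def toInt(v: Long): Int = Math.toIntExact(v)

  // Build a Coords value from Long inputs, checking every field.
  def coords(x: Long, y: Long, w: Long, h: Long, b: Long): Coords =
    Coords(toInt(x), toInt(y), toInt(w), toInt(h), toInt(b))
}
```

For example, `Narrow.coords(10L, 20L, 300L, 40L, 35L)` yields `Coords(10, 20, 300, 40, 35)`, while a value above `Int.MaxValue` raises an exception rather than producing a wrong coordinate. Whether passim itself needs such a conversion step is not stated in this thread.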

mromanello commented 5 years ago

Ok, and thanks David for the prompt reply!

Having a formal JSON schema (together with some valid examples) added to the repo may help users to prepare their data for passim (and reduce questions like this ;-) ). I can volunteer to contribute one if that's useful -- I'd probably have one anyway for our project.

Back to your comment about integers: does this affect all integer fields? I will leave out dpi but still have integers for coordinate values. Since I'll be loading from JSON, will data simply be passed through, or will this affect passim's computation?