Output JSON Schema

kavitharaju commented 1 year ago

This PR

Tries to standardize the JSON/Dict output formats in the Python module (#180). The previous format (in 3.x) gave a simpler structure for simple USFM files and a more nested structure when nesting was present. It has been replaced by two output formats, both with a consistent structure:
1. Nested JSON, which has the complex nesting structure required to support all possible marker usages in USFM
2. Flat JSON, which gives the simplest possible format at the cost of losing structural information

Sharing the current output for review: JSON_refactor.md

Changes Grammar and Syntax Tree: Call the content of the \usfm marker "version" explicitly
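
To make the trade-off between the two formats concrete, here is a minimal sketch. The shapes and key names are assumptions made up for illustration, not the actual usfm-grammar output, and `flatten` is a hypothetical helper, not part of the module. It shows how flattening keeps the markers in document order but loses where a paragraph ends.

```python
# Hypothetical nested shape: markers are keys, a paragraph (\p) contains its verses.
nested = [
    {"id": "GEN"},
    {"c": "1"},
    {"s": "Creation"},
    {"p": [{"v": "1"}, "In the beginning ", {"v": "2"}, "And the earth "]},
]


def flatten(node):
    """Return a flat list of (marker, value) pairs in document order."""
    flat = []
    if isinstance(node, str):
        # Bare strings are running text inside a container marker.
        flat.append(("text", node))
    elif isinstance(node, list):
        for item in node:
            flat.extend(flatten(item))
    elif isinstance(node, dict):
        for marker, value in node.items():
            if isinstance(value, str):
                flat.append((marker, value))
            else:
                # Container marker: keep the marker itself, drop the nesting.
                flat.append((marker, None))
                flat.extend(flatten(value))
    return flat


flat = flatten(nested)
```

After flattening, `("p", None)` is just one more entry in the list, so a consumer can no longer tell which verses the paragraph enclosed. That is the loss of structural information mentioned in item 2.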

What is pending in this task

The JSON schema has not been defined for the above outputs
The to_list() conversion has to be re-written to work with the new Flat JSON (:warning: The tests failing in this PR are related to this)

cmahte commented 1 year ago

Looking at the various options, I don't see BCV-flat. The most useful "flat" option is going to be the way scripture is always referred to: Book Chapter:verse.

Now, in reality that means BICV (Book, Intro, Chapter, Verse), and it also means that some information in USFM is out of place: \s, \r, and all paragraph types belong to the \v that immediately follows them.

But for nesting that would be the most intuitive and useful.

Level 1 BookID (including \usfm \h \toc and \rem tags that occur before any \mt)
Level 2 Intro (including \imt \mt \ip etc.)
Level 3 Chapter (including \c \ca \cp \cl \d)
Level 4 Verse (including \p (etc) \r \s)

or

Level 1 BookID (including \usfm \h \toc and \rem tags that occur before any \mt)
Level 2 Intro (including \imt \mt \ip etc.)
Level 2 Chapter (including \c \ca \cp \cl \d)
Level 3 Verse (including \p (etc) \r \s)

In the second option, intro chapters effectively become chapter zero. I don't think that is required in JSON, but in OSIS and other languages it is a convention to mark a chapter 0, and it helps front-end programs display study materials better than cramming them above or into chapter 1 verse 1. The second option is, however, problematic for some Apocrypha books that have 'canonized' introductions before chapter 1; Sirach, I believe, has 14 verses. Separating the scripture from modern text becomes a problem if chapter zero is assumed non-canon but there are scripture verses before the chapter 1 mark. With a separate level, both the apocryphal pre-chapter-1 verses and the "book" divisions in Psalms are possible: mark the level 2 intro level, then continue with the chapter level.
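
A rough sketch of the first option (separate intro level) might look like the following. Everything here is an assumption for illustration: the `(marker, value)` token list, the `build_bicv` name and the output keys are hypothetical and not part of usfm-grammar. It keeps pre-chapter material in its own level and attaches \s, \r and paragraph markers to the \v that follows them.

```python
def build_bicv(book_id, tokens):
    """Group (marker, value) tokens into book -> intro -> chapters -> verses."""
    book = {"id": book_id, "intro": [], "chapters": {}}
    chapter = None  # current chapter dict; None while still before the first \c
    verse = None    # current verse dict; None until the first \v of a chapter
    pending = []    # \s, \r, \p ... waiting for the verse that follows them
    for marker, value in tokens:
        if marker == "c":
            chapter = {"verses": {}}
            book["chapters"][value] = chapter
            verse = None
        elif chapter is None:
            # Everything before the first \c stays in the separate intro level.
            book["intro"].append((marker, value))
        elif marker == "v":
            verse = {"before": pending, "text": ""}
            chapter["verses"][value] = verse
            pending = []
        elif marker == "text":
            # Text that appears before any verse in a chapter is ignored here.
            if verse is not None:
                verse["text"] += value
        else:
            pending.append((marker, value))
    return book


tokens = [
    ("imt", "Introduction"),
    ("ip", "A modern note"),
    ("c", "1"),
    ("s", "Creation"),
    ("p", ""),
    ("v", "1"),
    ("text", "In the beginning"),
    ("v", "2"),
    ("text", "And the earth"),
]
book = build_bicv("GEN", tokens)
```

Folding `book["intro"]` into a chapter keyed "0" would give the second option, but a consumer could then no longer tell from the key alone whether that chapter holds modern introduction or canonical verses.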

kavitharaju commented 1 year ago

In the previous version of usfm-grammar (2.x) we followed a structure similar to what you suggest, following the intuition that a Book-Chapter-Verse structure is what will be used most.

Level 1 BookID (including \usfm \h \toc and \rem tags that occur before any \mt)
Level 2 Intro (including \imt \mt \ip etc.)
Level 3 Chapter (including \c \ca \cp \cl \d)
Level 4 Verse (including \p (etc) \r \s)

But in this version we are trying to keep our output structure as close as possible to what is natural in USFM, while still bringing in the advantage of more programmer-friendly formats. The main difference is that we don't put paragraphs (\p) under verses (\v), but verses under paragraphs. Also, section headings (\s) and related markers (\r) come under the chapter (\c). The rest of the nesting is kept in the Nested JSON.
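
As an illustration of the paragraph-first nesting, here is a small sketch. The dict shape and the `verses_in` helper are assumptions for the example, not the actual Nested JSON keys. Section headings sit directly under the chapter, verses sit inside paragraphs, and a verse can continue into the next paragraph as it does in USFM.

```python
# Hypothetical chapter node: \s directly under \c, \v markers inside \p.
chapter = {
    "c": "1",
    "contents": [
        {"s": "Creation"},
        {"p": [{"v": "1"}, "In the beginning ", {"v": "2"}, "And the earth "]},
        {"p": ["was without form"]},  # verse 2 continues in a new paragraph
    ],
}


def verses_in(chapter_node):
    """Collect verse text by walking paragraphs in document order."""
    verses = {}
    current = None  # number of the verse currently being filled
    for block in chapter_node["contents"]:
        for items in block.values():
            if not isinstance(items, list):
                continue  # e.g. a section heading, which carries no verse text
            for item in items:
                if isinstance(item, dict) and "v" in item:
                    current = item["v"]
                    verses[current] = ""
                elif isinstance(item, str) and current is not None:
                    verses[current] += item
    return verses


by_verse = verses_in(chapter)
```

A Book-Chapter-Verse view can still be derived from this layout by walking it, while the paragraph boundaries remain available to callers that need them.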

kavitharaju commented 1 year ago

Closing this as we will be moving to USJ

Bridgeconn / usfm-grammar

Output JSON Schema #202