adyeths / u2o

USFM to OSIS bible format converter.
The Unlicense

Would u2o.py correctly process a concatenation of all Bible books in USFM format #105

Closed DavidHaslam closed 2 years ago

DavidHaslam commented 4 years ago

Just a query.

It might save a step in earlier preprocessing of source texts from a more general format.

Is it possible for u2o.py to process all the Bible books concatenated as a single USFM file?

Or is it an absolute requirement for each Bible book to be in its own USFM file?

adyeths commented 4 years ago

Unless I'm mistaken, USFM requires all books to be in separate files. U2O as it's currently written is not capable of processing all books concatenated into a single file.

DavidHaslam commented 4 years ago

Even so, given that the start of a new book is triggered by the \id ### tag, surely it wouldn’t require a massive structural change to process a single file?

It would obviate always having to split my intermediate USFM output into 66 separate files before trying a module build from the resulting OSIS.

In other words, it could facilitate a convenient shortcut towards the final goal.

It would be nice to have, even though it is not essential and would deviate from a published standard.

alerque commented 4 years ago

It sounds like your intermediate 'USFM' may be the thing that isn't standards compliant here. I would rather see this tool work towards rather than away from the published spec. It should take about 1 line of bash code to loop over an input file and call this separately for each chunk if that's what you really want to do.
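The splitting step described above can also be sketched in Python, the language u2o is written in. This is only an illustration, not part of u2o: the names `split_books` and `write_books` and the `NN-CODE.usfm` file naming are hypothetical, and the sketch assumes every book starts with an `\id` marker at the beginning of a line. Any text before the first `\id` line is dropped.

```python
import re
from pathlib import Path

# Matches an id line at the start of a line and captures the book code
# (for example GEN, EXO, 1CO). Assumption: one id marker per book.
ID_LINE = re.compile(r"^\\id\s+(\S+)", re.MULTILINE)


def split_books(usfm_text):
    """Return a list of (book_code, chunk) pairs, one per id marker."""
    matches = list(ID_LINE.finditer(usfm_text))
    books = []
    for i, m in enumerate(matches):
        # A chunk runs from its own id line up to the next id line (or EOF).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(usfm_text)
        books.append((m.group(1).upper(), usfm_text[m.start():end]))
    return books


def write_books(usfm_text, out_dir):
    """Write each chunk to out_dir as NN-CODE.usfm and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, (code, chunk) in enumerate(split_books(usfm_text), start=1):
        # Hypothetical naming scheme; the number keeps the input order.
        path = out / f"{n:02d}-{code}.usfm"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

The per-book files written this way could then be passed to the converter one at a time. If the concatenated file starts with a byte order mark, reading it with `encoding="utf-8-sig"` keeps the first `\id` line at the start of the text.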

DavidHaslam commented 4 years ago

When one is converting a source text from something that’s only marked with presentation styles, it makes sense to retain the output stages as a single file for the whole Bible.

It’s thus simpler to check for exceptions to the conversion rules. You only have to search or analyse a single file and hence to improve the bespoke conversion until you’re satisfied it covers everything it could possibly cater for.

That’s the general background to my question.

cmahte commented 4 years ago

I've gone over the spec, tested, and queried the keepers of the spec. The \id tag must be the first line of the file, but it does not necessarily have to appear only once.

I do concatenate files and import them into SFM-enabled programs, and with rare exceptions it works. The exceptions are multiple \id tags with the same USFM book ID, which may cause bugs: when the same \id code appears with repeated chapters in a single file, unpredictable results occur.

That is, \id GEN + \c 1 followed by \id GEN + \c 2 imports properly. \id GEN + \c 1 followed by \id GEN + \c 1 also imports, but you don't end up with 2 chapters, and it's not exactly clear which gets priority.

However, as long as the \id + \c doesn't create a duplicate chapter, I haven't had any errors or weird results in Paratext.
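A concatenated file can be checked for this condition before import. The following is a minimal sketch, assuming `\id` and `\c` markers each start a line; the function name `duplicate_chapters` is hypothetical and nothing here comes from Paratext or u2o.

```python
import re
from collections import Counter

# Matches id and c markers at the start of a line and captures the value
# that follows (book code or chapter number). Markers such as \cl or \ide
# are not matched because whitespace must follow "id" or "c" directly.
MARKER = re.compile(r"^\\(id|c)\s+(\S+)", re.MULTILINE)


def duplicate_chapters(usfm_text):
    """Return sorted (book, chapter) pairs that occur more than once."""
    counts = Counter()
    book = None
    for m in MARKER.finditer(usfm_text):
        tag, value = m.group(1), m.group(2)
        if tag == "id":
            # Book codes are compared case-insensitively.
            book = value.upper()
        elif book is not None:
            counts[(book, value)] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)
```

An empty result means no book and chapter combination is repeated, which is the case reported above as importing without problems.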

DavidHaslam commented 4 years ago

It’s cool when software isn’t so rigid as to disallow something that can be a useful way of working, even when it’s not the final product.

adyeths commented 4 years ago

U2O was only ever intended to take valid USFM and convert it to valid OSIS. Anything beyond that is outside the scope of what this program is intended and designed to do. IF multiple books are allowed according to the spec, then I will change the behavior of U2O to allow for this. I will have to see documentation indicating as much. Otherwise, this isn't going to happen.

adyeths commented 2 years ago

I have added a wrapper script for u2o to handle processing of a concatenation of USFM files into a single USFM file.