bible-technology / scripture-burrito

Scripture Burrito Schema & Docs 🌯
http://docs.burrito.bible/
MIT License

USFM Extension to Scripture Text #301

Open jonathanrobie opened 1 year ago

jonathanrobie commented 1 year ago

The USFM / USX Technical Committee would like us to give guidance for defining extensions for Scripture Text.

The Technical Committee wants to define a way to declare conventions for converting visible characters to invisible ones, something that translation teams frequently do. For instance, ZWSP, bidi controls, soft hyphens, hard spaces, and various other kinds of spaces are frequently encoded with visible characters such as ~ or /. The Technical Committee would like to know the best way to define, declare, and publish an extension that allows translation projects to explicitly declare the conventions they use for such purposes.
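As a rough illustration of what "declaring the conventions" could mean, here is a minimal Python sketch that builds a machine-readable declaration from a project's substitution rules. Everything in it is an assumption made for illustration: the `substitutions` key, the field names, the helper names, and the two example rules (`/` for ZWSP, `^` for soft hyphen) are hypothetical and are not defined by USFM, USX, or Scripture Burrito.

```python
import json
import unicodedata


def make_rule(visible: str, codepoint: int) -> dict:
    """Describe one convention: a visible stand-in and the character it means."""
    return {
        "visible": visible,
        "codepoint": f"U+{codepoint:04X}",
        # Look the name up in the Unicode database rather than typing it by hand.
        "name": unicodedata.name(chr(codepoint)),
    }


def build_declaration(pairs) -> dict:
    """Build a declaration, rejecting a stand-in that is declared twice."""
    seen = set()
    rules = []
    for visible, codepoint in pairs:
        if visible in seen:
            raise ValueError(f"stand-in {visible!r} is declared more than once")
        seen.add(visible)
        rules.append(make_rule(visible, codepoint))
    # "substitutions" is a hypothetical key, not part of any published schema.
    return {"substitutions": rules}


# Hypothetical project conventions, for illustration only.
declaration = build_declaration([("/", 0x200B), ("^", 0x00AD)])
print(json.dumps(declaration, indent=2))
```

A file like the one printed here is the kind of thing a project could ship alongside its USFM; the open question in this issue is who defines its format and how it gets published.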

jag3773 commented 1 year ago

One idea here would be to use an x-role to define an ingredient's file that contains this information, see https://docs.burrito.bible/en/latest/schema_docs/role.html?highlight=x-role .

jonathanrobie commented 1 year ago

> One idea here would be to use an x-role to define an ingredient's file that contains this information, see https://docs.burrito.bible/en/latest/schema_docs/role.html?highlight=x-role .

I like that. But the question then becomes how USFM/USX should define, declare, and publish the format for this file. I assume that USFM/USX should do that, but we need to know how.

jonathanrobie commented 1 year ago

The first step for USFM is to define the file format that defines this. We will then discuss whether to support this using role or x-role in Scripture Burrito.

FoolRunning commented 1 year ago

I would like to throw out there that this probably shouldn't be done at all. It would make more sense if the USFM files put inside a SB already had the non-USFM (non-Unicode?) data removed/replaced. Adding in a file that describes how users worked around limitations in the software they were using seems wrong. I would expect the USFM files to be Unicode-ready (i.e. there shouldn't be a need for other software consuming the SB to deal with the limitations of other software). To me, this feels akin to hacked fonts.

Some things like ~ and // I think are actually defined by USFM and should be valid as-is.

jonathanrobie commented 1 year ago

> I would like to throw out there that this probably shouldn't be done at all. It would make more sense if the USFM files put inside a SB already had the non-USFM (non-Unicode?) data removed/replaced. Adding in a file that describes how users worked around limitations in the software they were using seems wrong. I would expect the USFM files to be Unicode-ready (i.e. there shouldn't be a need for other software consuming the SB to deal with the limitations of other software). To me, this feels akin to hacked fonts.
>
> Some things like ~ and // I think are actually defined by USFM and should be valid as-is.

I agree in theory.

In practice, that means that editors like Paratext would have to:

  1. Provide visible characters and ways to type them in, and
  2. Convert them to the appropriate invisible characters when saving (or, at the very latest, when creating a Burrito).

Does that seem like something that is likely to happen if we ask? If not, I think users will keep using these workarounds and they should be defined somewhere.

FoolRunning commented 1 year ago

Well, assuming the application (Paratext in this case) needs to create the file that needs to be in a SB, that means that said application is going to have to have a way to know the information (i.e. right now, it's "understood" by a project team and is probably fixed at publishing time - Paratext currently has no knowledge of these substitutions). This means that during import/export to/from a SB, it could very easily make the substitutions - there isn't a need to provide ways to "see them" in the UI nor does it have to exist on disk outside of a SB in that format.

Basically, if the application needs to generate the file and thus must have the substitution information stored in some form, then it could just as easily make the substitutions with that information when creating the USFM files for the SB. Just my 2¢.
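To make the import/export idea concrete, here is a minimal Python sketch of substituting at the SB boundary, so that the burrito only ever contains real Unicode characters while the working project keeps its visible stand-ins. It is a sketch under stated assumptions, not how Paratext or any other editor works: the function names and the two stand-ins (`/` for ZWSP, `^` for soft hyphen) are hypothetical. A doubled `//` is deliberately left untouched, following the comment above that `//` is defined by USFM itself.

```python
import re

# Hypothetical project conventions: visible stand-in -> invisible character.
STAND_INS = {"/": "\u200b", "^": "\u00ad"}

# Reverse table used on import: invisible character -> visible stand-in.
_REVERSE = str.maketrans({inv: vis for vis, inv in STAND_INS.items()})

# Match "^", or a "/" that is not part of a doubled "//".
_SINGLE = re.compile(r"(?<!/)/(?!/)|\^")


def export_for_burrito(project_text: str) -> str:
    """Replace visible stand-ins with real Unicode characters on export."""
    return _SINGLE.sub(lambda m: STAND_INS[m.group(0)], project_text)


def import_from_burrito(burrito_text: str) -> str:
    """Restore the project's visible stand-ins when importing from an SB."""
    return burrito_text.translate(_REVERSE)


working = "\\v 1 in/the be^gin//ning"
exported = export_for_burrito(working)
restored = import_from_burrito(exported)
```

The round trip is only lossless if the working text contains none of the invisible characters to begin with; a project that mixes both forms would come back from import with stand-ins everywhere.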