bible-technology / scripture-burrito

Scripture Burrito Schema & Docs 🌯

http://docs.burrito.bible/

MIT License

First iteration of a vocabulary for recipe specs #162

Closed mvahowe closed 4 years ago

FoolRunning commented 4 years ago

I thought these would be much simpler (like a few booleans saying whether there is any common processing to be done - like removing footnotes). I feel like this is a new programming language to define recipe specs - which all implementers of SB will have to implement. 😣

What's the use-case for something this complex?

mvahowe commented 4 years ago

It's still missing a few features :-)

mvahowe commented 4 years ago

@FoolRunning I'm working on some explanatory notes. But the shortest answer is probably to say "What does the simpler solution that does everything everyone is going to want, in a portable way, look like?"

Booleans for processing USX are definitely part of what is needed, but they only cover processing individual USX files that don't change scope in the process. Scenarios that have come up include:

- stripping down files like MAT.sfm into smaller files that contain arbitrary selections of BCV content (at which point the name MAT.sfm no longer works, especially if there are two arbitrary selections from MAT to be used in different places within the recipe)
- building a potentially deep tree for the recipe. (PT doesn't currently do this because PT never upgraded its user model when DBL moved to metadata 2.x.)
- in the case of generic recipeSpecs, handling all sorts of scenarios like books that are missing, books that are present but with corresponding introductions that are missing, headings that should depend on how much of the content is there, levels of nesting that should depend on how much of the content is there, localization fallback scenarios...
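
To make the last scenario concrete, here is a minimal Python sketch of a "generic" step that has to cope with missing books and missing introductions. The `build_sections` helper, the file names, and the shape of the returned sections are all hypothetical; they are not part of the proposal and only illustrate why a fixed set of booleans runs out of road.

```python
def build_sections(wanted_books, available):
    """Build one section per book that is present, adding its intro only if it exists.

    `wanted_books` is an ordered list of book codes; `available` is a set of
    ingredient names. Both naming conventions are invented for this sketch.
    """
    sections = []
    for book in wanted_books:
        content = f"{book}.usx"
        if content not in available:
            continue  # book is missing: emit no section at all
        elements = []
        intro = f"{book}_intro.usx"
        if intro in available:
            elements.append(intro)  # intro is optional
        elements.append(content)
        sections.append({"role": book, "elements": elements})
    return sections


sections = build_sections(
    ["MAT", "MRK", "LUK"],
    {"MAT.usx", "MAT_intro.usx", "LUK.usx"},
)
```

Even this toy version needs a loop and two conditionals, and it does not yet touch headings, nesting depth, or localization fallback.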

We may be able to come up with something simpler. But the main expressed fear to date has been that "recipes won't do everything I want". I feel that I have addressed this concern :-) Also, if we ship something simple that doesn't do what people need, those people have to do the equivalent of recipeSpecs some other way, at which point we don't have a standard at all.

I do suspect that we could come up with dietRecipeSpecs at some point. But I think it's easier to work back from a rich solution to something optionally simpler than to retrofit rich functionality to a set of booleans.

mvahowe commented 4 years ago

It may help to show how Nathanael currently does some of what recipeSpecs are intended to do.

This is the code that sets up metadata for a SL video entry (so roughly equivalent to my assembleRecipe step here). That code is about to get a lot more convoluted so that we can handle chronological translations as well as BCV ones.

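import re

from lxml import etree  # assumed: the .xpath() calls below match lxml's API

# Wizard and BcvReference come from the surrounding project and are not shown here.


class VideoWizard(Wizard):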
    """
    A wizard for introspecting video bundles.
    The wizard looks for mp4 and/or mpg files, either as a flat directory or organized by directory named
    after PT books (GEN, EXO etc). The filename in both cases contains the book name plus the chapter/verse
    range, ie it starts with GEN or GEN_1 or GEN_1_2-5 or GEN_1_5-2_3.
    To attach names to files, add a name in the metadata names section with an id of
    section-bok_ch_v-v
    It may be possible to subclass video wizards to handle other naming schemes by overriding the regexes in
    __init__ and the _book_from_uri() method.
    """

    wide_video_regex = None
    range_regex = None
    narrow_video_regex = None

    def __init__(self):
        super(VideoWizard, self).__init__(
            "video",
            "Video bundle",
            "video"
        )
        self.wide_video_regex = "\\.mp[4g]$"
        self.range_regex = "(\\d+)(_(\\d+)(-(\\d+)(_(\\d+))?)?)?"
        self.video_reference = "([A-Z1-5]{3})(_(" + self.range_regex + "))?"
        self.narrow_video_reference_regex = "^([A-Z1-5][A-Z1-5][A-Z1-5]/)?{0}(\\+{0})*\\.mp[4g]$".format(self.video_reference)
        self.narrow_video_peripheral_regex = "^(peripherals/credits)|(peripherals/(sign|concept|place|name)(_\\S+))\\.mp[4g]$"
        self.narrow_video_regex = "({0})|({1})".format(self.narrow_video_reference_regex, self.narrow_video_peripheral_regex)
        # print(self.narrow_video_regex)

    def _book_from_uri(self, uri):
        """
        Returns book name from uri
        """
        if "credits" in uri or "concept" in uri or "place" in uri or "sign" in uri or "name" in uri:
            return None
        else:
            return uri[:3]

    def _role_from_uri(self, uri):
        """
        Returns role from uri
        """
        bcv = BcvReference(path=uri)
        return bcv.reference_string()

    def assess(self, bundle_id, pub_id):
        """
        Returns a report on the goodness of fit of the wizard for the bundle and publication.
        """
        structure, metadata_dom, names, resources, canon_spec = self.collect_assess_data(bundle_id, pub_id)
        record = {
            "wizard": self.name,
            "uri": "",
            "hits": [],
            "misses": []
        }
        for resource in [r for r in resources.values() if re.search(self.wide_video_regex, r["uri"])]:
            try:
                resource_book = self._book_from_uri(resource["uri"])
            except Exception:
                record["misses"] += [resource["uri"]]
                continue
            if resource_book and resource_book in canon_spec["books"] and re.search(self.narrow_video_regex, resource["uri"]):
                record["hits"] += [resource["uri"]]
            elif re.search(self.narrow_video_regex, resource["uri"]):
                record["hits"] += [resource["uri"]]
            else:
                record["misses"] += [resource["uri"]]
        return [record]

    def perform(self, bundle_id, pub_id, wizard, url):
        """
        Creates the structure for the bundle, based on the data at the url.
        """

        def _resources_by_book(resources):
            books = {}
            peripherals = []
            for resource in resources:
                resource_book = self._book_from_uri(resource["uri"])
                if resource_book is None:
                    # Peripherals have no book, so keep them out of the per-book map.
                    peripherals.append(resource)
                    continue
                if resource_book not in books:
                    books[resource_book] = []
                books[resource_book] += [resource]
            for book in books:
                books[book].sort(key=lambda x: x["uri"])
            return (books, peripherals)

        def _canonical_books(resources):
            books = set()
            for resource in resources:
                try:
                    bcv = BcvReference(path=resource["uri"])
                except Exception:
                    continue
                for book in bcv.canonical_books():
                    books.add(book)
            return [b for b in books]

        def _peripheral_role(uri):
            peripheral_source = uri.split(".")[0].split("/")[-1]
            return re.sub("_", " ", peripheral_source)

        def _container_label(resources):
            found_mpg = False
            found_mp4 = False
            for resource in resources:
                try:
                    _, suffix = resource["uri"].split(".")
                except Exception:
                    continue
                if suffix == "mpg":
                    found_mpg = True
                elif suffix == "mp4":
                    found_mp4 = True
            if found_mpg and found_mp4:
                return "mpg+mp4"
            elif found_mpg:
                return "mpg"
            else:
                return "MP4"

        self.check_bundle_medium(bundle_id)
        structure, metadata_dom, names, resources, canon_spec = self.collect_assess_data(bundle_id, pub_id)
        self.remove_existing_publication_structure(metadata_dom, pub_id)
        publication_element = metadata_dom.xpath("/DBLMetadata/publications/publication[@id='{0}']".format(pub_id))[0]
        structure_element = publication_element.xpath("structure")[0]
        canonical_element = publication_element.xpath("canonicalContent")[0]
        matching_resources = self.matching_resources(resources, url, self.narrow_video_regex)
        format_container_elements = metadata_dom.xpath("/DBLMetadata/format/container")
        if len(format_container_elements) > 0:
            format_container_element = format_container_elements[0]
        else:
            format_container_element = etree.Element("container")
            metadata_dom.xpath("/DBLMetadata/format")[0].append(format_container_element)
        format_container_element.text = _container_label(matching_resources)
        resources_by_book, peripherals = _resources_by_book(matching_resources)
        for book in canon_spec["books"]:
            if book in resources_by_book:
                if self.name_for_role(names, book) is None:
                    metadata_dom.xpath("/DBLMetadata/names")[0].append(self.new_name_element(book))
                book_division = self.new_division_element(
                    role=book,
                    name="book-{0}".format(book.lower())
                )
                structure_element.append(book_division)
                for resource in resources_by_book[book]:
                    resource_role = self._role_from_uri(resource["uri"])
                    new_element =\
                        self.new_content_element(
                            src=resource["uri"],
                            role=resource_role
                        )
                    if self.name_for_role(names, resource_role) is not None:
                        new_element.attrib["name"] = self.role_to_section_name_id(resource_role)
                    book_division.append(new_element)
        for book in _canonical_books(matching_resources):
            canonical_element.append(self.new_canonical_book_element(book))
        if len(peripherals):
            peripheral_element = self.new_division_element(role="peripherals", name="peripherals")
            peripherals_label = metadata_dom.xpath("/DBLMetadata/names/name[@id='peripherals']")
            if len(peripherals_label) == 0:
                name_element = etree.Element("name")
                name_element.set("id", "peripherals")
                short_name_element = etree.Element("short")
                short_name_element.text = "Peripherals"
                name_element.append(short_name_element)
                metadata_dom.xpath("/DBLMetadata/names")[0].append(name_element)
            structure_element.append(peripheral_element)
            for peripheral in peripherals:
                peripheral_id = re.sub(" ", "-", _peripheral_role(peripheral["uri"]))
                new_element =\
                    self.new_content_element(
                        src=peripheral["uri"],
                        role=_peripheral_role(peripheral["uri"]),
                        name=peripheral_id
                    )
                peripheral_element.append(new_element)
                existing_peripheral_label = metadata_dom.xpath("/DBLMetadata/names/name[@id='{0}']".format(peripheral_id))
                if len(existing_peripheral_label) == 0:
                    name_element = etree.Element("name")
                    name_element.set("id", peripheral_id)
                    short_name_element = etree.Element("short")
                    short_name_element.text = peripheral_id
                    name_element.append(short_name_element)
                    metadata_dom.xpath("/DBLMetadata/names")[0].append(name_element)
        self.storer.write_bundle_metadata(bundle_id, etree.tostring(metadata_dom))
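
To show how much is packed into the file-naming convention alone, here is a small standalone sketch that reuses the two pattern strings from `__init__` above to pull a reference out of a file stem. The `parse_reference` helper and the way it interprets the third and fourth numbers are assumptions based on the examples in the class docstring, not code from the real system.

```python
import re

# Pattern strings copied from VideoWizard.__init__ above.
RANGE_REGEX = "(\\d+)(_(\\d+)(-(\\d+)(_(\\d+))?)?)?"
VIDEO_REFERENCE = "([A-Z1-5]{3})(_(" + RANGE_REGEX + "))?"


def parse_reference(stem):
    """Split a file stem such as 'GEN_1_5-2_3' into its parts (hypothetical helper)."""
    match = re.fullmatch(VIDEO_REFERENCE, stem)
    if match is None:
        return None
    # Groups 4, 6, 8 and 10 are the four numeric captures inside RANGE_REGEX.
    book, first, second, third, fourth = match.group(1, 4, 6, 8, 10)
    reference = {"book": book}
    if first is not None:
        reference["start_chapter"] = int(first)
    if second is not None:
        reference["start_verse"] = int(second)
    if third is not None:
        # Assumption: 'a_b-c' ends at verse c of chapter a,
        # while 'a_b-c_d' ends at chapter c, verse d.
        reference["end_chapter"] = int(third) if fourth is not None else int(first)
        reference["end_verse"] = int(fourth) if fourth is not None else int(third)
    return reference


whole_book = parse_reference("GEN")
verse_range = parse_reference("GEN_1_2-5")
cross_chapter = parse_reference("GEN_1_5-2_3")
```

Every uploader has to follow this convention exactly for the wizard to work, which is the kind of implicit assumption a recipeSpec would need to state explicitly.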

mvahowe commented 4 years ago

And this is a Nathanael "mapper" that rewrites Scripture App Builder-style timing files into XML. (It's roughly equivalent to the processIngredient and copyIngredient aspects of my proposal. There's a corresponding output mapper to turn that into JSON for another partner.)

class TsvTimingInputMapper(InputMapper):
    """
    Maps TSV audio timing files to DBL XML format and directory structure.
    The name of the input files should end with '-timing.txt' and contain
    3-column tab-separated value output from SAB.
    """

    match_regex = None

    def __init__(self):
        super().__init__(
            "tsv_timing",
            "audio",
            "Timing Files from SAB in TSV Format"
        )
        self.match_regex = r"([A-Z1-5][A-Z]{2})-(\d{1,3})-timing\.txt$"

    def matches_uri(self, uri):
        return re.search(self.match_regex, uri)

    def book_chapter(self, uri):
        matches = re.search(self.match_regex, uri)
        if matches is None:
            return None
        book_code = matches.group(1)
        chapter = str(int(matches.group(2)))
        return (book_code, chapter)

    def mapped_uri(self, uri):
        try:
            book_code, chapter = self.book_chapter(uri)
        except Exception as exc:
            self.storer.add_event(("error", "input_mapper", "tsv_timing", uri))
            return None
        return "release/timing/{0}/{0}_{1}-timing.xml".format(book_code, chapter.zfill(3))

    def mapped_content(self, uri, content):
        try:
            book_code, chapter = self.book_chapter(uri)
        except Exception as exc:
            self.storer.add_event(("error", "input_mapper", "tsv_timing", uri, str(exc)))
            return None
        try:
            if not isinstance(content, str):
                content = content.decode()
            timings_dom = etree.Element("timings")
            timings_dom.set("book", book_code)
            timings_dom.set("chapter", chapter)
            for line in re.split(r"[\n\r]+", content):
                if not re.search("\\S", line):
                    continue
                start_time, end_time, ref = line.split("\t")
                timing_element = etree.Element("timing")
                timing_element.set("start", re.sub("[^\\d.]", "", start_time))
                timing_element.set("end", re.sub("[^\\d.]", "", end_time))
                timing_element.set("canonical", "false" if ref.startswith("s") else "true")
                if not ref.startswith("s"):
                    timing_element.text = ref
                timings_dom.append(timing_element)
        except Exception as exc:
            self.storer.add_event(("error", "input_mapper", "tsv_timing", uri, str(exc)))
            return None
        return etree.tostring(timings_dom)
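
For readers who want to try the transformation without the surrounding framework, here is a self-contained sketch of the same line-by-line logic as `mapped_content`, using only the standard library. The `timings_to_xml` function name and the two sample lines are invented for illustration; real SAB output may differ in detail.

```python
import re
import xml.etree.ElementTree as ET


def timings_to_xml(book_code, chapter, content):
    """Convert 3-column tab-separated timing lines into a <timings> document."""
    root = ET.Element("timings")
    root.set("book", book_code)
    root.set("chapter", chapter)
    for line in re.split(r"[\n\r]+", content):
        if not line.strip():
            continue  # skip blank lines
        start_time, end_time, ref = line.split("\t")
        timing = ET.SubElement(root, "timing")
        # Keep only digits and dots, as the mapper above does.
        timing.set("start", re.sub(r"[^\d.]", "", start_time))
        timing.set("end", re.sub(r"[^\d.]", "", end_time))
        is_heading = ref.startswith("s")  # refs starting with "s" are treated as non-canonical
        timing.set("canonical", "false" if is_heading else "true")
        if not is_heading:
            timing.text = ref
        # Heading entries get no text content, matching the original code.
    return ET.tostring(root, encoding="unicode")


# Invented sample: one section heading followed by verse 1.
sample = "0.00\t2.50\ts1\n2.50\t7.10\t1\n"
xml_text = timings_to_xml("MAT", "1", sample)
```

The mapping itself is short; most of the original class deals with where the file came from and where the result should go, which is the part a recipeSpec needs to describe.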

mvahowe commented 4 years ago

None of the above is rocket science, but there is quite a lot of detail and, in addition, that code is quite fragile insofar as it makes lots of assumptions about how users provide their content in the first place. I think we need a spec that lets us describe that level of detail. Otherwise we end up with every recipe consisting of one proprietary black box.

mvahowe commented 4 years ago

Some explanation of choices here: https://github.com/bible-technology/scripture-burrito/blob/43420d7bb9aa35a51cad79d0140bd66c9cd1b06e/docs/recipes/recipeSpec_proposal.md

mvahowe commented 4 years ago

Relates to #157

mvahowe commented 4 years ago

I'm still looking for a command line JSON formatter with some common sense but, via the online Prettier formatter, here are some of the more useful bits of my current example document:

        [
          "writeIngredient",
          ["getq", "jsonVariable"],
          ["lit", "processing"],
          ["lit", "/peripherals/versification.json"]
        ],
        [
          "copyIngredient",
          ["lit", "source"],
          ["lit", "/peripherals/versification.json"],
          ["lit", "derived"],
          ["lit", "/peripherals/source_versification.json"]
        ],
        [
          "processIngredient",
          ["lit", "/peripherals/versification.json"],
          ["lit", "versificationJsonToVrs"],
          ["lit", 0.3],
          ["lit", "/peripherals/source_versification.json"],
          [
            "json",
            {
              "addMissingPTBooks": true
            }
          ]
        ],
        ["copyName", ["lit", "book-mat"]],
        [
          "setq",
          "shortMAT",
          ["readName", ["lit", "book-mat"], ["lit", "short"]]
        ],
        [
          "writeName",
          ["lit", "book-mat"],
          ["lit", "short"],
          [
            "lit",
            {
              "en": "Matthew",
              "fr": "Mathieu"
            }
          ]
        ]
      ],
        [
          "setq",
          "mySection",
          ["newRecipeSection", ["lit", "/"], ["lit", "nt"]]
        ],
        [
          "setq",
          "mySectionElements",
          ["lit", "mat"],
          ["newRecipeElements", ["getq", "mySection"], ["lit", "mat"]]
        ],
        [
          "newRecipeElement",
          ["getq", "mySectionElements"],
          ["lit", "/release/MAT.mp3"]
        ],
        [
          "newRecipeElement",
          ["getq", "mySectionElements"],
          ["lit", "/release/MAT.timing.json"]
        ],
        ["message", ["lit", "Now delete object key just because"], ["lit", 4]],
        ["delq", "mySectionElements", ["lit", "mat"]]
      ]
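
To give a feel for what implementing this vocabulary involves, here is a toy Python evaluator for a handful of the operations that appear in the excerpt: `lit`, `getq`, `setq`, `delq` and `message`. The semantics are guesses from the example alone (for instance, that extra arguments to `setq` and `delq` form a key path into an object, and that the second argument to `message` is a level); the real proposal may define them differently, and the ingredient and recipe operations are left out entirely.

```python
def evaluate(expr, env, log):
    """Evaluate one array-based expression against a variable store (sketch only)."""
    op, *args = expr
    if op == "lit":
        return args[0]  # literal value, returned unchanged
    if op == "getq":
        return env[args[0]]  # variable names are bare strings in the excerpt
    if op == "setq":
        # Assumed form: ["setq", name, <optional key expressions...>, valueExpr]
        name, *path, value_expr = args
        value = evaluate(value_expr, env, log)
        keys = [evaluate(p, env, log) for p in path]
        if not keys:
            env[name] = value
        else:
            target = env.setdefault(name, {})
            for key in keys[:-1]:
                target = target.setdefault(key, {})
            target[keys[-1]] = value
        return value
    if op == "delq":
        # Assumed form: ["delq", name, <optional key expressions...>]
        name, *path = args
        keys = [evaluate(p, env, log) for p in path]
        if not keys:
            del env[name]
        else:
            target = env[name]
            for key in keys[:-1]:
                target = target[key]
            del target[keys[-1]]
        return None
    if op == "message":
        text, level = (evaluate(a, env, log) for a in args)
        log.append((level, text))
        return None
    raise ValueError(f"unknown operation: {op!r}")


program = [
    ["setq", "greeting", ["lit", "hello"]],
    ["setq", "elements", ["lit", "mat"], ["getq", "greeting"]],
    ["message", ["lit", "Now delete object key just because"], ["lit", 4]],
    ["delq", "elements", ["lit", "mat"]],
]
env, log = {}, []
for step in program:
    evaluate(step, env, log)
```

The core dispatch is small; the bulk of an implementation would sit in the domain operations such as `copyIngredient` and `newRecipeSection`.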

mvahowe commented 4 years ago

I mocked up an identity transform here:

https://github.com/bible-technology/scripture-burrito/blob/1fa812d8f1c1ad20a5f46b521da2a85e35912afc/docs/examples/artifacts/ingredients/recipe_spec/identity_variant.json

See the "one long line because this is JSON" inline comments.

I think one thing this demonstrates is that adopting an array-based syntax would make everything far, far more concise. But before I start tweaking syntax (given that I don't expect anyone to edit this format directly anyway), does this give a better idea of the kind of problem we are trying to solve, and why we do need some control structures to make recipeSpecs at all useful?

mvahowe commented 4 years ago

I'm going to merge this into a new branch called develop_recipe_spec and do another round of experimentation.