[RFC0028] Make OP-Data more accessible

Work Planning

Details

- [Housekeeping](#housekeeping)
- [Named Concepts](#named-concepts)
- [Summary](#summary)
- [Reference-Level Explanation](#reference-level-explanation)
- [Alternatives](#alternatives)
  * [Rationale](#rationale)
- [Drawbacks](#drawbacks)
- [Useful References](#useful-references)
- [Unresolved questions](#unresolved-questions)
- [Parts of the system affected](#parts-of-the-system-affected)
- [Future possibilities](#future-possibilities)
- [Infrastructure](#infrastructure)
- [Testing](#testing)
- [Documentation](#documentation)
- [Version History](#version-history)
- [Recordings](#recordings)
- [Work Phases](#work-phases)

Housekeeping

ALL BELOW FIELDS ARE REQUIRED

Named Concepts

- OP formats - OP formats are a small set of formats that aim to represent any text and markup in their simplest possible form. They are an in-house, single common format offering a lot of flexibility and tremendous power. `.opf` files encode text and markup as plain text + offset annotations; `.opa` encodes the "inner" relationship between the content of several files; `.opc` encodes the "outer" relationship of several texts. Files in OP formats are generated by converting any file format into the common/single formats. The OP formats are the equivalent of the extended .md format used by conversion tools like pandoc (bardo, middleman, sandwich). OP formats always include the source files (as a GitHub release) and the link to the version of the script used to generate them.
- Views - representations of the OP-Data in formats relevant to end-users. These are typically inline markup formats such as html, .docx etc. Views can be included in any of the OpenPecha repositories outside of the .opx folders. Views can include any number of layers. Views may include a metadata file with a link to the script used to generate them.
- `.opf` - the format of OpenPecha files. Opfs consist of a folder with the extension `.opf` containing 3 basic components:
  1. one or more base texts in `.txt` format;
  2. metadata **describing the text** in a `.yml` file;
  3. annotations **describing the content of the text** in `.yml` layer files (1 file per annotation type).

  `.opf` files come in 3 flavors: initial pechas prefixed with an I#, diplomatic pechas prefixed with a D#, open pechas prefixed with an O#.
- `.opa` - the format of OpenPecha alignments, with an A# prefix in their names. These files consist of a folder with the extension `.opa` containing 2 basic components:
  1. an alignment `.yml` file mapping the connection between the content of two or more `.opf`s (i.e. phrase level mapping of a text and its translation, root text and its commentary, phrases and A/V);
  2. metadata **describing the text** in a `.yml` file.
- `.opc` - the format for OpenPecha collections. Prefixed with a C#. These files consist of a `.opc` folder containing:
  1. a metadata `.yml` file;
  2. a collection index `.yml` file detailing the text spans, the layers, and the views making up the collection.

  The opc repos usually contain one or more views. Note: the current collection
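The idea of "plain text + offset annotations" described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Span` type, the layer dictionary keys, and the sample text are hypothetical and are not the actual `.opf` layer schema. It only shows how an annotation can point into an unmodified base text, and how a view with inline markup can be produced from that pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A character range in the base text: start inclusive, end exclusive."""
    start: int
    end: int


# Hypothetical base text; in an .opf this would live in a .txt file.
BASE_TEXT = "Title line\nFirst verse of the text.\n"

# Hypothetical layer (one annotation type); the keys are illustrative only.
TITLE_LAYER = {
    "annotation_type": "title",
    "annotations": [Span(start=0, end=10)],
}


def annotated_text(base: str, span: Span) -> str:
    """Return the slice of the base text that a span covers."""
    return base[span.start:span.end]


def to_inline_markup(base: str, layer: dict, tag: str) -> str:
    """Serialize one layer into inline markup, a toy 'view' of the data.

    Assumes the spans in the layer do not overlap.
    """
    out, cursor = [], 0
    for span in sorted(layer["annotations"], key=lambda s: s.start):
        out.append(base[cursor:span.start])  # untouched text before the span
        out.append(f"<{tag}>{base[span.start:span.end]}</{tag}>")
        cursor = span.end
    out.append(base[cursor:])  # untouched text after the last span
    return "".join(out)
```

Because the base text is never modified, any number of layers can describe the same text independently, and a view only has to choose which layers to serialize.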

Summary

This project will document OP Data so that people can easily understand the OPF format and what is in the OP Data repos. This will help them easily scan any repo, get what they need and use the data. Per the RFW, this project has two audiences: those who want to understand what OP Data is and those who want to download datasets. The first group might include patrons or supporters who won't use the data but want to understand it before funding projects or signing off on projects. They might also be decision-makers at organizations that will use the data, like CEOs of publishing companies, whose programmers will use the data. The second group includes people who will download datasets, including publishers and academics, as well as software developers, who will use the data to train AI models.

Reference-Level Explanation

Proposed updates to the **OP website**:

- **Pecha Data landing page**: introducing the datasets to newcomers and guiding users to the data they came for
- **Reference page** with the following sections:
  - OPF format reference material: folder structure, annotations, file formats, conventions, etc.
  - All annotations, kinds, format, etc.
  - All parsers (any --> op formats)
  - All serializers (op formats --> any)
  - Getting started guide for a typical use case that involves looking for a text and creating a view using the serializers.
- **Data page** with a search bar that accesses all OP-Data repos in GitHub so users can search for individual etexts/repos. This should be doable using the GitHub API. For example, searching for `org:OpenPecha-Data "current data" in:readme` in the GitHub search bar returns the only repo that contains the string `current data` in its README, which is the Collections repo. Test it here: https://github.com/search?q=org%3AOpenPecha-Data+%22current+data%22+in%3Areadme. If the READMEs contain the info in the proposed README template below, users could search for etexts by title, BDRC IDs, etc. right on the OP website without devs having to make changes when the repos change.

Proposed updates to the **OP Data GitHub repo templates** (to be added to all repos):

- Informative and consistent READMEs for all OpenPecha Data
- Automatic update of specific repo data when the repo is created or changed using GitHub Actions
  - Tibetan title unicode and Wylie
  - OP ID
  - BDRC version IDs
  - Link to version and/or work on BDRC page
  - List of layers
  - Views?
- Short boilerplate sections about OP data in general, with links to the OP website for reference added to repo template

**Known challenges**

- Creating a search bar: I will need a dev's help creating the search bar.
- Adding READMEs to 13,000 repos:
  - Will need to use GitHub Actions and scrape info from the `meta.yml` files within each repo to populate the READMEs with the information above (title, OP ID, BDRC ID, etc.)
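As a rough sketch of the README generation step, the function below fills a template from metadata that has already been loaded into a dictionary. The key names (`title`, `title_wylie`, `op_id`, `bdrc_id`, `layers`), the template layout, and the sample values are all assumptions for illustration; the real field names in each repo's metadata file would need to be checked, and parsing the YAML and committing the result from a GitHub Actions workflow are left out.

```python
# Hypothetical README layout; the fields mirror the list proposed above.
README_TEMPLATE = """\
# {title}

| Field | Value |
| --- | --- |
| Title (Wylie) | {title_wylie} |
| OP ID | {op_id} |
| BDRC version ID | {bdrc_id} |
| Layers | {layers} |

For background on OP Data, see the OP website.
"""


def render_readme(meta: dict) -> str:
    """Fill the README template from one repo's metadata.

    Missing fields are rendered as 'unknown' so that a gap in the
    metadata never blocks README generation.
    """
    fields = {
        "title": meta.get("title", "unknown"),
        "title_wylie": meta.get("title_wylie", "unknown"),
        "op_id": meta.get("op_id", "unknown"),
        "bdrc_id": meta.get("bdrc_id", "unknown"),
        "layers": ", ".join(meta.get("layers") or []) or "none",
    }
    return README_TEMPLATE.format(**fields)


if __name__ == "__main__":
    # Placeholder values only; not a real pecha.
    print(render_readme({"op_id": "O-EXAMPLE", "layers": ["title", "chapter"]}))
```

Keeping the rendering step a pure function of the metadata makes it easy to rerun across all repos whenever the template changes.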

Alternatives

One alternative is to put the catalog of etexts/repos in a markdown table on the website so people could search on the page. This would be unwieldy: there are too many rows for sorting to be useful, and the table would need to be updated whenever repos change, most likely by hand.

Rationale

Directly searching the READMEs in the repos or the `meta.yml` files would be easier since the results would always be up to date. It also wouldn't require users to scan a large table.

Drawbacks

The main drawback is that adding READMEs to existing repos and creating a search bar that searches OP-Data repos might take some dev time.

Useful References

*Describe useful parallels and learnings from other requests, or work in previous projects.*

https://diataxis.fr

This shows the rationale of keeping different sections of the documentation separate, e.g., reference material, getting started guides, etc. so that users can easily find it.

https://docs.github.com/en/rest/search?apiVersion=2022-11-28#about-search

The GitHub API docs on search:

> A query can contain any combination of search qualifiers supported on GitHub. The format of the search query is:

```
SEARCH_KEYWORD_1 SEARCH_KEYWORD_N QUALIFIER_1 QUALIFIER_N
```

> For example, if you wanted to search for all _repositories_ owned by `defunkt` that contained the word `GitHub` and `Octocat` in the README file, you would use the following query with the _search repositories_ endpoint:

```
GitHub Octocat in:readme user:defunkt
```
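The query format quoted above could be used from the website roughly as follows. This is a minimal sketch, not a finished search bar: the helper names `build_search_url` and `repo_names` are hypothetical, and it only builds the request URL for the search repositories endpoint and reads repo names out of a response that has already been fetched and decoded. Sending the request, authentication, and rate-limit handling are left out.

```python
from urllib.parse import urlencode

# GitHub REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(keywords: str, org: str = "OpenPecha-Data", per_page: int = 20) -> str:
    """Build a URL that searches READMEs of one organization's repos.

    The keywords are searched as an exact phrase; any double quotes the
    user typed are dropped so they cannot break the phrase.
    """
    phrase = '"' + keywords.replace('"', "") + '"'
    query = f"{phrase} in:readme org:{org}"
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'per_page': per_page})}"


def repo_names(response: dict) -> list:
    """Extract full repo names ('owner/repo') from a decoded search response."""
    return [item["full_name"] for item in response.get("items", [])]


if __name__ == "__main__":
    print(build_search_url("current data"))
```

Searching READMEs this way only works as well as the READMEs themselves, which is why the consistent README template matters for the search bar.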

Unresolved Questions

Whether it is possible to add the GitHub search bar to mkdocs.

Parts of the System Affected

This project isn't software, so it won't affect any software system, but it will affect the OpenPecha-Data section of the OP website.

Future possibilities

This section of the documentation will need to be updated and expanded when there are changes to the data so the documentation remains up-to-date and accurate. For example:

- When datasets change
- When new datasets are added
- When changes are made to the OPF format
- Etc.

Infrastructure

There may need to be a change to the website to accommodate a database search.

Testing

Stakeholders read the section on OP-Data, check their understanding, and provide feedback.

Documentation

This is a documentation request, so the result itself will be the needed documentation.

Version History

v.1

Recordings

None

Work Phases

Research

[ ] Interview stakeholders to fill in an empathy map to better understand our audience
[ ] Create a persona for newcomers/supporters
[ ] Create a persona for data users
[ ] Identify the zero and hero state of each audience
[ ] Create a zero to hero user journey map (aka customer journey map) for each of the two target audiences

Planning

[ ] RFC completed on: Estimated time: Actual time:
[ ] RFC reviewed and approved by: Estimated time: Actual time:

Implementation

[ ] Identify the journey steps that can be cut out or improved
[ ] Identify the improvements (Better access points? Clearer information? More documentation?)
[ ] Improve the journey with better data access points and documentation where it is needed
[ ] Test the improvements with users

Completion

[ ] Documentation approved @ngawangtrinley Estimated time: Actual time:
