Lotus-King-Research / Requests

Common repository for RFCs

RFC0003: Padma-Translate | Human Assisted Translation for Tibetan #3

Closed mikkokotila closed 1 year ago

mikkokotila commented 2 years ago

What is it?

Human-Assisted Translation for Tibetan (Dharma texts).

Named Concepts

Source Language is the language of the material to be translated (i.e. Tibetan). Source Language will be used interchangeably with Tibetan.

Target Language is the language into which the source material is to be translated (e.g. English).

Source Text is the body of text to be translated, in its Source Language form.

Target Text is the body of the text to be translated, in its Target Language form.

Source Segment is the segment of the Source Text, in its Source Language form.

Target Segment is the segment of the Target Text, in its Target Language form.

Source Phrase is a phrase within the Source Segment, in its Source Language form.

Target Phrase is a phrase within the Target Segment, in its Target Language form.

Source Word is a word within the Source Segment, in its Source Language form.

Target Word is a word within the Target Segment, in its Target Language form.

Translation Memory is a custom dictionary where each entry consists of a key (a phrase in Tibetan) and a value (translation of the phrase in a given Target Language).

Term Base is a custom dictionary where each entry consists of a key (a word in Tibetan) and a value (translation of the word in a given Target Language).

Project is the entity that constitutes the translation of a single Source Text.

Project Comment is a comment specific to a given project. Project Comment will be used interchangeably with Comment, except if Comment is the first word of the sentence.

How would it work?

The high-level workings of the system can be described in a sequence of events:

1) Start a new project or continue an already started one
2) System will break Source Text into Source Segments
3) System will use Source Segments to create Target Segments
4) System will attempt to replace Source Phrases with Target Phrases in Target Segments
5) System will attempt to replace Source Words with Target Words in Target Segments

When starting a new project, the user is at this stage presented with a graphical user interface in which the screen is split equally into three columns: Source column, Target column, and Meta column (which shows Comments by default). The functionality is described below in terms of the View and each column individually.

In the case of continuing an already started project, the user is directly led to the current state of the project.

Views

There is only one view, the Main View described below.

Main View

In addition to the three columns - Source, Target, and Meta - the main view will include several toggles. A very rough sketch of the interface is provided below.

[Image: Screenshot 2022-01-09 at 17 10 30]

Each of the three columns can be minimized, until only one column remains visible. Columns can also be resized in width. In both cases, the other columns adjust in size automatically so that the whole screen is always filled horizontally.

Toggles

Source Column

Source Column has a selection for switching between different versions of the Source Text when available.

Area

The area is defined so that one or more consecutive Source Segments can be selected in the Source Column. When more than one Source Segment is selected, additional options become available in the Context menu. These options are always visible in the Context menu but grayed out unless more than one Source Segment is selected. They are highlighted with * in the Context section below.

Content

The content of the Source Column consists of the Source Text broken down into Source Segments.

Context

The context menu can be activated once any part (or the whole) of the text of the Source Segment is selected.

Hover

Dictionary results are shown on hover when the toggle for showing dictionary results on hover is on.

Styling

Upon hovering the mouse over the leftmost part of the section, a circle icon will appear; clicking it will show the available styling options.

Target Column

Target Column has a selection for switching between different Target Languages when available.

Area

The area is defined so that one or more consecutive Target Segments can be selected in the Target Column. When more than one Target Segment is selected, additional options become available in the Context menu. These options are always visible in the Context menu but grayed out unless more than one Target Segment is selected. They are highlighted with * in the Context section below.

Content

The content of the Target Column consists of any mix of Source Text and Target Text (depending on how far it has been translated), corresponding with the Source Segment immediately to its left.

Context

The context menu can be activated once any part (or the whole) of the text of the Target Segment is selected.

Hover

Dictionary results are shown on hover when the toggle for showing dictionary results on hover is on.

Styling

Upon hovering the mouse over the leftmost part of the section, a circle icon will appear; clicking it will show the available styling options.

Meta Column

Meta Column has a selection for switching between Comments and Approvals.

Comments

The default is Comments, where all Project Comments connected with a given segment are shown. This will work similarly to how comments work in Google Docs, including the option to resolve a comment, which closes it.

Approvals

Approvals are for reviewing and managing any system-wide changes (e.g. the Target Word for a given Source Word changes in the underlying Term Base). This will work similarly to how comments work in Google Docs, including the option to resolve an item, which closes it.

Version Control?

Projects are version-controlled at a set time interval or on user prompt.

Version control can be handled via git. As long as the project is stored as line-by-line text, as it appears in the interface, the full power of git and github.com is immediately available. The version control part, including conflict resolution, comparing versions, and so on, can then be handled entirely on github.com. Padma-Translate simply needs to have user-interface functionality which corresponds with:

Data Redundancy?

One of the critical features of the system is to never lose data. There are three layers of procedure:

Questions and Answers

What if there is more than one Target Language word in the Term Base for a given Source Language word?

Then the word will be highlighted and the context menu will offer the options to choose from.

How is the Target Segment Automatically Populated with Target Words and Target Phrases?

The end goal is for all the words in each Target Segment to be automatically completed based on Translation Memory and Term Base. That being the aspirational end-state, the way this actually works can be understood through the outline below.

What the system will do first:

  1. Copy the Source Segment into the Target Segment so both segments are in Source Language
  2. Look for Source Phrases in Translation Memory within the Source Segment
  3. In the case of a match, replace Source Phrase in Target Segment with the corresponding Target Phrase
  4. Repeat steps 2 and 3 with Source Words and Term Base
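The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: the function names are hypothetical, Translation Memory and Term Base are modeled as plain dictionaries mapping a Tibetan key to a list of candidate translations, matching is naive exact substring matching, and only entries with exactly one candidate are replaced automatically.

```python
from typing import Dict, List

# Assumption: both Translation Memory and Term Base map a Tibetan key
# to a list of candidate translations in the Target Language.
Dictionary = Dict[str, List[str]]


def replace_unambiguous(text: str, dictionary: Dictionary) -> str:
    """Replace every key that has exactly one candidate translation."""
    # Longest keys first, so a long entry is not pre-empted by a shorter
    # entry that happens to be contained in it.
    for key in sorted(dictionary, key=len, reverse=True):
        candidates = dictionary[key]
        if len(candidates) == 1:
            text = text.replace(key, candidates[0])
    return text


def populate_target_segment(
    source_segment: str,
    translation_memory: Dictionary,
    term_base: Dictionary,
) -> str:
    """Hypothetical helper following steps 1 to 4 of the outline."""
    # Step 1: the Target Segment starts as a copy of the Source Segment.
    target_segment = source_segment
    # Steps 2 and 3: phrases from Translation Memory are handled first.
    target_segment = replace_unambiguous(target_segment, translation_memory)
    # Step 4: whatever remains is handled word by word via the Term Base.
    target_segment = replace_unambiguous(target_segment, term_base)
    return target_segment


if __name__ == "__main__":
    # Illustrative entries only; these are not taken from any real Term Base.
    term_base = {"སངས་རྒྱས": ["buddha"], "ཆོས": ["dharma"]}
    print(populate_target_segment("སངས་རྒྱས་ཀྱི་ཆོས།", {}, term_base))
```

Plain substring replacement ignores Tibetan word boundaries, so a real implementation would most likely need proper tokenization before lookup.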

This will result in a state where each Target Segment will be in one of three states:

The first state will initially be very rare. The second will occur occasionally, and the last is the expected state.

What the human(s) will then have to do is up to four things, depending on the state:

This process will be performed segment-by-segment at the beginning of each new project, and again for any Source Segment whenever it is edited (because of a typo or for another reason).

What Exactly is the Logic for Finding Matches?

There are three cases for matching:

In the first case, the source language word will be automatically replaced by the target language word.

In the second case, the word will be shown in Source Language but will be highlighted, and the context menu will offer the available target language options from the custom dictionary.

In the third case, the word can be added to the Term Base.

The same logic applies to phrases.
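A minimal sketch of this three-way decision is given below. It assumes the three cases are: exactly one translation, more than one translation, and no entry at all. That reading is inferred from the handling described for each case and is not stated explicitly. The names `MatchCase` and `classify_match` are hypothetical, and the Term Base is modeled as a dictionary from a Tibetan word to a list of candidate translations.

```python
from enum import Enum
from typing import Dict, List, Tuple


class MatchCase(Enum):
    SINGLE = "single"      # one translation: replace automatically
    MULTIPLE = "multiple"  # several translations: highlight, offer options
    NONE = "none"          # no entry: the word can be added to the Term Base


def classify_match(
    source_word: str, term_base: Dict[str, List[str]]
) -> Tuple[MatchCase, List[str]]:
    """Return the match case and the candidate translations for a word."""
    candidates = term_base.get(source_word, [])
    if len(candidates) == 1:
        return MatchCase.SINGLE, candidates
    if len(candidates) > 1:
        return MatchCase.MULTIPLE, candidates
    return MatchCase.NONE, []
```

Because phrases follow the same logic, the same function could be called with Translation Memory in place of the Term Base.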

How About Non-Exact Matching?

Non-exact matches are excluded from the initial scope. The possibilities here fall roughly under three buckets:

How are texts loaded into the system?

Through a text area into which the text is copy-pasted. The text is then automatically segmented. When a user opens the system, it asks whether an existing project should be loaded or a new one started. If a new one is started, the project is given a name in the add text dialogue.

How is the text segmented?

The text will be segmented based on the shad (།) and similar punctuation marks.
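One possible way to do this is sketched below, under assumptions: the function name `segment_text` is hypothetical, the default delimiter set (shad U+0F0D and nyis shad U+0F0E) is a guess at what "and such" covers, and runs of delimiters separated by whitespace (such as "། །") are kept attached to the segment they close.

```python
import re
from typing import List

SHAD = "\u0f0d"       # ། TIBETAN MARK SHAD
NYIS_SHAD = "\u0f0e"  # ༎ TIBETAN MARK NYIS SHAD (assumed to be part of "and such")


def segment_text(text: str, delimiters: str = SHAD + NYIS_SHAD) -> List[str]:
    """Split text into segments, each ending with its closing delimiter(s)."""
    cls = re.escape(delimiters)
    # A run of non-delimiter characters, followed either by one or more
    # delimiters (with optional whitespace after each) or by the end of text.
    pattern = re.compile(rf"[^{cls}]+(?:(?:[{cls}]\s*)+|$)")
    segments = (match.strip() for match in pattern.findall(text))
    # Drop segments that were whitespace only.
    return [segment for segment in segments if segment]
```

Keeping the delimiter with its segment means the Source Text can be reassembled from the Source Segments without losing punctuation.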

Where do terms come from?

The terms come from the custom dictionary that will live in Dictionary-Service (repo is not live yet).

Where do phrases come from?

Same as words. Note that phrases will be handled first, and then what remains will be handled by words.

How does this connect with other systems/repos?

For this RFC to make sense, the following RFCs must be completed:

Padma Translate will directly interact with:

blahmonkey commented 2 years ago

@mikkokotila

Data Reducancy?

Typo there