czcorpus / InterText_editor

Editor for aligned parallel texts (personal desktop application).
http://wanthalf.saga.cz/intertext

Request for clarification(s) of export/import process #7

Open JoanEliot opened 1 year ago

JoanEliot commented 1 year ago

Thank you for this very useful software.

After aligning two texts, is there a way to export them, as segmented, and then re-import one of them with its segmentation preserved, so that a previously un-aligned text can now be aligned with those segments?

I'm wanting to align several texts to one reference text. I realize that InterText can only align/display two texts at a time, but my project involves five texts each: one in the original language, three translations, plus a transliterated version for the language that is written in a non-Latin alphabet.

I want to use one text (the one in the original language) as a source or reference version, against which each of the three other texts is aligned. Each segment should contain one and only one paragraph of the source version of the text. So after importing the source language text and one of the translations, first I verified that all the segments are set up correctly in the source version. Then I aligned the other text to that source version. Now I want to align each additional translation to the source version as already segmented.

I've been unable to get that to work by any of the export/re-import commands I've tried. The closest I've come is by defining a custom export profile that omits all XML tags and simply separates paragraphs with empty lines, and sentences with line breaks. But when I re-import the exported source language file along with another text I want to align with it, paragraph breaks are preserved correctly, yet each segment contains (at most) one sentence, not a whole paragraph. (For that matter, segments are lost even if I simply re-import both of the files I've just exported.)

The technique described in section 10.2 of the manual should work for aligning the translation of the version in the non-Latin script with its transliteration. Otherwise, though, aligning translations to each other rather than to the source language text would not be ideal for me.

I've looked for but haven't found an option that would automatically generate one and only one segment for each paragraph in one of the imported files. Any suggestions?

wanthalf commented 1 year ago

I am not sure I understand your problem. But I am afraid it is a common misunderstanding of the alignment principles.

InterText can align several versions of one text, but always as a pair of two versions at a time, and each pair independently, of course. The point is exactly that the alignment segments do NOT need to match across different pairs of text versions. If you want to group several parallel texts in one large "table" with common rows (segments) stretching across all text versions at the same time, then you do not need special software like InterText - you can just do that in any spreadsheet/table editor.

The problem that InterText wants to solve is that the linguistic segments/elements (which are present in one single text version) do not always match the alignment segments within a pair of contrasted texts. If they did (as they do if you have e.g. the Bible segmented into verses - which are not really linguistic units, but canonical text-specific segmentation units - corresponding across all languages), there would be no need to align the texts at all: you would just take the corresponding canonical segments (units) and set them 1:1 beside each other. The problem is that in one pair of texts, two linguistic units - such as sentences or paragraphs - may correspond to one, two or three sentences or paragraphs in the other text version. But that may not apply to some other translation in the same way and cannot simply be "transferred" to it.

So, you import all the text versions into InterText pre-segmented into textual/linguistic elements (or units) with the granularity you want to align - be it sentences or paragraphs. (InterText may also help you to segment paragraphs into sentences automatically on import, if you want the finer granularity but do not have it available.) Then, in InterText, you group the corresponding "textual elements or segments" (paragraphs or sentences) into "alignment segments". These groups may not be the same in each pair of text versions, and that is why you always align only one pair of text versions, and then the other pair independently. An "alignment segment" - unlike the "textual or linguistic element/segment" - does not inherently "exist" within one text itself. It only exists as a group of textual segments when contrasted with another text version and its (different) textual segmentation, because it is just a group of equivalent text elements/segments between the particular two text versions and nothing else.

Therefore, InterText lets you align always one pair of texts, so that you are not forced to deal with a potential cross-overlap of several different translations, which might force you to accumulate larger and larger overlapping groups ("alignment segments") of "textual elements/segments"...

For example: Let's say that in the pair of text versions A and B, the sentences (or paragraphs) correspond in two alignment groups: a first group with sentence A1 corresponding to sentences B1 and B2, and a second group with sentences A2 and A3 corresponding to B3. If you then added text version C, you might end up with some completely different groups: for example a first group with A1 and A2 corresponding to C1, and a second with A3 corresponding to C2 and C3. You cannot split sentence C1 according to your previous alignment of A1 with B1 and B2, and of A2 and A3 with B3. The grouping between A and B is completely irrelevant when you deal with A and C. Otherwise, you would need to forget and redo all the alignment of A and B with respect to C, and eventually end up with just one large alignment segment: the whole text A corresponding to the whole texts B and C, and there would be no finer alignment possible.
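
The collapse described in this example can be reproduced with a minimal Python sketch. It is only an illustration of the argument, not anything InterText does internally: each alignment is reduced to its groups of A-side sentence numbers, and the hypothetical helper `merge_source_groups` merges every group that shares a sentence with another one, which is what forcing common rows across all versions would require.

```python
def merge_source_groups(*alignments):
    """Merge A-side groups from several pairwise alignments into common rows.

    Each alignment is a list of sets of A-side sentence numbers.
    Groups that share any sentence are fused into one larger group.
    """
    merged = []
    for alignment in alignments:
        for group in alignment:
            group = set(group)
            # Collect every existing row that overlaps the new group.
            overlapping = [row for row in merged if row & group]
            for row in overlapping:
                group |= row
                merged.remove(row)
            merged.append(group)
    return [sorted(row) for row in sorted(merged, key=min)]


# A-side groups from the example: A-B has {A1}, {A2, A3}; A-C has {A1, A2}, {A3}.
ab = [{1}, {2, 3}]
ac = [{1, 2}, {3}]

print(merge_source_groups(ab))      # [[1], [2, 3]]
print(merge_source_groups(ab, ac))  # [[1, 2, 3]]
```

With only the A-B alignment, the two groups survive. Adding A-C fuses everything into a single row, which is the "one large alignment segment" outcome.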

So, you cannot transfer some "finished" alignment of one pair of texts into the alignment of other pair of texts without abandoning it completely and potentially losing the finer granularity of the alignment. You would always have to adapt to/align all the texts together at the same time.

If that is exactly what you want and need (at any price), then you really do not need InterText and you may use just any kind of simple table editor, where you compare and group all the text versions in columns at the same time. But I understand that larger texts may be more difficult to handle in a table editor if you have not mastered text processing with traditional unix-based command line tools, which can help you merge several texts into one large table and then split them again.

wanthalf commented 1 year ago

Or - if the problem is that you just want to segment a text into paragraphs (as real text segments inherent to one text version), then you have to do it in a text editor before importing the text version into InterText. ... for each text version independently, of course.

(You can later also correct any possible mistakes in the paragraph segmentation of each text in InterText, but it is not its primary goal.)

JoanEliot commented 1 year ago

Thank you very much for your detailed, clear answer, and for the time you devoted to it. I apologize for not making plain that I do understand the concepts you so well explain: I appreciate that InterText is designed for alignment of "meaning units" in pairs of corresponding texts. In a set of texts these meaning units often cross sentence/paragraph/etc boundaries, and may be distributed very differently in each text, so that the freedom to vary segmentation on a per-pair basis makes great sense.

In my situation, though, I've found it necessary to "lock" the paragraphs of all the translations to those in the source-language version. I recognize that there may be "meaning fragments" in the translations that cross the paragraph boundaries established by the source-language text, but in the texts I'm working with, such differences in meaning distribution are uncommon and trivial enough that the costs of re-segmenting the source-language text accordingly aren't justified by the modest increase in alignment accuracy.

I guess I can imagine a table editor that would work for what I'm doing, but I haven't been able to find one that has anything like InterText's facilities for identifying paragraphs and sentences separately within table cells, with each sentence ("element") maintained as a separate item, and for efficient manipulation of sentences and of paragraph identifiers.

So do you know of a table editor designed for parallel texts? Before I discovered InterText I had given up on Excel, Word, and a few dedicated table editors I had found. I wound up writing a relational database program that organizes parallel texts into a locked sequence of paragraphs, that allows for an arbitrary number of sentences per paragraph, per text, and that lets me annotate each text separately, sentence by sentence. My program works well enough, much better (for me) than a generic table editor, but it doesn't offer anywhere near the efficiencies of InterText for moving things around, combining and separating them, etc.

So for now, I will bite the bullet and use InterText to manually re-segment each original-language/translation pair in turn. I've used a custom export to add segment/paragraph numbers to the source plain text file containing the original language text. That makes it much quicker to re-segment the original, and then I can go ahead and line up the translation with it.

So again, thanks very much for this software. Were I to be making feature requests—which I would be out of place to do—these would be:

  1. Columns displaying the sentence and paragraph numbers for both texts, updated as the user moves/splits/combines sentences and adds or deletes paragraph marks. The numbering of segments at the far left is already very helpful.
  2. Highlighting for segments that have unequal numbers of elements on the two sides. 2:2, 3:3 segments, etc., would not be highlighted, but 2:3, 1:4, etc. would be highlighted.
  3. Of course, I would like the option to export and import segmentation, perhaps using something as simple as two blank lines as the default segment delimiter in plain text files. During import of files containing segmentation delimiters the program would require the user to identify the file whose segmentation will be preserved (only one file's segmentation scheme could be controlling, naturally).
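
To make request 2 concrete, here is a minimal Python sketch of the check it implies. The data layout (a list of segments, each a pair of element lists) and the function name `unequal_segments` are assumptions for illustration only and do not reflect InterText's internal structures.

```python
def unequal_segments(segments):
    """Return (segment number, "left:right" ratio) for segments whose
    two sides hold different numbers of elements."""
    flagged = []
    for number, (left, right) in enumerate(segments, start=1):
        if len(left) != len(right):
            flagged.append((number, f"{len(left)}:{len(right)}"))
    return flagged


# Hypothetical alignment: each segment is (source elements, target elements).
segments = [
    (["S1", "S2"], ["T1", "T2"]),        # 2:2, not highlighted
    (["S3", "S4"], ["T3", "T4", "T5"]),  # 2:3, highlighted
    (["S5"], ["T6", "T7", "T8", "T9"]),  # 1:4, highlighted
]

print(unequal_segments(segments))  # [(2, '2:3'), (3, '1:4')]
```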

wanthalf commented 1 year ago

Hm. Still not clear what exactly you want to achieve. If you want to align to some predefined/inherent units of the source text, you can just pre-define them in advance and then keep the alignment fixed to the units on the source side. E.g. you can keep the rows so that there is always exactly one paragraph on the source side and you only modify and re-segment the target text to fit the source paragraphs (and possibly break into smaller units where necessary). That is quite easy.

As for export: the ParaConc and TMX exporters show how to insert alignment segments into the exported files. But you do not need to export anything in order to import it back into InterText again. InterText keeps the inherent segmentation of the texts even if you change it while editing it within alignment with one of the target texts. The text is always kept in one instance only, so that a change of its structure always affects all its alignments (InterText only ensures that it does not break any of the other alignments and warns you if it does). It only separates the alignment grouping, but not the inherent structures of the texts.
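
The "one instance of each text, alignments kept separately" idea can be sketched as a small data structure. This is a hypothetical Python model for illustration, not InterText's actual storage format: the class `Project`, its fields and the index-based groups are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    # Each text version is stored exactly once: name -> list of elements.
    texts: dict = field(default_factory=dict)
    # Stand-off alignments: (source, target) -> list of (source ids, target ids).
    alignments: dict = field(default_factory=dict)

    def rows(self, source, target):
        """Resolve one pairwise alignment into rows of actual elements."""
        for src_ids, tgt_ids in self.alignments[(source, target)]:
            yield ([self.texts[source][i] for i in src_ids],
                   [self.texts[target][i] for i in tgt_ids])


project = Project(
    texts={
        "A": ["A1", "A2", "A3"],
        "B": ["B1", "B2", "B3"],
        "C": ["C1", "C2", "C3"],
    },
    alignments={
        ("A", "B"): [([0], [0, 1]), ([1, 2], [2])],
        ("A", "C"): [([0, 1], [0]), ([2], [1, 2])],
    },
)

for left, right in project.rows("A", "C"):
    print(left, "<->", right)
```

Text A appears once, and the two alignments only refer to its elements by index, so regrouping A against C leaves the A-B grouping untouched.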

As for the feature requests:

  1. Sentence and paragraph numbers are shown on mouseover. I have not thought of them as important enough to occupy space in the table view, but technically it would not be a big problem, just another "display option".
  2. Highlighting of segments with unequal numbers of elements would definitely not be any problem, just a minor modification and a new option. But maybe someone else would prefer a different option and it would be worth doing it more generally? I do not know. There has not been such a request yet.
  3. Inline segmentation of the inherent textual units is fully supported both on import and export. You can have your texts fully annotated in XML and the whole structure is kept. Plain text segmentation is a bit more limited than XML, of course, to the common types I have actually seen in use - that is one or two line breaks. I think it should be configurable quite freely, but there may be some limits. (Also, I am now a bit confused between the "InterText server" and "InterText editor" - both being more or less flexible but in different ways. The "server" is possibly easier to extend by editing and extending the export methods using PHP script, while the "editor" is more "hardwired" as a compiled C++ application.) Inline export of the alignment segments is actually possible as well (as in the ParaConc or TMX export), but that is always limited to a particular pair of texts, as I explained at length yesterday. Therefore, two exports of the same primary text version in the context of two different alignments will be different, because the alignment segment borders will be inserted (inline) at different places of the text in each case. And that is also the reason why this cannot be imported back into InterText easily and why the alignment is kept in the "stand-off" form by default: in InterText, one text version is only kept in one instance (however many alignments it takes part in) and thus only with its inherent structure present. Its alignments to different text versions must be kept separately (stand-off) from the text itself. So, while you could strip that non-inherent inline segmentation on import, it would be of no help, because you cannot just import the same text twice. On the second alignment import you could only import the second alignment information as stand-off information.

So, I still do not see a problem that you could not solve with InterText - you just have to make clear what is the inherent structure of the texts and what is just an alignment-dependent grouping. Then you may have some special demands on the correspondence between the inherent structural units and their alignment, but that should be more or less easily solvable, I hope.

I would also be glad to add new features, if I am clear about their purpose and if they do not break the principles of the application; but I have to admit I do not have that much time for that currently and I have not touched the application for quite some time, so I am becoming a bit afraid of what might get broken. Also, the Qt framework is developing: old versions are going out of use, and the new versions may require some unexpected changes and cause things to break. I remember I avoided compiling the last update for Linux, because the newer version of Qt would make the application stop working on not-so-old versions of Linux distributions that people may still be using. Yeah, those are the pitfalls of multiplatform application maintenance...

JoanEliot commented 1 year ago

Forgive my tardy reply, and thanks for your further efforts to help out. As I said before listing the additional features that I think would be helpful, those were only "imaginary requests." I understand entirely your concerns about breaking something; even if you had no Qt updates to deal with, I'm sure there are many other ways you might occupy your time.

I'm sorry I haven't been able to state clearly the problem I'm trying to solve, but from your most recent reply I think you do understand it. And you give a solution that addresses it directly:

If you want to align to some predefined/inherent units of the source text, you can just pre-define them in advance and then keep the alignment fixed to the units on the source side. E.g. you can keep the rows so that there is always exactly one paragraph on the source side and you only modify and re-segment the target text to fit the source paragraphs (and possibly break into smaller units where necessary). That is quite easy.

But this is exactly what I haven't been able to do. My plain text files have line breaks between sentences and empty lines between paragraphs. When I create a new alignment, InterText places each sentence in a separate segment, but I want segments to contain whole paragraphs, with each sentence an element within its segment/paragraph. I fiddled with Settings->Import but without success. Is there something I should be changing there? Should I change the "create elements" setting from the default entry, "p"? Any other pointers?
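
For reference, the plain text layout described here (line breaks between sentences, empty lines between paragraphs) can be read into an explicit paragraph/sentence structure with a short Python sketch. The wrapper element `text` and the element names `p` and `s` in the XML output are assumptions chosen for illustration; whether InterText's import expects exactly these names is not confirmed anywhere in this thread.

```python
import re
from xml.sax.saxutils import escape


def parse_paragraphs(text):
    """Split text into paragraphs (separated by empty lines),
    each a list of sentences (one per line)."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        sentences = [line.strip() for line in block.splitlines() if line.strip()]
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


def to_xml(paragraphs):
    """Wrap paragraphs and sentences in (assumed) <p> and <s> elements."""
    lines = ["<text>"]
    for sentences in paragraphs:
        lines.append("<p>")
        lines.extend(f"<s>{escape(sentence)}</s>" for sentence in sentences)
        lines.append("</p>")
    lines.append("</text>")
    return "\n".join(lines)


sample = "First sentence.\nSecond sentence.\n\nA new paragraph & more.\n"
paragraphs = parse_paragraphs(sample)
print(len(paragraphs), [len(p) for p in paragraphs])
print(to_xml(paragraphs))
```

Making the paragraph level explicit in this way keeps it distinct from the sentence level, which is the distinction the plain text import appears to lose.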

Regarding my imaginary feature request 3: more generally, it would be helpful to me if I could re-use the segmentation of one text as aligned so that this one "reference text" could be re-opened with those segments preserved. I could then open any other text and align that second text to the segments already present in the reference text, independent of, but preserving, the inherent structural units of the second text. It's clear from your replies that your concept for InterText is that alignments are only meaningful for specific pairs of texts. Thus, by design, only pairs of texts can be saved and reopened with preservation of alignment segmentation. In theory, the program could open only one text from a previously saved alignment and use the stand-off/separate alignment file to segment the chosen text. Then an arbitrary second text could be opened and aligned to the first. But that's not how the software works.