If TTX file is unsegmented, OmT will create faux segmentation

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create new TTX file
2. Open in OmT and then create target file
3. Open resulting TTX in TagEditor

What is the expected output?
No change in the file

What do you see instead?
Changes (see attachment)

What version of the product are you using? On what operating system?
OmT 2.2.3_1, okapi-pluginForOmegaT_all-platforms_0.10.zip, Trados 2007, WinXP 
Pro SP2.

Original issue reported on code.google.com by afrika...@gmail.com on 12 Feb 2011 at 6:43

Attachments:

[ttx filter omt 12feb11.zip](https://storage.googleapis.com/google-code-attachments/okapi/issue-164/comment-0/ttx filter omt 12feb11.zip)

GoogleCodeExporter commented 9 years ago

Thanks for the sample.
Will look at it.

Original comment by yves.sav...@gmail.com on 12 Feb 2011 at 8:53

Changed state: Accepted
Added labels: Component-TTXFilter
Removed labels: Component-

GoogleCodeExporter commented 9 years ago

Original comment by yves.sav...@gmail.com on 12 Feb 2011 at 8:57

GoogleCodeExporter commented 9 years ago

It seems the issue described is this: given a paragraph with 2 sentences in the 
original TTX, the resulting TTX has segment markers around the paragraph rather 
than around each of the sentences (even though each sentence was translated as a 
separate segment in OmegaT, as attested by the project TM).
That is the current behavior. The reason for this is related to the interface 
between the filter and OmegaT:

- the filter provides the TTX as is to OmegaT's interface: one entry = one 
paragraph since the TTX is un-segmented.

- OmegaT then applies its segmentation rules to each entry, resulting in 2 
segments in OmegaT's UI and TM

- then OmegaT puts the entry back together to pass it to the filter that creates 
the output. Because it is a translated entry the filter must put segment 
markers and has to do this for the whole entry because at its level it has no 
knowledge of the OmegaT segmentation.
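
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual filter or OmegaT code: `split_sentences`, `translate` and `round_trip` are hypothetical names, the regular expression is a crude stand-in for OmegaT's segmentation rules, and the `<Tu>` markup is simplified (no attributes, no `<Tuv>` children).

```python
import re


def split_sentences(paragraph):
    # Hypothetical stand-in for the tool's segmentation rules:
    # break after ., ! or ? when followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", paragraph.strip())


def translate(segment):
    # Placeholder "translation" so the sketch is runnable.
    return segment.upper()


def round_trip(paragraph):
    # Step 1: the filter hands over one entry per paragraph.
    # Step 2: the tool segments that entry for its UI and TM.
    segments = split_sentences(paragraph)
    # Step 3: the translated segments are joined back into one entry
    # before being returned to the filter.
    rejoined = " ".join(translate(s) for s in segments)
    # The filter only sees the rejoined entry, so it can only put
    # segment markers around the whole paragraph.
    return segments, "<Tu>" + rejoined + "</Tu>"
```

With a two-sentence paragraph, the tool side sees two segments while the output carries a single pair of markers, which is the behavior reported here.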

To get a TTX file with segment markers corresponding to sentences, one 
currently has to pre-segment the TTX file (in Trados or using Rainbow).

A fix would be either:

a) change the way OmegaT interacts with the filter so it exposes segments rather 
than "paragraph" entries.

b) change the way the filter works by adding a segmentation step before feeding 
the entries to OmegaT. This would also require the project to not segment 
(since it would be done already). Ideally such segmentation would use the same 
rules as OmegaT does. But--while close to SRX--the rules of OmegaT are 
currently proprietary rules and difficult to use in another software.
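
A rough sketch of what option (b) could look like, assuming the filter segments each paragraph itself and keeps the whitespace between sentences so the file can be reassembled. The names `presegment` and `write_output` are hypothetical, the regular expression is not OmegaT's or SRX's rule set, and the `<Tu>` markup is simplified.

```python
import re


def presegment(paragraph):
    # Split after sentence-ending punctuation and keep the separating
    # whitespace (captured group), giving alternating items:
    # sentence, whitespace, sentence, ...
    return re.split(r"(?<=[.!?])(\s+)", paragraph)


def write_output(parts, translate):
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 0 and part:
            # Each sentence is its own entry, so it gets its own markers.
            out.append("<Tu>" + translate(part) + "</Tu>")
        else:
            # Inter-sentence whitespace stays outside the markers.
            out.append(part)
    return "".join(out)
```

For example, `write_output(presegment("One. Two."), str.upper)` wraps each sentence separately. The project would then have to run with segmentation turned off, as noted in (b).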

Not sure how to resolve this.
But it's certainly a valid issue that needs to be addressed.

-ys

Original comment by yves.sav...@gmail.com on 12 Feb 2011 at 9:19

GoogleCodeExporter commented 9 years ago

In the meantime, a nice-to-have would be for OmegaT to refuse to attempt a 
file that is not source=target prepared (either without telling the user why 
his TTX file is being refused or by giving the user a helpful error message).  
But that's for the OmegaT people to decide, right?

Original comment by afrika...@gmail.com on 12 Feb 2011 at 9:35

GoogleCodeExporter commented 9 years ago

>>> It seems the issue described is that: Given a paragraph with 2 sentences in 
the original TTX, the resulting TTX has segment markers around the paragraph 
rather than around each of the sentence (while each sentence was translated as 
a different segment in OmegaT (as attested by the project TM)). <<<

Actually, the problem I have is that the file that OmegaT creates in the end is 
mangled, in the sense that both source and target fields are now translatable.  
I have translated them in the attachment.  This should not be possible.

Original comment by afrika...@gmail.com on 12 Feb 2011 at 9:44

Attachments:

[ttx filter omt 12feb11 more.zip](https://storage.googleapis.com/google-code-attachments/okapi/issue-164/comment-5/ttx filter omt 12feb11 more.zip)

GoogleCodeExporter commented 9 years ago

It seems at least one cause of the issue is that the entries delimited by the 
TTX filter include the initial line-break. It gets included after <Tu> rather 
than before, and this causes TagEditor to open a segment inside existing 
segments.
I'll see how we can fix this.
-ys

Original comment by yves.sav...@gmail.com on 13 Feb 2011 at 12:39

GoogleCodeExporter commented 9 years ago

I've made some changes to the filter so that, when the original content is 
unsegmented, the leading whitespace characters are moved outside of the created 
entries.
TagEditor seems to work better with the resulting TTX.
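
To illustrate the idea only (this is not the actual filter code): a minimal sketch, assuming an entry is a plain string and using a simplified `<Tu>` wrapper. The function name `wrap_entry` is hypothetical.

```python
def wrap_entry(text):
    # Separate the leading whitespace (including line-breaks)
    # from the rest of the entry.
    content = text.lstrip()
    leading = text[: len(text) - len(content)]
    # Emit the whitespace before the opening marker rather than after it.
    return leading + "<Tu>" + content + "</Tu>"
```

So an entry such as `"\n  Hello world."` would be written as `"\n  <Tu>Hello world.</Tu>"`, with the line-break before `<Tu>` rather than inside it.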

The handling of line-breaks between external codes is not changed yet. I guess 
we would need to force two segments in those cases. I want to test more files 
to see the implications on formats like HTML, etc. before implementing 
something.

The changes are in the latest snapshot (http://okapi.opentag.com/snapshots/)

Original comment by yves.sav...@gmail.com on 13 Feb 2011 at 3:39

GoogleCodeExporter commented 9 years ago

I just wanted to add that this is an important issue for me. While it is extremely 
useful to be able to translate an unsegmented TTX file directly in OmegaT, the 
resulting target TTX file with paragraph-level segmentation isn't always 
appropriate for delivery to clients because many clients really expect 
sentence-level segmentation. I try to translate pre-segmented TTX files where 
possible, but it's not always practical.
Will be looking forward to a solution. Thanks a lot!
Best regards,
Roman

Original comment by velior.i...@gmail.com on 27 Feb 2012 at 3:52
