byutrg / p5-Convert-TBX-UTX

A converter for termbase exchange files in UTX format to TBX:NNY (no name yet, aka TBX-Min) and a converter in the reverse direction.

no term extracted from simple file #9

Open garfieldnate opened 10 years ago

garfieldnate commented 10 years ago

Here's a sample TBX-Min file:

<TBX dialect="TBX-Min">
    <header>
        <id>TBX sample</id>
        <languages source="de" target="en"/>
    </header>
    <body>
        <entry id="C002">
            <langGroup xml:lang="en">
                <termGroup>
                    <term>dog</term>
                </termGroup>
            </langGroup>
        </entry>
    </body>
</TBX>

I ran that through tbxmin2utx and got this:

#UTX 1.11; de/en; Dictionary ID: TBX sample;
#src    tgt src:pos concept ID

So the dog term wasn't output in the UTX file.

SerdoSchofield commented 10 years ago

So, after some snooping I realized what the problem is. I designed the converter so that it requires a source-language term in order to work, because the UTX spec does not allow a blank source term. Even though 'dog' was present, it was a target-language term (the header declares source="de" and target="en") and there was nothing from the source language. So nothing was converted, because there was no way to produce valid UTX.

Upon further inspection I realized that the check I used to keep the UTX valid by preventing blank source terms is flawed: it also blocks conversion of a single source term that has no target term, even though target terms are allowed to be blank (which is kind of counter-productive...).
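To make the intended rule concrete, here is a minimal Python sketch (not the converter's actual Perl code). The function names `row_is_valid` and `format_row` are hypothetical, and the tab-separated row layout is an assumption based on the header shown in the sample output. It only illustrates the rule described in this thread: a row needs a non-blank source term, while the target term may be blank.

```python
def row_is_valid(src, tgt):
    """Rule as described in this thread: the source term must be
    non-blank; the target term is optional and is not checked."""
    return bool(src and src.strip())


def format_row(src, tgt=""):
    """Return a tab-separated 'src<TAB>tgt' row, or None when the row
    cannot be written (hypothetical helper, for illustration only)."""
    if not row_is_valid(src, tgt):
        return None
    return f"{src}\t{tgt or ''}"


# A source term without a target term should still produce a row.
print(repr(format_row("Hund")))
# A target term without a source term cannot be written.
print(repr(format_row("", "dog")))
```

With a check like this, the sample file from the report would still produce no rows (its only term is in the target language), but a file with a lone source term would.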

While I was trying a quick patch that just uses '-' as a placeholder for undefined source terms (although I don't think this is valid UTX), I realized there is a (perhaps) larger problem that I am going to have to spend the weekend on. Namely, the current converter cannot correctly convert a TBX file that has multiple source terms within a single language group (it only keeps the last one).
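A "keeps only the last one" symptom is typical of storing one value per language key. The Python sketch below is only an illustration of that pattern and of one possible fix; it is not taken from the converter, and the function names and sample terms are made up.

```python
from collections import defaultdict


def keep_last(term_groups):
    """One slot per language: each new term overwrites the previous one."""
    terms = {}
    for lang, term in term_groups:
        terms[lang] = term
    return terms


def keep_all(term_groups):
    """A list per language: every term in the group is preserved."""
    terms = defaultdict(list)
    for lang, term in term_groups:
        terms[lang].append(term)
    return dict(terms)


# Made-up sample: two source (de) terms and one target (en) term.
sample = [("de", "Hund"), ("de", "Koeter"), ("en", "dog")]
print(keep_last(sample))
print(keep_all(sample))
```

Collecting a list per language keeps every term, at the cost of having to decide later how multiple source terms map onto UTX rows.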

Thanks for pointing this out because I had missed that in all of my tests.

garfieldnate commented 10 years ago

Oh, I hadn't noticed that it was the target language. Well, I don't think there's any reason to necessarily disallow it in TBX-Min, so I guess it has to be accounted for. An <entry> can have 1 or more <langGroup>s, and a <langGroup> can have 1 or more <termGroup>s, so that is a concern, too.
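For reference, a rough Python sketch of walking that nesting with the standard library's `xml.etree.ElementTree`. The function name `collect_terms` and the extra German terms in the sample are invented for illustration; the element names follow the sample file at the top of this issue.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<TBX dialect="TBX-Min">
    <header>
        <id>TBX sample</id>
        <languages source="de" target="en"/>
    </header>
    <body>
        <entry id="C002">
            <langGroup xml:lang="de">
                <termGroup><term>Hund</term></termGroup>
                <termGroup><term>Koeter</term></termGroup>
            </langGroup>
            <langGroup xml:lang="en">
                <termGroup><term>dog</term></termGroup>
            </langGroup>
        </entry>
    </body>
</TBX>"""


def collect_terms(tbx_text):
    """Map entry id -> {language -> [terms]}, keeping every termGroup."""
    root = ET.fromstring(tbx_text)
    entries = {}
    for entry in root.iter("entry"):
        by_lang = {}
        for lang_group in entry.findall("langGroup"):
            lang = lang_group.get(XML_LANG)
            terms = [t.text for t in lang_group.findall("termGroup/term")]
            by_lang.setdefault(lang, []).extend(terms)
        entries[entry.get("id")] = by_lang
    return entries


print(collect_terms(SAMPLE))
```

An entry with only a target-language group simply yields a mapping with no source key, which the caller can then handle explicitly.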