freedict / tools

This repository contains all the tools of the FreeDict project. This includes the Make build system, various importer scripts, XSL conversion style sheets and more.
http://freedict.org

XSL: `<def/>` inserts too many spaces #13

Open humenda opened 6 years ago

humenda commented 6 years ago

A TEI element like this:

<sense>
  <def>my trans</def>
</sense>

Leads to:

1.
  my trans

This should be fixed.
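A minimal Python sketch of the presumably intended behaviour: the sense number and the definition end up on one line, with the indentation between tags collapsed. This is not the code in the XSL stylesheets. The helper names `normalize_space` and `render_sense` are hypothetical, and the sketch assumes the stray whitespace comes from indentation text nodes around `<def>`, which the issue does not confirm.

```python
import xml.etree.ElementTree as ET


def normalize_space(text: str) -> str:
    """Collapse whitespace runs and trim both ends, in the spirit of XPath normalize-space()."""
    return " ".join(text.split())


def render_sense(sense: ET.Element, number: int) -> str:
    """Render one sense as '<number>. <definition>' on a single line."""
    # itertext() yields the indentation between tags as well as the definition text.
    raw = "".join(sense.itertext())
    return f"{number}. {normalize_space(raw)}"


SNIPPET = """<sense>
  <def>my trans</def>
</sense>"""

print(render_sense(ET.fromstring(SNIPPET), 1))  # prints "1. my trans"
```

Note that Python's `str.split()` treats more characters as whitespace than XPath does, so this only approximates `normalize-space()`.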

bansp commented 3 years ago

Is this still a live issue? I try to avoid XSLT 1 at all costs, but I can have a look, especially if there's a real example that I can test the solution against.

karlb commented 3 years ago

The file

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="freedict-dictionary.css"?>
<?oxygen RNGSchema="freedict-P5.rng" type="xml"?>
<!DOCTYPE TEI SYSTEM "freedict-P5.dtd">
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:wikdict="http://www.wikdict.com/ns/1.0">
    <text>
        <body xml:lang="de">
            <entry>
                <form>
                    <orth>abel</orth>
                </form>
                <gramGrp>
                    <pos>suffix</pos>
                </gramGrp>
                <sense>
                    <cit type="trans" xml:lang="sv">
                        <quote>bar</quote>
                    </cit>
                    <sense>
                        <def>def1</def>
                    </sense>
                    <sense>
                        <def>def2</def>
                    </sense>
                </sense>
            </entry>
        </body>
    </text>
</TEI>

gives the following result when processed with xsltproc tei2c5.xsl def-example.tei:

abel
abel <suffix>
bar 2.
def1
 3.
def2

where I would have expected something like

abel
abel <suffix>
bar
  1. def1
  2. def2

I usually touch neither the XSL nor the c5 files, so I'm not sure this is the correct example for this problem.

bansp commented 3 years ago

Thanks Karl, this is an interesting example, even if a bit unusual. What is the structure of the information here, please? I understand the first <cit> that provides a translation equivalent in Swedish. Do the further <sense> elements define -abel in German?

The conversion script assumes a uniform sequence of <sense> elements, so I'm guessing it gets confused with

cit
sense
sense

and wrapping the <cit> into its own <sense> would help a lot (probably an @n attribute on <sense> would help as well, because that takes precedence over just counting the elements). But I don't want to speculate further, at this point.
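The guess above can be illustrated with a short Python sketch. It does not reproduce the logic of tei2c5.xsl. It only shows that numbering a nested sense by its position among all child elements yields 2 and 3 for Karl's example (matching the reported output), while counting only `<sense>` siblings yields 1 and 2. The function names are hypothetical, the TEI namespace is dropped for brevity, and the `n` override follows the precedence rule described in the comment above.

```python
import xml.etree.ElementTree as ET

# Karl's outer <sense>, reduced to the relevant children (namespace omitted).
ENTRY_SENSE = """<sense>
  <cit type="trans"><quote>bar</quote></cit>
  <sense><def>def1</def></sense>
  <sense><def>def2</def></sense>
</sense>"""


def number_by_position(parent):
    """Number each nested sense by its position among ALL child elements."""
    return [i for i, child in enumerate(parent, start=1) if child.tag == "sense"]


def number_by_sense_siblings(parent):
    """Number each nested sense among its sense siblings; an explicit n attribute wins."""
    labels = []
    for i, sense in enumerate(parent.findall("sense"), start=1):
        labels.append(sense.get("n", str(i)))
    return labels


outer = ET.fromstring(ENTRY_SENSE)
print(number_by_position(outer))        # [2, 3]
print(number_by_sense_siblings(outer))  # ['1', '2']
```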

humenda commented 3 years ago

and wrapping the <cit> into its own <sense> would help a lot (probably an @n attribute on <sense> would help as well, because that takes precedence over just counting the elements).

I don't really like the n attribute, because counting is something machines are better at than humans. Wrapping the cit in a sense feels more natural to me. Is the TEI standard concerned about this? If not, I would still prefer a stricter interpretation on our side; it makes things much easier to handle.
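A stricter interpretation could, for example, be enforced by a preprocessing step that wraps every non-`<sense>` child of a mixed `<sense>` in its own `<sense>`. The Python sketch below only illustrates that structural change. The function name `wrap_loose_children` is hypothetical, the TEI namespace is omitted, and whether the wrapped form still expresses what the dictionary author intended is not settled by this sketch.

```python
import xml.etree.ElementTree as ET

ENTRY_SENSE = """<sense>
  <cit type="trans"><quote>bar</quote></cit>
  <sense><def>def1</def></sense>
  <sense><def>def2</def></sense>
</sense>"""


def wrap_loose_children(outer: ET.Element) -> None:
    """If a sense mixes nested senses with other children, wrap each other child in its own sense."""
    children = list(outer)  # snapshot, because the tree is modified below
    if not any(child.tag == "sense" for child in children):
        return  # no nested senses, nothing to normalize
    for index, child in enumerate(children):
        if child.tag != "sense":
            wrapper = ET.Element("sense")
            # Keep the inter-element whitespace on the wrapper, not inside it.
            wrapper.tail, child.tail = child.tail, None
            outer.remove(child)
            wrapper.append(child)
            outer.insert(index, wrapper)


outer = ET.fromstring(ENTRY_SENSE)
wrap_loose_children(outer)
print([child.tag for child in outer])  # ['sense', 'sense', 'sense']
```

After this step the children form the uniform sequence of `<sense>` elements that the conversion script is said to assume.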

bansp commented 3 years ago

Agreed about @n, it's for when you want to override the machine. I threw together some entries at https://github.com/freedict/fd-dictionaries/tree/master/shared/testing, see https://github.com/freedict/fd-dictionaries/blob/master/shared/testing/test_1.xml . Going to add some more, and tinker inside that file. But for now, it can serve to illustrate why I asked about how the info in Karl's example was structured.

And yes, Sebastian, you're right: we need to constrain stuff ourselves, the TEI is just a toolkit. Fortunately, we have some emerging standards (and our practice) to guide us.

karlb commented 3 years ago

I understand the first <cit> that provides a translation equivalent in Swedish. Do the further <sense> elements define -abel in German?

Yes. The entry has two senses, but both translate to the same Swedish word. This is a very common thing in the WikDict dictionaries and this kind of grouping has made the output a lot more readable on www.wikdict.com, so I replicated it in the TEI files. I think I asked for suggestions on how to encode it when I first did this and this was the result.

For important words, there are multiple of these groups where one translation applies to multiple senses. I tried to keep the example minimal, so I included only one. See https://www.wikdict.com/de-en/haus for the HTML version of such a case.

bansp commented 3 years ago

Oh, that page looks really nice! Two questions:

You were absolutely right about the shorter example showing things better, but I lacked some context, now I have it. I probably wasn't part of that discussion that you mentioned, through my fault alone. A quick thought is that the example does make sense indeed, especially if you were to provide some more details for each of the (sub)senses, like PoS, pronunciation, etc. I can see now that it does make sense for the stylesheets to handle such structures better.

humenda commented 3 years ago

I can see now that it does make sense for the stylesheets to handle such structures better.

It would be great to have your solution documented in the HOWTO at https://github.com/freedict/fd-dictionaries/wiki/FreeDict-HOWTO-%E2%80%93-Writing-Text-Encoding-Initiative-XML-Files

Thanks

karlb commented 3 years ago

there are numerical references there (like [7]) that don't seem to point to an easily identifiable spot, are you going to make them work at some further step? I initially thought that it's just a matter of changing bullets into numbers, but that won't work

Those refer to specific sense numbers on the original Wiktionary pages. That's mostly used on complicated pages in the German Wiktionary and hard to match, due to the unstructured approach of Wiktionary (everything is just wiki markup). I might be able to do that at some point, but it is not an easy task, and it might require changes in the dbnary project I am building on.

(off-topic in this ticket, but you were expecting this ;-) ): what's the rule for the absence of slashes on the left vs. on the right, and are there some examples of a mix of slanted and square brackets, as you mentioned elsewhere?

My approach is to preserve things as they are in Wiktionary, assuming that the page authors know better than I do. Many Wiktionaries prefer one version (slashes or brackets) over the other and use that on most pages.

I would expect correct entries to be enclosed by slashes on both sides or brackets on both sides. Leaving one off or mixing them for a single pronunciation is wrong, as far as I know. When such cases happen, it could be wrong in the Wiktionary page or an error during the extraction (parsing sensible content from mostly presentational markup is messy). I have to investigate on a case-by-case basis to find out.
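The rule described above (slashes on both sides or brackets on both sides) could be checked mechanically to pick out candidates for case-by-case investigation. The following Python sketch is only an illustration: `classify_pronunciation` is a hypothetical helper, not part of wikdict-gen, and the sample strings are made up for the example rather than taken from WikDict data.

```python
def classify_pronunciation(text: str) -> str:
    """Return 'slashes', 'brackets', or 'suspicious' for a pronunciation string."""
    s = text.strip()
    # Require at least one character between the two delimiters.
    if len(s) > 2:
        if s[0] == "/" and s[-1] == "/":
            return "slashes"
        if s[0] == "[" and s[-1] == "]":
            return "brackets"
    return "suspicious"


# Illustrative inputs only.
for sample in ["/haʊs/", "[haʊs]", "/haʊs]", "haʊs/"]:
    print(sample, "->", classify_pronunciation(sample))
```

A "suspicious" result does not say where the problem originated. As noted above, it could be an error on the Wiktionary page or in the extraction.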

Feel free to open issues on https://github.com/karlb/wikdict-gen/ if you see anything suspicious.