freedict / tools

This repository contains all the tools of the FreeDict project. This includes the Make build system, various importer scripts, XSL conversion style sheets and more.
http://freedict.org

xsl/tei2c5.xsl: Very long runtime on large input #32

Open respiranto opened 3 years ago

respiranto commented 3 years ago

Given deu-eng-phonetics.tei, running

$ xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /path/to/deu-eng/ /path/to/xsl/tei2c5.xsl build/tei/deu-eng-phonetics.tei >build/dictd/deu-eng.c5

as per the Makefiles, takes about 24 hours.

The problem is most likely not limited to tei2c5.xsl.

The deu-eng dictionary is quite large and has many siblings below the entry level, in case that matters.
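
To put numbers on "quite large" and "many siblings", the input can be measured before touching the stylesheet. The following is only a sketch: it assumes the file uses the standard TEI P5 namespace and `entry` elements, and `entry_stats` is a hypothetical helper name, not part of the FreeDict tools.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the dictionary uses the standard TEI P5 namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def entry_stats(source):
    """Stream a TEI file and return (entry_count, max_fanout).

    max_fanout is the largest number of direct children found on any
    element at or below the entry level, i.e. the longest sibling run.
    """
    entries = 0
    max_fanout = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == TEI_NS + "entry":
            entries += 1
            # elem.iter() yields the entry itself and all its descendants.
            max_fanout = max(max_fanout, max(len(e) for e in elem.iter()))
            elem.clear()  # release the entry's subtree once it is counted
    return entries, max_fanout
```

`source` may be a file name such as `build/tei/deu-eng-phonetics.tei` or an open binary file object.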

See also #31.

humenda commented 3 years ago

This has been a long-standing issue. The problem seems to be somewhere in the way XPath expressions are built, resulting in path depths of more than 10,000,000. However, the XSL debugging tools available in the FOSS space are rather poor.
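
Without a good XSL debugger, one way to narrow this down is to check how the runtime grows with input size. The sketch below is not a diagnosis of the actual cause: it only times the `xsltproc` call from the issue on truncated copies of the dictionary and estimates the growth exponent. The helper names are hypothetical, and producing the truncated TEI files is left to the reader.

```python
import math
import subprocess
import time


def xsltproc_cmd(stylesheet, tei_file, dictname, search_path):
    """Build the xsltproc call quoted in the issue as an argument list."""
    return ["xsltproc", "--novalid", "--xinclude",
            "--stringparam", "dictname", dictname,
            "--path", search_path,
            stylesheet, tei_file]


def time_run(cmd):
    """Run one conversion, discard its output, return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
    return time.perf_counter() - start


def growth_exponent(sizes, seconds):
    """Least-squares slope of log(time) against log(size).

    A value near 1 suggests linear scaling, near 2 quadratic, and so on.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Feeding `growth_exponent` the entry counts and the `time_run` results for, say, three or four sample sizes gives a rough idea of whether the stylesheet scales linearly. `xsltproc` also has a `--profile` option that reports time spent per template, which may help locate the hot spot once a small but slow sample exists.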

There is a related question, and I will summarise the discussion around it here for completeness. Instead of fixing the old XSL style sheets, it might be worth reimplementing them in a different language. This would open the possibility of translating the TEI into an intermediate representation that can then be converted into multiple formats. PyGlossary has a similar goal, though it lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict. So in case somebody wants to look into something new for conversion, my opinion would be to drop the style sheets, which is why I bring it up in this issue. If somebody who writes better XSL than I do fixes the style sheets, this paragraph can be ignored.

karlb commented 3 years ago

> There is a related question, and I will summarise the discussion around it here for completeness. Instead of fixing the old XSL style sheets, it might be worth reimplementing them in a different language.

I also prefer working with languages other than XSL. Not because I think XSL is necessarily bad, but rather because I always have a hard time debugging non-trivial XSL problems.

> This would open the possibility of translating the TEI into an intermediate representation that can then be converted into multiple formats.

In my mind, our TEI files should be the intermediate representation and we should try to improve that instead of adding an additional representation.

> PyGlossary has a similar goal, though it lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict.

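I don't think we will be able to create a generic converter from TEI to detailed semantic representations. But most dictionary formats are essentially a mapping from headwords to formatted text. To handle those formats, I had good success with creating one format and then converting to other formats using PyGlossary. So the approach I have in mind is:

  • One TEI -> formatted text dict converter (e.g. StarDict)
  • Use PyGlossary to convert that to all other non-semantic formats
  • For each desired semantic target format (are there any at this point?), write a custom converter from TEI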

But ultimately, whoever does the work will get to decide, I assume. I didn't want to miss the opportunity to share my thoughts, though.
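
To illustrate the "mapping from headwords to formatted text" idea, here is a minimal sketch of the first step. It assumes the common FreeDict layout of `entry/form/orth` for headwords and `quote` elements for translations, plus the standard TEI P5 namespace. `tei_to_mapping` is a hypothetical name, and real dictionaries need far more care (grammar info, examples, references).

```python
import io
import xml.etree.ElementTree as ET

# Assumption: standard TEI P5 namespace.
TEI = "{http://www.tei-c.org/ns/1.0}"


def tei_to_mapping(source):
    """Return {headword: formatted text} for the entries of a TEI file."""
    collected = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != TEI + "entry":
            continue
        orth = elem.find(f"{TEI}form/{TEI}orth")
        # Simplification: every <quote> counts as a translation, including
        # quotes that belong to usage examples.
        quotes = [q.text.strip() for q in elem.iter(TEI + "quote") if q.text]
        if orth is not None and orth.text and quotes:
            # Homographs share a headword, so append instead of overwriting.
            collected.setdefault(orth.text.strip(), []).extend(quotes)
        elem.clear()  # release the entry's subtree once it is processed
    return {head: "\n".join(lines) for head, lines in collected.items()}
```

The resulting dictionary could then be written out in one simple format, for example a tab-separated text file, which to my knowledge PyGlossary can read, and converted onward from there.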

humenda commented 3 years ago

> I also prefer working with languages other than XSL. Not because I think XSL is necessarily bad, but rather because I always have a hard time debugging non-trivial XSL problems.

Agreed.

> This would open the possibility of translating the TEI into an intermediate representation that can then be converted into multiple formats.

> In my mind, our TEI files should be the intermediate representation and we should try to improve that instead of adding an additional representation.

IR is just a term from compiler construction. It is not another format but, I would say, a stricter version of TEI. TEI gets arbitrarily complex: gramGrp within cit or outside? Examples next to quotes or within cit? The IR basically defines abstractly that examples are attached to a particular translation, instead of allowing several different ways to encode it. But that is not terribly important; it is a question of the implementation. I have just been implementing something similar in another project.
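
As a rough illustration of what such a stricter representation might look like, here is a sketch using Python dataclasses. All class and field names are hypothetical and are not taken from any existing FreeDict code. The point is only that each piece of information has exactly one place: examples hang off a translation, and grammar information sits on the entry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    source: str
    target: str


@dataclass
class Translation:
    text: str
    # Examples are always attached to one translation, never floating.
    examples: List[Example] = field(default_factory=list)


@dataclass
class Sense:
    translations: List[Translation] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    # One slot for grammar info, wherever the TEI source had its gramGrp.
    part_of_speech: str = ""
    senses: List[Sense] = field(default_factory=list)


def render_plain(entry):
    """Render an entry as plain text; one of several possible back ends."""
    head = entry.headword
    if entry.part_of_speech:
        head += f" <{entry.part_of_speech}>"
    lines = [head]
    for number, sense in enumerate(entry.senses, start=1):
        lines.append(f"{number}. " + ", ".join(t.text for t in sense.translations))
        for translation in sense.translations:
            for ex in translation.examples:
                lines.append(f'   "{ex.source}" - "{ex.target}"')
    return "\n".join(lines)
```

A TEI reader would normalise the different encodings into these classes once, and every output format would then be a small function like `render_plain` over the same structure.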

> PyGlossary has a similar goal, though it lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict.

> I don't think we will be able to create a generic converter from TEI to detailed semantic representations. But most dictionary formats are mostly a mapping from headwords to formatted text. To handle those formats, I had

What do you mean by that? The goal is not to foster an ecosystem of output formats for all applications, but to have a parsed representation that can be transformed for dictionary formats. Semantic formats pose their own set of problems, but they're meant for dictionaries so should have a common ground. Of course, this conversion would still ignore any application other than dictionaries. But maybe you have specific dictionary formats in mind for which the semantic conversion would be hard.

> good success with creating one format and then converting to other formats using PyGlossary. So the approach I have in mind is:
>
>   • One TEI -> formatted text dict converter (e.g. StarDict)
>   • Use PyGlossary to convert that to all other non-semantic formats
>   • For each desired semantic target format (are there any at this point?), write a custom converter from TEI

That's an intermediate solution, but not really nice. Do you know of the XDXF project? I would prefer this approach. The argument against this approach is simply that a project of this size should not have multiple converters because maintaining our custom dialect across multiple tools is a nightmare.

karlb commented 3 years ago

I'm all for having a dictionary representation that is more strict. My hope was to maintain dictionaries directly in that format rather than having our dicts in generic TEI and converting to strict TEI. Is there any reason why we could not do that?

I would like to have diversity in dictionary applications and dictionaries rather than having a diversity of formats, each with a low number of applications. I'm also scared of debugging long conversion chains.

I have seen XDXF, but I didn't investigate it enough to see how well it works in practice. What would be the advantages over TEI? A stricter format definition? It looks like they also don't have a single semantic format as a conversion target and rely on PyGlossary for other formats.

This discussion is getting a bit off-topic for this issue. Maybe we should split part of it off into a separate issue if there is interest in more discussion.

respiranto commented 3 years ago

> This discussion is getting a bit off-topic for this issue. Maybe we should split part of it off into a separate issue if there is interest in more discussion.

The topic of the discussion seems to have become how to replace the XSL stylesheets (or: how to write exporters), which also seems to be the preferred solution to the original issue.

I'd say, we could just rename the issue.

Intertwined with the now-predominant topic is the question of how strict our format should be and, possibly, what it should look like at all.

On XSLT:

On IR:

On XDXF and Pyglossary: