danieldk / conllx

Go reader for the CONLL-X format

CoNLL-U format support #1

Open Shugyousha opened 8 years ago

Shugyousha commented 8 years ago

Would you merge CoNLL-U format support if I send a pull request?

Or do you think it would make more sense to build a separate library for CoNLL-U format support?

danieldk commented 8 years ago

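That's an interesting question... It seems to me that CoNLL-U introduces some incompatibilities:

  • In CoNLL-X there is a direct mapping from lines to tokens, in CoNLL-U not, since a line can also represent a span.
  • The PHEAD/PDEPREL columns are repurposed.
  • The POS tag columns get different meanings.

This implies a different reader and also a different token representation or in other words a new package :). So, I think it makes the most sense to:

  • Create a separate conllu package.
  • Create a package (conlldep?) that defines an interface for token methods that both conllx and conllu implement.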

Shugyousha commented 8 years ago

On Tue, Jun 21, 2016 at 01:14:03AM -0700, Daniël de Kok wrote:

That's an interesting question... It seems to me that CoNLL-U introduces some incompatibilities:

  • In CoNLL-X there is a direct mapping from lines to tokens, in CoNLL-U not, since a line can also represent a span.
  • The PHEAD/PDEPREL columns are repurposed.
  • The POS tag columns get different meanings.

True.

I was thinking of embedding conllx.Token into a "ConlluToken" type, which would give us most of the getter/setter methods for free. That's why I was thinking of adding the code to this package. There would be some wasted space due to the unneeded conllx.Token fields, though.
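A minimal sketch of the embedding idea. The `Token` type below is only a stand-in for the real conllx.Token; its fields and the `Form`/`SetForm` method names are assumptions for illustration, as are the `deps` field and `Deps` method on `ConlluToken`:

```go
package main

import "fmt"

// Token is a hypothetical stand-in for conllx.Token. The real type's
// fields and method names may differ.
type Token struct {
	form    string
	cposTag string // carried along even where CoNLL-U does not need it
	posTag  string
}

// Form returns the word form.
func (t *Token) Form() string { return t.form }

// SetForm sets the word form.
func (t *Token) SetForm(form string) { t.form = form }

// ConlluToken embeds Token, so Token's methods are promoted to it.
type ConlluToken struct {
	Token
	deps string // raw DEPS column (assumed extra field)
}

// Deps returns the raw DEPS column.
func (t *ConlluToken) Deps() string { return t.deps }

func main() {
	var tok ConlluToken
	tok.SetForm("Hund") // promoted from the embedded Token
	fmt.Println(tok.Form())
}
```

The embedded struct's fields are all still present in every `ConlluToken` value, which is the wasted space mentioned above.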

This implies a different reader and also a different token representation or in other words a new package :). So, I think it makes the most sense to:

  • Create a separate conllu package.
  • Create a package (conlldep?) that defines an interface for token methods that both conllx and conllu implement.

One possibility would be to just use your existing getter/setter methods as the interface definition.

Instead of coarse/fine-grained POS tags, the methods for conllu could return UPOSTAG and XPOSTAG. For the PHEAD/PDEPREL accessor methods in the conllu case we could return the parsed DEPS field which contains "secondary dependencies" in a 'headid:deprel' format.
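Parsing the DEPS field could look roughly like this. It is a sketch under a few assumptions: the `Dep` type and `ParseDeps` function are hypothetical names, entries are separated by `|` with `_` marking an empty field, heads are plain integers, and the relation is everything after the first colon (so a subtyped relation such as `nmod:poss` stays intact):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Dep is one secondary dependency (hypothetical type).
type Dep struct {
	Head int
	Rel  string
}

// ParseDeps parses a DEPS column such as "2:nsubj|4:nsubj".
// An empty field or "_" yields no dependencies.
func ParseDeps(field string) ([]Dep, error) {
	if field == "" || field == "_" {
		return nil, nil
	}
	parts := strings.Split(field, "|")
	deps := make([]Dep, 0, len(parts))
	for _, p := range parts {
		// Split on the first colon only; the relation may contain colons.
		kv := strings.SplitN(p, ":", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed DEPS entry %q", p)
		}
		head, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, fmt.Errorf("invalid head in DEPS entry %q: %v", p, err)
		}
		deps = append(deps, Dep{Head: head, Rel: kv[1]})
	}
	return deps, nil
}

func main() {
	deps, err := ParseDeps("2:nsubj|4:nmod:poss")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, d := range deps {
		fmt.Println(d.Head, d.Rel)
	}
}
```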

An issue that I can see with implementing such an interface is that the PHEAD/PDEPREL fields have only one value each while conllu's DEPS field contains a "list of secondary dependencies". We could opt for returning a slice in both cases though.
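To illustrate the slice-returning variant, here is a rough sketch. All names in it (`DepToken`, `SecondaryDeps`, `XToken`, `UToken`, `Dep`) are made up for the example and are not taken from conllx; the CoNLL-X side simply wraps its single PHEAD/PDEPREL pair in a slice of length zero or one:

```go
package main

import "fmt"

// Dep is one head/relation pair (hypothetical type).
type Dep struct {
	Head int
	Rel  string
}

// DepToken is a hypothetical interface that both token types satisfy.
type DepToken interface {
	Form() string
	SecondaryDeps() []Dep
}

// XToken stands in for a CoNLL-X token with at most one PHEAD/PDEPREL pair.
type XToken struct {
	form    string
	pHead   int
	pDepRel string
	hasP    bool // false when PHEAD/PDEPREL are unset
}

func (t XToken) Form() string { return t.form }

// SecondaryDeps wraps the single PHEAD/PDEPREL pair in a slice.
func (t XToken) SecondaryDeps() []Dep {
	if !t.hasP {
		return nil
	}
	return []Dep{{Head: t.pHead, Rel: t.pDepRel}}
}

// UToken stands in for a CoNLL-U token with an already parsed DEPS list.
type UToken struct {
	form string
	deps []Dep
}

func (t UToken) Form() string         { return t.form }
func (t UToken) SecondaryDeps() []Dep { return t.deps }

func main() {
	tokens := []DepToken{
		XToken{form: "sees", pHead: 2, pDepRel: "SBJ", hasP: true},
		UToken{form: "sees", deps: []Dep{{Head: 2, Rel: "nsubj"}, {Head: 4, Rel: "conj"}}},
	}
	for _, t := range tokens {
		fmt.Println(t.Form(), len(t.SecondaryDeps()))
	}
}
```

Callers can then range over the result without caring which format the token came from.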

For the interface package name, something implying that it contains (interface) types might be desirable: "conlltypes"? Not sure.

Do you think such an approach makes sense?