Open Shugyousha opened 8 years ago
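That's an interesting question... It seems to me that CoNLL-U introduces some incompatibilities:
- In CoNLL-X there is a direct mapping from lines to tokens, in CoNLL-U not since a line can also represent a span.
- The PHEAD/PDEPREL columns are repurposed.
- The POS tag columns get different meanings.
This implies a different reader and also a different token representation or in other words a new package :). So, I think it makes the most sense to:
- Create a separate conllu package.
- Create a package (conlldep?) that defines an interface for token methods that both conllx and conllu implement.
On Tue, Jun 21, 2016 at 01:14:03AM -0700, Daniël de Kok wrote: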
That's an interesting question... It seems to me that CoNLL-U introduces some incompatibilities:
- In CoNLL-X there is a direct mapping from lines to tokens, in CoNLL-U not since a line can also represent a span.
- The PHEAD/PDEPREL columns are repurposed.
- The POS tag columns get different meanings.
True.
I was thinking of embedding conllx.Token into a "ConlluToken" type and that way getting most of the getter/setter methods for free. That's why I was thinking of adding the code to this package. There would be some wasted space due to the unneeded conllx.Token fields, though.
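Roughly what I mean by embedding, as a minimal sketch. The Token type below is a simplified stand-in for conllx.Token, not the real one: its fields and method signatures are assumptions made up for illustration, and the real type has more columns.

```go
package main

import "fmt"

// Token is a simplified stand-in for conllx.Token. The fields and method
// signatures are hypothetical; the real type may differ.
type Token struct {
	form    string
	head    int
	phead   int    // only meaningful in CoNLL-X
	pdeprel string // only meaningful in CoNLL-X
}

func (t *Token) Form() string     { return t.form }
func (t *Token) SetForm(f string) { t.form = f }
func (t *Token) Head() int        { return t.head }
func (t *Token) SetHead(h int)    { t.head = h }

// ConlluToken embeds Token, so Form, SetForm, Head and SetHead are promoted
// to it without extra code. The embedded phead/pdeprel fields stay unused,
// which is the wasted space mentioned above.
type ConlluToken struct {
	Token
	deps string // raw DEPS column, e.g. "2:nsubj|4:nsubj"
}

func (t *ConlluToken) Deps() string     { return t.deps }
func (t *ConlluToken) SetDeps(d string) { t.deps = d }

func main() {
	tok := &ConlluToken{}
	tok.SetForm("dog") // promoted from the embedded Token
	tok.SetHead(2)     // promoted from the embedded Token
	tok.SetDeps("2:nsubj")
	fmt.Println(tok.Form(), tok.Head(), tok.Deps())
}
```

The upside is that nothing has to be reimplemented; the downside is that the CoNLL-X-only accessors would also be promoted if they exist on the embedded type, which might be confusing for CoNLL-U users.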
This implies a different reader and also a different token representation or in other words a new package :). So, I think it makes the most sense to:
- Create a separate conllu package.
- Create a package (conlldep?) that defines an interface for token methods that both conllx and conllu implement.
A possibility would be to just take your getter/setter methods for the interface definition.
Instead of coarse/fine-grained POS tags, the methods for conllu could return UPOSTAG and XPOSTAG. For the PHEAD/PDEPREL accessor methods in the conllu case we could return the parsed DEPS field which contains "secondary dependencies" in a 'headid:deprel' format.
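To make the shared-interface idea a bit more concrete, here is a minimal sketch. All names in it (Token, POSTags, XToken, UToken, describe) are placeholders I made up for illustration and not the existing conllx API.

```go
package main

import "fmt"

// Token is a hypothetical shared interface; the method names are
// placeholders, not the existing conllx API.
type Token interface {
	Form() string
	// POSTags returns (CPOSTAG, POSTAG) for CoNLL-X tokens and
	// (UPOSTAG, XPOSTAG) for CoNLL-U tokens.
	POSTags() (string, string)
}

// XToken is a stand-in for a CoNLL-X token.
type XToken struct{ form, cpos, pos string }

func (t XToken) Form() string              { return t.form }
func (t XToken) POSTags() (string, string) { return t.cpos, t.pos }

// UToken is a stand-in for a CoNLL-U token.
type UToken struct{ form, upos, xpos string }

func (t UToken) Form() string              { return t.form }
func (t UToken) POSTags() (string, string) { return t.upos, t.xpos }

// describe works on either format because it only uses the interface.
func describe(t Token) string {
	first, second := t.POSTags()
	return fmt.Sprintf("%s/%s/%s", t.Form(), first, second)
}

func main() {
	tokens := []Token{
		XToken{"dogs", "N", "NNS"},
		UToken{"dogs", "NOUN", "NNS"},
	}
	for _, t := range tokens {
		fmt.Println(describe(t))
	}
}
```

Code that only needs forms and tags could then accept the interface and would not have to care which format the tokens were read from.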
An issue that I can see with implementing such an interface is that the PHEAD/PDEPREL fields have only one value each while conllu's DEPS field contains a "list of secondary dependencies". We could opt for returning a slice in both cases though.
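A minimal sketch of the "slice in both cases" option. Dep, parseDeps and pheadDeps are hypothetical names. I am assuming that DEPS items are separated by "|", that "_" marks an empty column, and that the head is kept as a string rather than parsed into a number.

```go
package main

import (
	"fmt"
	"strings"
)

// Dep is a hypothetical (head, relation) pair.
type Dep struct {
	Head   string // kept as a string here; numeric parsing is left out
	DepRel string
}

// parseDeps splits a CoNLL-U DEPS value such as "2:nsubj|4:nsubj" into a
// slice. An empty column ("_") yields a nil slice.
func parseDeps(field string) ([]Dep, error) {
	if field == "_" || field == "" {
		return nil, nil
	}
	var deps []Dep
	for _, item := range strings.Split(field, "|") {
		// Split at the first colon only, since relation labels may
		// themselves contain colons (e.g. "nmod:poss").
		parts := strings.SplitN(item, ":", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed DEPS item %q", item)
		}
		deps = append(deps, Dep{Head: parts[0], DepRel: parts[1]})
	}
	return deps, nil
}

// pheadDeps wraps the single CoNLL-X PHEAD/PDEPREL pair in a slice so that
// both formats could share one accessor signature.
func pheadDeps(phead, pdeprel string) []Dep {
	if phead == "_" || phead == "" {
		return nil
	}
	return []Dep{{Head: phead, DepRel: pdeprel}}
}

func main() {
	deps, err := parseDeps("2:nsubj|4:nmod:poss")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(deps), deps[1].DepRel)
	fmt.Println(len(pheadDeps("3", "obj")))
}
```

With this shape a CoNLL-X token would return a slice with at most one element, and a CoNLL-U token one element per secondary dependency.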
For the interface package name, something implying that it contains (interface) types might be desirable: "conlltypes"? Not sure.
Do you think such an approach makes sense?
Would you merge CoNLL-U format support if I send a pull request?
Or do you think it would make more sense to build a separate library for CoNLL-U format support?