alpheios-project / arethusa-configs

Additional configuration files for Arethusa

english tagset #33

Open balmas opened 9 years ago

balmas commented 9 years ago

@gcelano provided tagsets based upon the Stanford Dependencies and asked that we make these available in Arethusa.

The tag sets he provided were:

https://github.com/gcelano/Stanford_Dependencies/blob/master/morph_tagset_arethusa.json
https://github.com/gcelano/Stanford_Dependencies/blob/master/syn_tagset_arethusa.json

balmas commented 9 years ago

Initial versions based upon these can be found in the english branch of arethusa-configs at:
https://github.com/latin-language-toolkit/arethusa-configs/blob/english/configs/arethusa.morph/en_attributes.json
https://github.com/latin-language-toolkit/arethusa-configs/blob/english/configs/arethusa.relation/english.json

These are deployed live on Perseids and can be accessed by using 'english' as the format of the treebank file.

These still need work, though.

balmas commented 9 years ago

@gcelano if I understood correctly what you are trying to do with the morphology, you want only one attribute, pos (part of speech), and each of the supplied values (CC, CD, DT, etc.) is a possible value for this attribute.

With the current aldt treebank schema, I believe the value an attribute contributes to the postag has to be a single character, so I arbitrarily assigned one character from a-z0-9 to each of the values in the en_attributes file. We should probably do something more sensible here.
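To make the arbitrary assignment concrete, here is a minimal sketch in Python of how each pos value could be given one character from a-z0-9. This is only an illustration under stated assumptions: the function names are hypothetical, only the three tags quoted above are used, and it is not the procedure actually used to produce the en_attributes file.

```python
import string

# Single-character codes available for the postag slot: a-z followed by 0-9.
CODE_ALPHABET = string.ascii_lowercase + string.digits


def assign_postag_codes(tags):
    """Map each tag to a unique single character, in the order given.

    Hypothetical helper: the assignment is arbitrary, as in the comment above.
    """
    unique_tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
    if len(unique_tags) > len(CODE_ALPHABET):
        raise ValueError(
            f"{len(unique_tags)} tags do not fit into "
            f"{len(CODE_ALPHABET)} single-character codes"
        )
    return dict(zip(unique_tags, CODE_ALPHABET))


def decode_postag(code, codes):
    """Return the tag that a single-character postag code stands for."""
    reverse = {char: tag for tag, char in codes.items()}
    return reverse[code]


# Only the tags quoted in this thread; the full tagset is in the linked files.
codes = assign_postag_codes(["CC", "CD", "DT"])
print(codes)                      # {'CC': 'a', 'CD': 'b', 'DT': 'c'}
print(decode_postag("b", codes))  # CD
```

The sketch also shows the limitation: a-z0-9 gives only 36 codes, and the characters carry no mnemonic relation to the tags they stand for.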

If and when we switch to the new version of the treebank schema that we had agreed upon, we would drop this postag attribute in favor of more descriptive attributes and would no longer be limited in this way.

Anyway, play around with it a bit and see what you think.

balmas commented 8 years ago

@gcelano has asked that this be made available on the treebank input form for Sunoikisis.