@yosiasz I think the general idea is to create new treebanks as opposed to expanding an existing one; in any case, you can then combine them at training time. The first step would be to find some free/open-source text (perhaps the Amharic Wikipedia?), then select some sentences and annotate them. You could ask the Stanford people how many tokens they would like.
I would be more than happy to assist in the process. You could also get in contact with the people who published the paper and see if they would be interested in participating.
Thanks very much @ftyers, I will look at the paper. For now we have our own resources.
@ftyers so this is what I am thinking. I am a native speaker and have development experience. I want to create the next treebank in an automated fashion: start small and keep building on it, vet the output, clean up any issues, fix the automation code, rerun. Wash, rinse, repeat. I don't see doing this manually, which could take a very long time. What do you think of this approach? Here are some resources I could use to build a corpus of data to automate the creation of a new treebank. I could be going about this totally the wrong way; if so, could you please provide me some guidance?
I have a few thoughts:
If you are interested in coming up with a project plan, feel free to get in contact with me. I'd be happy to discuss a way forward.
Agreed. The automation part is just to create a CoNLL-U file with the basic header info that does not need to be entered manually, which will then be manually verified.
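To make that concrete, here is a minimal sketch of what such an automation step might look like: it takes pre-tokenised sentences and writes a "shell" CoNLL-U file with the standard ten columns, leaving everything except ID and FORM as underscores to be filled in during manual annotation. The function name and file names are placeholders, not part of any existing tool.

```python
# Minimal sketch: write a "shell" CoNLL-U file from pre-tokenised sentences.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
def write_conllu_shell(sentences, path):
    """sentences: iterable of (sent_id, text, tokens); path: output .conllu file."""
    with open(path, "w", encoding="utf-8") as out:
        for sent_id, text, tokens in sentences:
            out.write(f"# sent_id = {sent_id}\n")
            out.write(f"# text = {text}\n")
            for i, form in enumerate(tokens, start=1):
                cols = [str(i), form] + ["_"] * 8  # leave everything else for the annotator
                out.write("\t".join(cols) + "\n")
            out.write("\n")  # a blank line terminates each sentence

if __name__ == "__main__":
    sents = [("am-example-1", "ሰላም ለዓለም ።", ["ሰላም", "ለዓለም", "።"])]
    write_conllu_shell(sents, "amharic_shell.conllu")
```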
You could start out with something like this using a rough script based on this. The tokenisation/segmentation would probably need to be fixed, but that probably needs to happen after the morphological disambiguation.
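For the tokenisation/segmentation part, a very rough first pass could look like the sketch below: it only splits on whitespace, the Ethiopic word separator (፡) and sentence/clause punctuation, and does not attempt any clitic segmentation the existing treebank may use, so its output would still need to be corrected during annotation.

```python
import re

# Very rough tokenisation sketch for Amharic: split on whitespace and the
# Ethiopic word separator (፡), and break punctuation off into its own token.
ETHIOPIC_PUNCT = "።፣፤፥፦፧፠፨"

def tokenise(text):
    spaced = re.sub(f"([{ETHIOPIC_PUNCT}])", r" \1 ", text)
    return [tok for tok in re.split(r"[\s፡]+", spaced) if tok]

def split_sentences(tokens):
    # Treat the Ethiopic full stop (።) and question mark (፧) as sentence boundaries.
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in "።፧":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```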
Amazing! Thanks, I will take a look.
OK, I started on the script; there are a couple of issues with l3. Once I sort those out I am going to feed this baby a bigger and bigger corpus of data and see what we can come up with. I don't think I will even need Stanza anymore; I could just go straight from this new treebank to spaCy. Thanks much!
Note that the script won't give you a treebank; it will just give you the tokenisation (roughly) and the undisambiguated morphology. There is still the annotation step for the syntactic structure. For that you could use something like UD Annotatrix or CoNLL-U Editor.
But yes, once the treebank is prepared it should be possible to just input it into spaCy, or any other tool supporting CoNLL-U. I have been getting some good results from UDPipe 2.
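For the spaCy route, a minimal sketch of that last step (assuming spaCy v3; the file and directory names here are just placeholders) is to convert the CoNLL-U files to spaCy's binary format with `spacy convert` and then train from a config:

```python
import os
import subprocess

# Sketch: convert CoNLL-U files to spaCy's binary .spacy format, then train.
# Assumes spaCy v3 is installed and a config.cfg has been generated separately
# (e.g. with `spacy init config`); the file names are placeholders.
os.makedirs("corpus", exist_ok=True)
for split in ("am_train.conllu", "am_dev.conllu"):
    subprocess.run(
        ["python", "-m", "spacy", "convert", split, "corpus/", "--converter", "conllu"],
        check=True,
    )

subprocess.run(
    ["python", "-m", "spacy", "train", "config.cfg",
     "--paths.train", "corpus/am_train.spacy",
     "--paths.dev", "corpus/am_dev.spacy"],
    check=True,
)
```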
Yes, I get that. It won't give me a finished treebank, but it will give me the basic entries that I don't have to enter manually. I could populate a shell CoNLL-U file with 100K entries in seconds, then go back in and add annotations manually.
@ftyers where is this l3 Python module found?
Greetings,
I would like to request a bigger treebank for Amharic. Alternatively, how can one contribute to this existing treebank?
https://universaldependencies.org/treebanks/am_att/index.html
Thanks