UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Amharic treebank tiny #758

Closed · yosiasz closed this issue 3 years ago

yosiasz commented 3 years ago

Greetings,

I would like to request a larger treebank for Amharic, or to ask how one can contribute to the existing treebank.

https://universaldependencies.org/treebanks/am_att/index.html

Thanks

ftyers commented 3 years ago

@yosiasz I think the general idea is to create new treebanks as opposed to expanding an existing one; in any case you can then combine them at training time. The first step would be to find some free/open-source text (perhaps the Amharic Wikipedia?), then select some sentences and annotate them. You could ask the Stanford people how many tokens they would like.
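As a starting point for collecting raw text, here is a minimal sketch of pulling plain-text extracts of random articles from the Amharic Wikipedia through the standard MediaWiki API (the endpoint and parameters are the stock API ones; the function name and the choice of random article intros are just illustrative):

```python
# Sketch: fetch plain-text intros of random Amharic Wikipedia articles
# through the MediaWiki API, as one possible source of raw sentences.
import requests

API = "https://am.wikipedia.org/w/api.php"

def random_intros(n=5):
    """Return plain-text intro sections of n random main-namespace articles."""
    r = requests.get(API, params={
        "action": "query",
        "format": "json",
        "generator": "random",
        "grnnamespace": 0,      # main (article) namespace only
        "grnlimit": n,
        "prop": "extracts",
        "explaintext": 1,       # strip wiki markup / HTML
        "exintro": 1,           # intro section only (full-page extracts are limited per request)
        "exlimit": "max",
    })
    pages = r.json().get("query", {}).get("pages", {})
    return [p.get("extract", "") for p in pages.values()]

if __name__ == "__main__":
    for text in random_intros(3):
        print(text[:200])
```

Wikipedia text is CC BY-SA, which a number of UD treebanks already use, as long as the licence is noted in the treebank metadata.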

I would be more than happy to assist in the process. You could also get in contact with the people who published the paper and see if they would be interested in participating.

yosiasz commented 3 years ago

Thanks very much @ftyers, I will look at the paper. For now we have our own resources.

yosiasz commented 3 years ago

@ftyers so this is what I am thinking. I am a native speaker and have development experience, and I want to create the next treebank in an automated fashion: start small and keep building on it, vet the output, clean up any issues, fix the automation code, rerun. Wash, rinse, repeat. I don't see doing this manually, which could take a very long time. What do you think of this approach? Here are some resources I could use to build a corpus of data to automate the creation of a new treebank. I could be going about this totally in the wrong way; if so, could you please provide me some guidance?

https://am.wikipedia.org/
https://amharic.voanews.com/

ftyers commented 3 years ago

I have a few thoughts:

If you are interested in coming up with a project plan, feel free to get in contact with me. I'd be happy to discuss a way forward.

yosiasz commented 3 years ago

Agreed. The automation part is just to create a CoNLL-U file with the basic header info that does not need to be entered manually; the file will then be manually verified.
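For example, a minimal sketch of such a skeleton generator could look like the following. It assumes nothing more than a deliberately naive whitespace/punctuation tokenisation (Ethiopic punctuation such as ። and ፡ is handled only crudely), leaves every analysis column as `_` for later manual annotation, and uses purely illustrative file, sentence-id and example-sentence values:

```python
# Sketch: turn raw sentences into a skeleton CoNLL-U file with placeholder columns.
# Tokenisation is deliberately naive; all analysis columns are written as "_"
# so they can be filled in manually or by a later tool.
import re

# Peel off Ethiopic and Latin punctuation as separate tokens, keep everything else whole.
TOKEN_RE = re.compile(r"[፡።፣፤፥፦!?,.:;]|[^\s፡።፣፤፥፦!?,.:;]+")

def naive_tokenise(sentence):
    return TOKEN_RE.findall(sentence)

def to_conllu(sentences, sent_id_prefix="am_new"):
    blocks = []
    for i, sent in enumerate(sentences, start=1):
        lines = [f"# sent_id = {sent_id_prefix}-{i}", f"# text = {sent}"]
        for j, tok in enumerate(naive_tokenise(sent), start=1):
            # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
            lines.append("\t".join([str(j), tok] + ["_"] * 8))
        blocks.append("\n".join(lines))
    # Each sentence block is followed by a blank line, as CoNLL-U requires.
    return "\n\n".join(blocks) + "\n\n"

if __name__ == "__main__":
    sents = ["ኢትዮጵያ በአፍሪካ ቀንድ የምትገኝ አገር ናት።"]
    with open("am_new-skeleton.conllu", "w", encoding="utf-8") as f:
        f.write(to_conllu(sents))
```

A file like this should open in an annotation tool, but the HEAD/DEPREL columns still have to be filled in before it would pass the UD validator.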

ftyers commented 3 years ago

You could start out with something like this, using a rough script based on this. The tokenisation/segmentation would probably need to be fixed, but that fix probably needs to happen after the morphological disambiguation.

yosiasz commented 3 years ago

Amazing! Thanks, I will take a look.

yosiasz commented 3 years ago

OK, I started on the script; there are a couple of issues with l3. Once I sort that out I am going to feed this baby a bigger and bigger corpus of data and see what we can come up with. I don't think I will even need Stanza anymore; I could go straight from this new treebank to spaCy. Thanks much!

ftyers commented 3 years ago

Note that the script won't give you a treebank; it will just give you the tokenisation (roughly) and the undisambiguated morphology. There is still the annotation step for the syntactic structure. For that you could use something like UD Annotatrix or CoNLL-U Editor.

But yes, once the treebank is prepared it should be possible to feed it straight into spaCy, or any other tool that supports CoNLL-U. I have been getting good results with UDPipe 2.
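For spaCy specifically, the usual route is the `spacy convert` CLI, which turns CoNLL-U into spaCy's binary training format, e.g. `python -m spacy convert am_new-ud-train.conllu ./corpus --converter conllu` (the file and directory names here are just placeholders). UDPipe 2 is trained from CoNLL-U directly.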

yosiasz commented 3 years ago

Yes, I get that. It won't give me a finished treebank, but it will give me the basic entries that I don't have to enter manually. I could populate a shell CoNLL-U file with 100K entries in seconds, then go back in and add the annotations manually.

yosiasz commented 3 years ago

@ftyers where is this l3 Python module found?

ftyers commented 3 years ago

@yosiasz I got it from here. By the way, we might want to take this discussion to email, to avoid spamming the other developers' inboxes. :)