UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Advice on contributing a new language (Faroese) #336

Closed (Bjartensen closed 8 years ago)

Bjartensen commented 8 years ago

Hello, I hope this is an acceptable place to pose this question.

There is a 100k-word, hand-tagged POS corpus in a language not currently in the UD list, and I was wondering how compatible with UD, and how strong a starting point, the corpus is.

It is tagged according to this scheme. So a word could be tagged with SFSNP, meaning it's a substantive (noun), feminine, singular, nominative, proper noun. I am not well versed in NLP/comp.ling., but I've been reading a bit on the Universal Dependencies website, and it looks like you need POS tags and "features". The POS tags seem much coarser than the above tag set, so I'm assuming some of the tags contain features.

If someone could comment on this, that would be really great. I want to know how good I can expect the corpus to be before starting on formatting it in the desired way. If the particular tag set I have linked is standard for some corpora, and there are tools to convert them to other formats, such information would be super helpful as well.

Also, if there are suggestions for features that should be added to the corpus, that would be appreciated as well. There has been some discussion on improving the current corpora we have, so knowing what to improve would be really helpful.

All of this was sparked by the Google SyntaxNet parser; getting a parser for our language is highly desirable. But it's just as important for us to standardize our NLP resources so that our language can easily tag along with cutting-edge NLP applications.

I'm attaching the tagset pdf, but if it's bad practice to trust pdf files, I did link an Imgur screenshot earlier. tagset.pdf

dan-zeman commented 8 years ago

Hi @Bjartensen, welcome to the UD community! It's great to hear that Faroese can join the family of languages covered by UD.

As for "contributing the new language", there are two possible steps. The first one is that we can create a section about Faroese in the documentation and you can describe your language-specific examples there (possibly using documentation of another language, e.g. Swedish, as a model). The second step would be to actually release your data, but for that we would also need the dependency annotation (not only parsed but also manually checked), as UD-released data are treebanks. (On the other hand, lemmas and features are optional.)

As for your second question, I will have to look at your tagset description to see whether I can comment/advise on it. In general, morphological tagsets like yours are converted in UD partially as POS tags (e.g. SFSNP would result in PROPN) and the info that does not fit in POS tags is represented using features (e.g. SFSNP would furthermore generate Case=Nom|Gender=Fem|Number=Sing).
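As a rough illustration of that split, here is a minimal Python sketch. It assumes the tag is positional in the order given in the description above (word class, gender, number, case, proper-noun flag), and the lookup tables and the function name `convert_noun_tag` are hypothetical: they cover only the letters of the SFSNP example, not the real tagset. An actual conversion would go through a full tagset description rather than a hand-written table like this.

```python
# Hypothetical lookup tables: only the letters from the SFSNP example are
# covered; the real tagset defines many more values.
GENDER = {"F": "Fem"}
NUMBER = {"S": "Sing"}
CASE = {"N": "Nom"}


def convert_noun_tag(tag):
    """Split a five-letter noun tag such as SFSNP into (UPOS, FEATS)."""
    if len(tag) != 5 or tag[0] != "S":
        raise ValueError(f"not a five-letter noun tag: {tag!r}")
    # Assumption: a final P marks a proper noun; anything else is
    # treated as a common noun.
    upos = "PROPN" if tag[4] == "P" else "NOUN"
    feats = {
        "Gender": GENDER[tag[1]],
        "Number": NUMBER[tag[2]],
        "Case": CASE[tag[3]],
    }
    # UD writes features sorted by name and joined with "|".
    feats_str = "|".join(f"{name}={value}" for name, value in sorted(feats.items()))
    return upos, feats_str


print(convert_noun_tag("SFSNP"))
# ('PROPN', 'Case=Nom|Gender=Fem|Number=Sing')
```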

Best, Dan

Bjartensen commented 8 years ago

@dan-zeman yeah the idea was to actually release the corpus, but I expect that I will have to write some scripts to convert the tags to some other format. I will have to ask for permission before releasing the actual data, but I don't expect any problem there.

Thanks for the quick reply!

dan-zeman commented 8 years ago

The tagset looks pretty standard and it could be converted via Interset (http://ufal.mff.cuni.cz/interset) where I just added the description of your tagset (https://github.com/dan-zeman/interset/commit/17bd81eda41d233b1441c9b3ead13a0dbfdb6e10). However, there are some pieces missing in the description and I would need your help to fix it. More in a separate issue: https://github.com/dan-zeman/interset/issues/3

ftyers commented 8 years ago

Hey! Trond Trosterud has written a rule-based parser that might be partially applicable. [1][2]

I'd be happy to help out with any conversion tasks.

PS. Where can this 100k word corpus be found ? :D

  1. http://www.lrec-conf.org/proceedings/lrec2010/pdf/254_Paper.pdf
  2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.574.7029&rep=rep1&type=pdf

fginter commented 8 years ago

Hi, I added Faroese into the docs and main page, and created the repository. Welcome on board. :)

Filip

ftyers commented 8 years ago

Is the corpus already dependency analysed, or just part-of-speech tagged ?

If it is not dependency analysed, and the licence is not free, then it might be worth taking another corpus (e.g. Wikipedia) and analysing that.

I've extracted some sentences from the Faroese Wikipedia, and run them through the tools I mentioned earlier. The dependency syntax rules aren't working (I've let Trond know about it), but I wrote a few UD-specific rules and I think it would be fairly easy to edit them to add the heads/labels (and of course convert to CoNLL-U and UD-compatible POS/feature labels).

https://svn.code.sf.net/p/apertium/svn/languages/apertium-fao/texts/wikipedia.vislcg.txt
https://svn.code.sf.net/p/apertium/svn/languages/apertium-fao/texts/puupankki/puupankki.fao.conllu

Bjartensen commented 8 years ago

Thanks for the replies everyone!

I have just sent an email asking for permission to get Faroese into UD, but I'll go ahead and link you to the corpus. It's on a website anyone can access, and has some GPL and Mozilla open source licenses attached, so it should be good.

You can find the corpus on this site, on the bottom ("Tak niður" is download). Direct link.

@ftyers That sounds great! I don't think the corpus is dependency analyzed; it just says "POS-tagged corpus of faroese".

@fginter Thanks!

ftyers commented 8 years ago

@Bjartensen ok! ... Yes, I took a look and the corpus is POS tagged, but without lemmas or dependency analysis. In order to make this useful for UD it would need to be annotated for dependency structure. If you are interested in doing this, please feel free to contact me and I would be happy to help out... :) Around 1,500 sentences, or 10,000 tokens would be a good baseline to start training a statistical parser on.
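To keep track of progress towards a baseline like that, a small counting script can help. The following is only a sketch: the function name `count_conllu` and the sample data (placeholder tokens `w1` to `w3`) are made up for illustration. It relies on the CoNLL-U conventions that sentences are separated by blank lines, comment lines start with `#`, and ordinary tokens have a plain integer ID.

```python
def count_conllu(text):
    """Return (sentences, tokens) for a string of CoNLL-U data."""
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment, e.g. "# sent_id = 1".
            continue
        token_id = line.split("\t", 1)[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if not token_id.isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return sentences, tokens


# Placeholder sample: two sentences, three tokens, no real annotation.
sample = (
    "# sent_id = 1\n"
    "1\tw1\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tw2\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tw3\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "\n"
)
print(count_conllu(sample))
# (2, 3)
```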

dan-zeman commented 8 years ago

@Bjartensen thanks for the link! I completed coverage of the tagset by Interset, converted the corpus and added it to the UD_Faroese repository. Redistribution is OK, they have released it under GPL (and also LGPL and MPL). However, if you know where to ask about licensing, it won't hurt if you do so, because we would actually prefer to use one of the CreativeCommons licenses; the reasons are discussed in #296.

Yes, there is only morphology at the moment. I also noticed that sentence segmentation is not perfect, so it should be improved before any syntactic annotation takes place.
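For readers unfamiliar with the format, a morphology-only token line in CoNLL-U looks roughly like what the sketch below produces. The helper name `conllu_token` and the placeholder word form are made up for illustration; the ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) are those of the CoNLL-U format, with `_` standing for a missing value. The sketch is not the conversion actually used for UD_Faroese.

```python
def conllu_token(idx, form, upos, xpos, feats):
    """Build one ten-column CoNLL-U line with no lemma and no syntax."""
    columns = [
        str(idx),      # ID
        form,          # FORM
        "_",           # LEMMA (the corpus has no lemmas)
        upos,          # UPOS
        xpos,          # XPOS: the original corpus tag
        feats or "_",  # FEATS
        "_",           # HEAD (no dependency annotation yet)
        "_",           # DEPREL
        "_",           # DEPS
        "_",           # MISC
    ]
    return "\t".join(columns)


# "w1" is a placeholder word form, not real corpus data.
line = conllu_token(1, "w1", "PROPN", "SFSNP", "Case=Nom|Gender=Fem|Number=Sing")
print(line)
```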

dan-zeman commented 8 years ago

@Bjartensen, I have generated an invitation for you to UniversalDependencies; look in your e-mail. After accepting the invitation you should have push access to UD_Faroese and edit rights for the language-specific documentation on universaldependencies.org.

Add your name to the Contributors line in the README.txt file (more on metadata format here: http://universaldependencies.org/language_metadata.html).

Bjartensen commented 8 years ago

@dan-zeman Thanks! I added my name to the contributors field.

I got a reply on the license, and they said there is no problem using it. Not sure about changing licenses, but I could ask them. How important is it to change the license at this time?

ftyers commented 8 years ago

@Bjartensen not important at the moment. Do enquire if you can use it under the CC-BY-SA licence too though.

Bjartensen commented 8 years ago

@dan-zeman @ftyers How would one best proceed with improving/adding to the corpus? Does it involve manual linguistic work? If it does, is a native speaker like me enough, or do we need to pull Faroese linguists into this? FMD (Føroyamálsdeilin, or the Faculty of Language and Literature of the University of the Faroe Islands) have expressed interest in improving the corpus. But I'd like to spare them the work at this time if possible.

Bjartensen commented 8 years ago

@ftyers ok, I'll ask them the next time I correspond with them.

ftyers commented 8 years ago

@Bjartensen ... the next step is to start analysing the sentences for dependency structure. You don't need to be a trained linguist, but if you've never done dependency analysis before you will likely need to do some reading and receive some help (and training). I'm more than happy to give you a brief introduction online if you have some means of chat-like communication. If you know IRC, I'm on #_u-dep on irc.freenode.net, otherwise email me for other contact methods.

dan-zeman commented 8 years ago

OK, I think we can close this issue, as the seed has been planted (feel free to reopen if you think otherwise). @Bjartensen, @ftyers I'm leaving it to you but let me know if I can be of any help of course.

Bjartensen commented 8 years ago

@dan-zeman alright thanks for all the help!

@ftyers yes I'm interested in seeing how I can start analyzing dependency structure. I'll hang around in IRC.