unannotated text in UD languages?

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0


unannotated text in UD languages? #272

Closed jeisner closed 7 years ago

jeisner commented 8 years ago

Suppose someone wanted additional unannotated text in each of the UD languages -- e.g., to compute word embeddings, or for some other purpose. Is there a standard, curated collection of this sort?

Thanks! -jason

fginter commented 8 years ago

@jeisner: There is no such official collection, but at least the W2C dataset has data for the majority of UD languages. Some of these corpora might not be large enough to induce embeddings, but many will be.

Filip

jeanm commented 8 years ago

I noticed UD_French has a comment above each annotated sentence with the unannotated text. For example:

# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1   Aviator _   PROPN   _   _   0   root    _   _
2   ,   _   PUNCT   _   _   1   punct   _   _
3   un  _   DET _   _   4   det _   _
4   film    _   NOUN    _   _   1   appos   _   _
5   sur _   ADP _   _   7   case    _   _
6   la  _   DET _   _   7   det _   _
7   vie _   NOUN    _   _   4   nmod    _   _
8   de  _   ADP _   _   9   case    _   _
9   Hughes  _   PROPN   _   _   7   nmod    _   _
10  .   _   PUNCT   _   _   1   punct   _   _

Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional # sentence-text: <unannotated text> line before each annotated sentence.
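To make the idea concrete, here is a rough Python sketch of how raw text could be pulled out of such files. It assumes the `# sentence-text:` prefix seen in the UD_French example above; other treebanks use different comment formats, so the prefix is a parameter. The function name `iter_sentence_texts` and the second sample sentence are made up for illustration. When the comment is missing, the sketch falls back to joining the FORM column with spaces, which only approximates the original text.

```python
def iter_sentence_texts(lines, prefix="# sentence-text:"):
    """Yield one raw-text string per sentence block of CoNLL-U-style input.

    `prefix` is an assumption based on the UD_French example; it is not
    a standardized comment format.
    """
    text, forms = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence block.
            if text is not None or forms:
                yield text if text is not None else " ".join(forms)
            text, forms = None, []
        elif line.startswith(prefix):
            text = line[len(prefix):].strip()
        elif not line.startswith("#"):
            cols = line.split("\t") if "\t" in line else line.split()
            # Keep plain integer IDs only; skip ranges such as "1-2".
            if cols and cols[0].isdigit() and len(cols) > 1:
                forms.append(cols[1])
    # Flush the last block if the input does not end with a blank line.
    if text is not None or forms:
        yield text if text is not None else " ".join(forms)


# First block: shortened from the example above. Second block: invented.
sample = """\
# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1\tAviator\t_\tPROPN\t_\t_\t0\troot\t_\t_
2\t,\t_\tPUNCT\t_\t_\t1\tpunct\t_\t_

1\tBonjour\t_\tINTJ\t_\t_\t0\troot\t_\t_
2\t.\t_\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""

texts = list(iter_sentence_texts(sample.splitlines()))
for t in texts:
    print(t)
```

The fallback output for the second block is "Bonjour ." with a space before the period, which illustrates why a stored `# sentence-text:` line is more useful than reconstructing text from tokens.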

jeisner commented 8 years ago

Thanks very much! -cheers, jason

dan-zeman commented 8 years ago

@jeanm: Your comment is a separate issue; I am moving it to #273 and closing this one.

yoavg commented 8 years ago

I am sorry to reopen a closed issue; I wanted to comment while it was still open, but other things took precedence.

I think @jeisner raised an important issue, and we may want to consider curating and providing standardized unannotated corpora for the different languages. Many researchers are moving to methods that use unannotated data (and sometimes rely on such data being available), so if we want to compare results accurately, we should allow people to train on the exact same data.

Ideally, each treebank provider would also provide a large corpus matching the treebank sentences in genre and style. Of course, this may not be possible due to copyright issues, in which case we could default to Wikipedia or some other source and note the discrepancy. The important thing is to provide a large enough, standard (in terms of the included sentences and the pre-processing steps) set of unannotated data for use in "official" parser evaluations.

fginter commented 8 years ago

Hi. Agreed. In an attempt to do something about this, we have launched a language recognition run on the CommonCrawl data and hope to gather plenty of web data for most of the UD languages. Of course, that does not fulfill the "must be similar to the treebank" requirement, but hopefully it will eventually fulfill the "is big enough" requirement.

jeisner commented 8 years ago

That's great - thanks so much!!

dan-zeman commented 7 years ago

Can this issue be closed again?

fginter commented 7 years ago

This should happen as part of the CoNLL-ST, and the data will be available in the spring.

fginter commented 7 years ago

https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1989