Closed jeisner closed 7 years ago
@jeisner: There is no such official collection, but at least the W2C dataset has data for the majority of UD languages. Some of these corpora might not be large enough to induce embeddings, but many will be.
Filip
I noticed that UD_French has a comment above each annotated sentence with the unannotated text. For example:
# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1 Aviator _ PROPN _ _ 0 root _ _
2 , _ PUNCT _ _ 1 punct _ _
3 un _ DET _ _ 4 det _ _
4 film _ NOUN _ _ 1 appos _ _
5 sur _ ADP _ _ 7 case _ _
6 la _ DET _ _ 7 det _ _
7 vie _ NOUN _ _ 4 nmod _ _
8 de _ ADP _ _ 9 case _ _
9 Hughes _ PROPN _ _ 7 nmod _ _
10 . _ PUNCT _ _ 1 punct _ _
Many other treebanks have some sort of sentence id in the comments too, but they all use different formats. It would be nice if these conventions could be standardized, perhaps by specifying an optional "# sentence-text: <unannotated text>" line before each annotated sentence.
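To illustrate why a standardized comment would be convenient, here is a minimal Python sketch of how a consumer could recover the unannotated text from such comments. It assumes the "# sentence-text:" key shown in the UD_French example above and blank lines between sentences; the function name read_sentences and the SAMPLE variable are hypothetical, and other treebanks may use different keys. Columns are split on tabs when present and on whitespace otherwise, since the example above is shown space-separated.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Sample taken from the UD_French example quoted in this thread.
SAMPLE = """\
# sentid: fr-ud-dev_00001
# sentence-text: Aviator, un film sur la vie de Hughes.
1 Aviator _ PROPN _ _ 0 root _ _
2 , _ PUNCT _ _ 1 punct _ _
3 un _ DET _ _ 4 det _ _
4 film _ NOUN _ _ 1 appos _ _
5 sur _ ADP _ _ 7 case _ _
6 la _ DET _ _ 7 det _ _
7 vie _ NOUN _ _ 4 nmod _ _
8 de _ ADP _ _ 9 case _ _
9 Hughes _ PROPN _ _ 7 nmod _ _
10 . _ PUNCT _ _ 1 punct _ _
"""


def read_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], List[str]]]:
    """Yield (sentence_text, word_forms) for each annotated sentence.

    sentence_text is None when no "# sentence-text:" comment precedes
    the sentence (the key name is an assumption based on UD_French).
    """
    text: Optional[str] = None
    forms: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line ends the current sentence, if any.
            if forms:
                yield text, forms
            text, forms = None, []
        elif line.startswith("#"):
            # Comment lines look like "# key: value".
            key, sep, value = line[1:].partition(":")
            if sep and key.strip() == "sentence-text":
                text = value.strip()
        else:
            # Token line: the second column holds the word form.
            cols = line.split("\t") if "\t" in line else line.split()
            forms.append(cols[1])
    # Emit the last sentence if the input does not end with a blank line.
    if forms:
        yield text, forms


if __name__ == "__main__":
    for sentence_text, word_forms in read_sentences(SAMPLE.splitlines()):
        print(sentence_text)
        print(word_forms)
```

With a single agreed-upon key, the same few lines would work across all treebanks; without one, each treebank needs its own special case.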
Thanks very much! -cheers, jason
@jeanm : Your comment is a separate issue, moving to #273 and closing this one.
I am sorry to re-open a closed issue. I wanted to comment while it was still open, but somehow other things took precedence.
I think @jeisner raised an important issue, and that we may want to consider curating and providing standardized unannotated corpora for the different languages. Many researchers are moving to methods that involve the use of unannotated data (and sometimes even rely on such data being available), so if we want to be able to compare results accurately, we should allow people to train on the exact same data.
Ideally, each of the treebank providers would also provide a large corpus matching the treebank sentences in genre and style. Of course, this may not be possible due to copyright issues, in which case we could default to Wikipedia or some other source and note the discrepancy. The important thing is providing a large enough, standard (in terms of the included sentences and the pre-processing steps) set of unannotated data for use in "official" parser evaluations.
Hi. Agreed. In an attempt to do something about this, we have launched a language recognition run on the CommonCrawl data and hope to gather plenty of web data for most of the UD languages. Of course that does not fulfill the "must be similar to the treebank" requirement, but hopefully will eventually fulfill the "is big enough" requirement.
That's great - thanks so much!!
Can this issue be closed again?
This should happen as a part of the CoNLL-ST and the data will be available in the spring.
Suppose someone wanted additional unannotated text in each of the UD languages -- e.g., to compute word embeddings, or for some other purpose. Is there a standard, curated collection of this sort?
Thanks! -jason