sentence genres

The sentence IDs in this corpus do not indicate the genre from which the text was drawn, which would be useful to have. I did some digging and found:

The ACL 2013 paper describes version 1.0 of the corpus, which contains 2200 train/800 dev/1000 test sentences in German. According to the paper, these come from two genres, Reviews and News (the news data from the TIGER Treebank, the reviews presumably from Google).
The subsequent 2.0 release has more data: 14118 train/799 dev/977 test sentences. Some of the sentences in 1.0 turned out to be duplicated across splits, which was fixed for 2.0. I could not find in the READMEs any indication of where the new German sentences came from.

Based on the above and the mappings in not-to-release/ud-tiger-mapping.txt, it appears that the genres are:

train: Reviews=s1-s1500, News=s1501-s2200, Web=s2201-s14118
- Searching for a selection of sentences in the s2201-s14118 range (i.e. the new ones in version 2.0) suggests that they are from Wikipedia and other websites
dev: Reviews=s1-s500, News=s501-s799
test: Reviews=s1-s301, News=s302-s977
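
In case it helps anyone who needs genre labels before this is documented, here is a minimal Python sketch of the ranges above as a lookup. The names `GENRE_RANGES` and `genre_of` are made up for illustration, and the code assumes each sentence ID ends in `s` followed by a number (e.g. `s1501`). The ranges themselves are only my inference from the paper and the mapping file, not an official annotation.

```python
import re

# Inclusive sentence-number ranges per split, as inferred in this issue.
# These are a best guess, not something confirmed by the corpus maintainers.
GENRE_RANGES = {
    "train": [(1, 1500, "Reviews"), (1501, 2200, "News"), (2201, 14118, "Web")],
    "dev": [(1, 500, "Reviews"), (501, 799, "News")],
    "test": [(1, 301, "Reviews"), (302, 977, "News")],
}


def genre_of(split, sent_id):
    """Return the inferred genre for a sentence ID, or None if out of range.

    Assumes the ID ends in 's<number>', e.g. 's1501' (hypothetical format check).
    """
    match = re.search(r"s(\d+)$", sent_id)
    if match is None:
        raise ValueError(f"unrecognized sentence ID: {sent_id!r}")
    number = int(match.group(1))
    for low, high, genre in GENRE_RANGES[split]:
        if low <= number <= high:
            return genre
    return None


if __name__ == "__main__":
    for split, sent_id in [("train", "s1500"), ("train", "s2201"), ("test", "s302")]:
        print(split, sent_id, genre_of(split, sent_id))
```

Returning `None` for an ID outside the listed ranges, rather than guessing, keeps the lookup honest about what the ranges actually cover.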

Should this go in the README?

UniversalDependencies / UD_German-GSD