UniversalDependencies / UD_Swedish-Talbanken

Swedish data
Other

Raw text files #2

Closed fredrijo closed 6 years ago

fredrijo commented 8 years ago

Is it possible to add the raw text files to the git repo? They're extremely valuable, in particular to evaluate a tokenizer for Swedish and also in seeing an NLP pipeline's actual performance from raw text to POS tags/dependency trees.

Feel free to reach out if I can be of assistance with mapping text files to the train/test/dev split, or in any other way help in making the text files available.

And thanks for the fantastic work you're doing in creating this awesome NLP resource!

EmilStenstrom commented 6 years ago

@fredrijo If I understand you correctly, you would like to retrieve the original text files from the conllu files? That's possible to do by reading through the files and concatenating the words together with a space in between. There's a special SpaceAfter=No marker in the files, for the case where the original sentence didn't have a space after that specific token.

dan-zeman commented 6 years ago

And BTW the official UD releases (http://hdl.handle.net/11234/1-2837) contain the reconstructed txt files created the way @EmilStenstrom mentioned. These are generated on the fly at release time, so they are not stored in the GitHub repository. Maybe we could copy them automatically to the master branch next time.

fredrijo commented 6 years ago

@EmilStenstrom The original reasoning for adding the raw text files was that not all aspects of the formatting are captured by the SpaceAfter marker, e.g. tabs, newlines, multiple newlines, etc. These are useful for evaluating tokenization in e.g. headings, list contexts, etc. I don't believe this can be reconstructed from SpaceAfter alone.

I think I heard a rumour once about not only adding the SpaceAfter boolean flag, but also the content of the whitespace (a string of spaces, tabs, newlines, etc.). Is this something you are considering?

@dan-zeman I think adding the reconstructed text to master automatically would be valuable. Even better would be to have the original files IMO, but that is only a nice-to-have.

martinpopel commented 6 years ago

If you want to encode additional info about whitespace directly at the token level, please use the UDPipe format, see also https://github.com/UniversalDependencies/docs/issues/332. In UD guidelines, it has been decided that the text sentence-level attribute is a better (simpler) option, though it cannot encode e.g. whitespace between sentences.
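As a rough sketch of how such token-level whitespace could be read back, the Python below decodes a `SpacesAfter` value from the MISC column. The escape table (`\s`, `\t`, `\r`, `\n`, `\p` for a pipe, `\\` for a backslash) is written from memory of the UDPipe documentation and should be checked against it; the function names `decode_spaces` and `whitespace_after` are invented for this example.

```python
# Assumed escape sequences for SpacesAfter/SpacesBefore values (verify against
# the UDPipe documentation before relying on this table).
_ESCAPES = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}


def decode_spaces(value: str) -> str:
    """Turn an escaped value such as r'\\n\\n' into the actual whitespace."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # Unknown escapes fall back to the escaped character itself.
            out.append(_ESCAPES.get(value[i + 1], value[i + 1]))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def whitespace_after(misc: str) -> str:
    """Whitespace following a token, given its MISC column (sketch)."""
    fields = dict(item.split("=", 1) for item in misc.split("|") if "=" in item)
    if "SpacesAfter" in fields:
        return decode_spaces(fields["SpacesAfter"])
    # Fall back to the plain UD convention: one space unless SpaceAfter=No.
    return "" if fields.get("SpaceAfter") == "No" else " "


print(repr(whitespace_after(r"SpacesAfter=\n\n")))
print(repr(whitespace_after("SpaceAfter=No")))
```

With this kind of annotation, a paragraph break after a heading can be told apart from an ordinary space, which is the information a yes/no SpaceAfter flag cannot carry.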

Note that the raw text for CoNLL2018 is hard-wrapped at 80 characters per line, if I remember correctly. This was good for the shared task, but it may not be good for your purposes.

jnivre commented 6 years ago

I can see the usefulness of this, but please note that, in the specific case of Swedish Talbanken, the original text is not available in digital form. Hence, the "raw text" used in the shared task is just the text that is reconstructed from the SpaceAfter comments + one space between sentences.

fredrijo commented 6 years ago

Thank you for answering this. In particular, using SpacesAfter/SpacesBefore should remove the need for raw texts.

(Closing the issue as all my questions are answered)