clarin-eric / DOGlib

Digital Object Gate
GNU General Public License v3.0

odd value in example #6

Closed kosarko closed 3 years ago

kosarko commented 3 years ago

https://github.com/clarin-eric/DOGlib/blame/4c5062dce1354c4abb405fa3de43d4f713471eed/README.md#L46-L49

    {
      "filename": "http://radio.makon.cz/",
      "pid": "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3698/Etalon.tgz?sequence=1"
    }

The `filename` value here doesn't seem right.

kosarko commented 3 years ago

It does appear on this one: http://hdl.handle.net/11234/1-3422, so maybe that record was used as the example previously and the two got mixed up?

dietervu commented 3 years ago

The output in the README file is indeed incorrect.

The output with the current version is as follows. @MichalGawor, can you correct the README accordingly?

    {'ref_files': [
        {'filename': '', 'pid': 'https://wiki.korpus.cz/doku.php/en:cnk:etalon'},
        {'filename': '', 'pid': 'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-3698/Etalon.tgz?sequence=1'}
     ],
     'description': 'Etalon is a manually annotated corpus of contemporary Czech. The corpus contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as SYN2020 of the Czech National Corpus. The corpus includes fiction (ca 24%), professional and scientific literature (ca 40%) and newspapers (ca 36%). \r\n\r\nThe corpus is provided in a vertical format, where sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: syntactic word, lemma, sublemma, tag and verbtag. The texts are shuffled in random chunks of 100 words at maximum (respecting sentence boundaries).',
     'license': 'http://creativecommons.org/licenses/by-nc-sa/4.0/'}
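
For anyone checking README examples against real output, a check along these lines might catch this kind of mix-up. It is only a sketch: `url_like_filenames` is a hypothetical helper, not part of DOGlib, and wrapping the README snippet in a `ref_files` list is an assumption based on the structure of the current output shown above.

```python
from urllib.parse import urlparse


def url_like_filenames(result):
    """Return the ref_files entries whose 'filename' parses as an http(s) URL.

    Hypothetical helper for illustration; it only inspects a plain dict shaped
    like the output quoted in this thread and does not call DOGlib.
    """
    flagged = []
    for entry in result.get("ref_files", []):
        parsed = urlparse(entry.get("filename", ""))
        if parsed.scheme in ("http", "https") and parsed.netloc:
            flagged.append(entry)
    return flagged


ETALON_TGZ = (
    "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/"
    "11234/1-3698/Etalon.tgz?sequence=1"
)

# The entry quoted from the README, wrapped in 'ref_files' (an assumption).
readme_example = {
    "ref_files": [
        {"filename": "http://radio.makon.cz/", "pid": ETALON_TGZ},
    ]
}

# The ref_files part of the output reported for the current version.
current_output = {
    "ref_files": [
        {"filename": "", "pid": "https://wiki.korpus.cz/doku.php/en:cnk:etalon"},
        {"filename": "", "pid": ETALON_TGZ},
    ]
}

flagged_readme = url_like_filenames(readme_example)
flagged_current = url_like_filenames(current_output)
```

With the data quoted in this issue, the README entry is flagged because its `filename` is a URL, while the current output has empty `filename` values and nothing is flagged.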