clarin-eric / resource-families-issues


English-Czech Corpus from Wikipedia #53

Closed jakoble closed 5 years ago

jakoble commented 5 years ago

http://hdl.handle.net/11234/1-1932

stranak commented 5 years ago

@jakoble The file description says "Text file with Czech and English sentences". Should we say elsewhere in Description that it is "sentence aligned"? It is what it says, just Czech and English sentences ...

jakoble commented 5 years ago

Hi Pavel,

My idea was that such information should be presented as clearly as possible. I.e., does the description "Text file with Czech and English sentences" entail alignment or merely imply it? So the metadata should overtly spell out the type of "annotation" that the corpus displays.

Best, Jakob

asist. Jakob Lenardič

Oddelek za prevajalstvo / Department of Translation

Filozofska fakulteta / Faculty of Arts, Univerza v Ljubljani

Aškerčeva cesta 2, SI-1000 Ljubljana, Slovenija / Slovenia T.: 241-1143 Jakob.Lenardic@ff.uni-lj.si, www.ff.uni-lj.si



stranak commented 5 years ago

@jakoble I understand your view. However, I do think the description here, as well as in #52 and elsewhere, is "overt", i.e. it is not secret, isn't hiding anything, etc. You can argue it could be even more explicit. But I would prefer to allow some freedom here. We could try to invent super-specific guidelines for describing the alignment of parallel corpora in metadata, etc., but I don't think that is the way to go.

On our website you can see we say it is a "Text file with Czech and English sentences" and there is also a preview of the file there. You can see it is just insentience//cz sentence//en sentence ...

It seems simpler to me to just download the file and see the details than to invent super-specific metadata for it. See also #52. I don't want to be forced by some guideline to describe the format of those files, etc. I also don't want to force that work on the submitters. I honestly don't find their descriptions confusing, and none of the hundreds of users who downloaded the data has complained.

stranak commented 5 years ago

Added "Sentence " to the beginning of the description. I don't know, honestly, how true it is. I would say "mostly". There might be some m:n alignments there. The Description links to a paper with more details.

twagoo commented 5 years ago

@stranak I think your points are not unfair, but considering the VLO as a resource discovery tool, it makes more sense here to look at metadata through the discoverability/retrieval lens than that of 'information completeness'. Don't assume that people want to find out about your resource specifically, but rather that people have certain characteristics in mind and then try to articulate those in order to narrow down all the things available. If you want your resources to be retrieved this way, that is.

stranak commented 5 years ago

I actually think it would be nice to have facets for corpus → parallel → alignment unit, or something like that. But I would say, "baby steps". Hopefully we will soon have agreement on all records having author and title, preferably in English :-) When that happens we can talk about corpora having records as corpora, not each file in the corpus separately ... and some day maybe we will have that component for corpora and agree we all use it. 😉

My 2 ¢.

twagoo commented 5 years ago

Jan Odijk identified a need for deeper taxonomies for discovery of software, which in many ways is quite a similar type of use case. And we have one of the main ingredients, namely conditional facets, on the roadmap, more or less. So something like this might be around fewer corners than you think 🤞

stranak commented 5 years ago

It is all a question of a sensible compromise. I am in general not on the side of "let's make super complicated profiles with dozens of obligatory facets". I am rather skeptical of it, in fact. I believe there is a deep reason why we don't use "web catalogues" with many facets any more, like we used to, and we use Google instead.

That being said, I am still open to that corpus component, if it adds 2 clicks and <10 sec. to the submission workflow :-)