MSherif / mlode

Automatically exported from code.google.com/p/mlode

Add dataset IDS #50

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Dataset can be found here:

http://lingweb.eva.mpg.de/ids/

I also have it in XML format with links to a central "concepticon". 

Original issue reported on code.google.com by bamboofo...@gmail.com on 6 Aug 2012 at 1:17

GoogleCodeExporter commented 9 years ago

Original comment by kur...@googlemail.com on 14 Aug 2012 at 10:50

GoogleCodeExporter commented 9 years ago
The dataset looks promising to me, but I'm no linguist. The problem I see is 
linking the different languages to ISO codes or other resources. In the format 
available on the website, we only have the language names to work with. Wiktionary 
links would also be nice, but the problem remains the same.

Is there any more data in the XML file?
What's a "concepticon"?

Original comment by der.brue...@googlemail.com on 14 Aug 2012 at 12:04

GoogleCodeExporter commented 9 years ago
There's a project called "LEGO" 

http://lego.linguistlist.org/

that converted the IDS wordlists into a "LIFT" XML representation

http://code.google.com/p/lift-standard/

and added the language code and other metadata.

The words in the wordlists are linked to a central "concepticon", see:

http://www.aclweb.org/anthology/W/W10/W10-2101.pdf

that is in RDF.

Original comment by bamboofo...@gmail.com on 14 Aug 2012 at 12:13
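
For illustration, a minimal sketch of pulling word forms and their language codes out of a LIFT file. The sample fragment, the entry IDs and the `read_entries` helper are hypothetical; only the element names (`entry`, `lexical-unit`, `form`, `text`, `sense`, `gloss`) follow the general LIFT layout, and the actual LEGO files may carry more or differently structured metadata, including the concepticon links.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written LIFT fragment; real LEGO/IDS files may differ.
SAMPLE_LIFT = """
<lift version="0.13">
  <entry id="ids-1-100-deu">
    <lexical-unit>
      <form lang="deu"><text>Welt</text></form>
    </lexical-unit>
    <sense id="ids-1-100-deu-s1">
      <gloss lang="eng"><text>world</text></gloss>
    </sense>
  </entry>
</lift>
"""


def read_entries(lift_xml):
    """Return one dict per <entry> with its id, language code, word and gloss."""
    root = ET.fromstring(lift_xml)
    rows = []
    for entry in root.iter("entry"):
        form = entry.find("lexical-unit/form")
        if form is None:
            # Skip entries that have no lexical form at all.
            continue
        rows.append({
            "id": entry.get("id"),
            "lang": form.get("lang"),
            "word": form.findtext("text"),
            # findtext returns None when the entry has no gloss.
            "gloss": entry.findtext("sense/gloss/text"),
        })
    return rows


for row in read_entries(SAMPLE_LIFT):
    print(row["lang"], row["word"], row["gloss"])
```

The `lang` attribute on `form` is what would address the language-identification problem raised earlier in the thread, assuming the converted files populate it with ISO 639-3 codes.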

GoogleCodeExporter commented 9 years ago
Sounds good. Could you upload the XML data or send it to me?

Original comment by der.brue...@googlemail.com on 14 Aug 2012 at 12:36

GoogleCodeExporter commented 9 years ago
Upload where? Just send me an email and we can work out how to get you the data.

Original comment by bamboofo...@gmail.com on 14 Aug 2012 at 12:45

GoogleCodeExporter commented 9 years ago
I have worked on the lingtyp ontology we talked about during the workshop in 
March. As discussed with Steve and Martin B., the idea is to treat typological 
features as properties. I thus added WALS, IDS, Numerals and ASJP to an 
ontology. This is nowhere near finished, but you might find it 
interesting:

http://galoes.org/ontologies/lingtyp-full.owl

The bare thing without IDS etc can be found at
http://galoes.org/ontologies/lingtyp.owl

The idea would obviously be to import lingtyp.owl into ids.owl etc. 

I suppose there is some duplication with the existing concepticon. 

Original comment by sebastia...@googlemail.com on 14 Aug 2012 at 3:04

GoogleCodeExporter commented 9 years ago
LEGO's licence is cc-nc-nd, so an RDF conversion (being a derivative) is out of 
the question without specific permission allowing it.

Original comment by joregan on 14 Aug 2012 at 5:09

GoogleCodeExporter commented 9 years ago
But we're not working with LEGO wordlists since they aren't published. We're 
working with the IDS wordlists from MPI-EVA.

Original comment by bamboofo...@gmail.com on 14 Aug 2012 at 5:14

GoogleCodeExporter commented 9 years ago
cc-nc-nd does not preclude conversion into other formats, if I remember 
correctly. From a post on [open-linguistics]:

https://creativecommons.org/licenses/by-nd/3.0/legalcode does include 
"The above rights may be exercised in all media and formats whether
now known or hereafter devised. The above rights include the right to
make such modifications as are technically necessary to exercise the
rights in other media and formats, but otherwise you have no rights to
make Adaptations."

Original comment by sebastia...@googlemail.com on 15 Aug 2012 at 8:19

GoogleCodeExporter commented 9 years ago
I will have a meeting with Bernard Comrie, director of MPI-EVA and responsible 
for IDS, later this month regarding license issues. Since IDS is currently 
available as HTML on the servers of MPI-EVA, there should be no problem with 
serving RDF as well. As far as reuse of the data is concerned, I am currently 
not in a position to foresee the outcome of this meeting. 
In order to prepare the meeting, could you give the following information:

- should the dump be hosted by MPI-EVA or elsewhere?
- what kind of applications using IDS data do you foresee?
- what kind of license would you recommend, and why?
- how would updates be managed?

I have certain ideas about some of those questions, but it would be better for 
purposes of negotiation if the answers came from an outside body.

Original comment by sebastia...@googlemail.com on 15 Aug 2012 at 8:24

GoogleCodeExporter commented 9 years ago
The IDS data is freely downloadable, but you're right, there's no
explicit license on the site. However, LEGO used it, enriched it with
metadata, and put it in XML. Arguably it's easier to extract it from
that XML LIFT format than it is to download it all and parse it from
the site. The enrichment links the words in IDS to a centralized
"concepticon", as I mentioned above, that we do have permission from
Jeff Good to use in LLOD.

Additionally, if/when LEGO releases the other 2700 wordlists, since
they are also in XML and linked to the concepticon, then any work we
do extracting the IDS from this LIFT standard could then "easily" be
used to convert the LEGO wordlists to RDF.

One thing that might be an issue is that I heard there's possibly even
more up-to-date IDS data than what is on the website. I pinged
Hans-Joerg but haven't received a response.

If we are allowed to convert the IDS data to RDF, I think we should
offer to give it back to their project so they can also let users
download the RDF.

Original comment by bamboofo...@gmail.com on 15 Aug 2012 at 8:41

GoogleCodeExporter commented 9 years ago
> Subj:IDS license
>
> Sebastian:
>
> On the basis of the responses I got on this (which were not all mutually
> consistent), I have decided that we should go with CC-BY-SA, which was
> one of the options envisaged by you.
>
> Bernard

Original comment by sebastia...@googlemail.com on 28 Aug 2012 at 9:08

GoogleCodeExporter commented 9 years ago
HJ Bibiko has given me a dump of the IDS db, which I forwarded to Martin 
Brümmer. 

Original comment by sebastia...@googlemail.com on 29 Aug 2012 at 11:38

GoogleCodeExporter commented 9 years ago
First conversion is done, CKAN entry can be found here:

http://thedatahub.org/dataset/ids_dictionary

Diagram of the model can be found here: 
https://dl.dropbox.com/u/65483422/ids-model-diagram.png

Opening new issue for validation and further interlinking.

Original comment by der.brue...@googlemail.com on 3 Sep 2012 at 12:43

GoogleCodeExporter commented 9 years ago
Correction: correct CKAN entry is here: http://thedatahub.org/dataset/ids. 

Issue for interlinking and refinement: 
http://code.google.com/p/mlode/issues/detail?id=94&colspec=ID%20Type%20Status%20Priority%20Owner%20Dataset%20Summary%20Modified%20Reporter

Original comment by der.brue...@googlemail.com on 3 Sep 2012 at 1:03

GoogleCodeExporter commented 9 years ago
Can ids:XYtranslation be complemented by dcterms:relation xy.wiktionary or 
xy.wordnet? The vocabulary is basic, so most links should work out of the box.

Instead of dcterms:relation one could probably also use some lemon predicate 
(deferring to JMcC).

Original comment by sebastia...@googlemail.com on 3 Sep 2012 at 1:31
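
A rough sketch of what such a link could look like as an N-Triples line, for illustration only. The subject URI under `example.org`, the helper names and the Wiktionary URL pattern are assumptions, not the URIs used in the actual IDS conversion; only the `dcterms:relation` property URI is the standard Dublin Core one.

```python
from urllib.parse import quote

# Standard Dublin Core Terms property URI.
DCTERMS_RELATION = "http://purl.org/dc/terms/relation"


def wiktionary_url(lang_code, word):
    """Build a Wiktionary page URL (assumed pattern: <lang>.wiktionary.org/wiki/<word>)."""
    return "https://{}.wiktionary.org/wiki/{}".format(lang_code, quote(word, safe=""))


def relation_triple(subject_uri, lang_code, word):
    """Return one N-Triples line linking a translation resource to Wiktionary."""
    return "<{}> <{}> <{}> .".format(
        subject_uri, DCTERMS_RELATION, wiktionary_url(lang_code, word)
    )


# Hypothetical subject URI; the real dataset uses its own namespace.
print(relation_triple("http://example.org/ids/entry/1-100-de", "de", "Welt"))
```

Percent-encoding the word keeps the object a valid IRI even for entries with non-ASCII characters or spaces; whether the target page exists still has to be checked separately.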

GoogleCodeExporter commented 9 years ago
Some of the translations contain two words, words in brackets etc. The basic 
conversion was done with d2rq, so further links will be added by a script that 
validates them before they go into the dataset. Please continue the 
refinement and interlinking discussion here:
http://code.google.com/p/mlode/issues/detail?id=94&colspec=ID%20Type%20Status%20Priority%20Owner%20Dataset%20Summary%20Modified%20Reporter

Original comment by der.brue...@googlemail.com on 3 Sep 2012 at 1:35
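
A minimal sketch of the kind of cleanup such a script might do before trying any link, not the script actually used. It assumes that bracketed material is commentary, that commas, semicolons and slashes separate alternatives, and that multi-word expressions are skipped because they are unlikely to have a single target page; the function name and these rules are illustrative only.

```python
import re

# Matches "(...)" and "[...]" including their contents.
BRACKETS = re.compile(r"\([^)]*\)|\[[^\]]*\]")


def link_candidates(translation):
    """Return the single-word alternatives in a translation string."""
    without_brackets = BRACKETS.sub(" ", translation)
    candidates = []
    for part in re.split(r"[,;/]", without_brackets):
        cleaned = " ".join(part.split())  # collapse runs of whitespace
        if cleaned and " " not in cleaned:
            candidates.append(cleaned)
    return candidates


print(link_candidates("Welt (die), Erde"))
print(link_candidates("to go away"))
```

Each candidate would still need to be checked against the target resource before a link is written to the dataset.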