freme-project / technical-discussion

This repository is used for technical discussions.

Creation of multilingual datasets #38

Closed jnehring closed 9 years ago

jnehring commented 9 years ago

I assume that when we create new multilingual datasets, we create two types of datasets:

a) Datasets to be used by e-Link. These can contain any linked data.
b) Datasets to be used by FREME-NER. These need to be in the SKOS format.

My question: Is it possible to create the datasets in a way that makes them compatible with both a) and b)? I think that any type of linked data (use case a)) is compatible with b) as long as it contains the skos:prefLabel annotation.

@freme-project/infai What do you think? Can we define a best practice for creating datasets so that they are compatible with both e-Link and FREME NER?

der-bruemmer commented 9 years ago

I think it shouldn't be too hard. The question for the service developers (@nilesh-c) is what they are able to consume and which properties they need. These properties would then have to be documented somewhere as essential and used when converting future datasets.

So what kind of data is already being used in e-Link and FREME-NER? Which vocabularies are used, which properties?

What would also be needed is a fallback in case the property is absent in legacy datasets. For example, most NIF NER datasets contain DBpedia links for disambiguation. However, DBpedia does not use SKOS. One would have to define that, in the absence of skos:prefLabel, a property such as rdfs:label, dbpedia:name or foaf:name should be used instead.
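To make the fallback idea concrete, here is a minimal sketch in Python. The priority order and the function name `pick_label` are assumptions for illustration only; just the property names come from the discussion, and the actual order would have to be agreed on.

```python
# Hypothetical priority order: skos:prefLabel first, then the fallbacks
# mentioned above. The order itself is an assumption, not a decision.
LABEL_PRIORITY = ["skos:prefLabel", "rdfs:label", "dbpedia:name", "foaf:name"]


def pick_label(properties, priority=LABEL_PRIORITY):
    """Return (property, value) for the first label property present, else None."""
    for prop in priority:
        value = properties.get(prop)
        if value:
            return prop, value
    return None


# A legacy entity without any SKOS label (made-up example data).
legacy_entity = {"rdfs:label": "Berlin", "foaf:name": "Berlin, Germany"}
print(pick_label(legacy_entity))
```

An entity that carries none of the listed properties yields `None`, so a conversion script could report it instead of silently dropping it.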

A possible way to go would be to quickly and informally specify the needs in a kind of reference card or cheat sheet, so development and conversion can continue.

fsasaki commented 9 years ago

These datasets are available at http://api.freme-project.eu/0.2/e-entity/freme-ner/datasets/

[
  { "Name": "dbpedia", "TotalEntities": 7000000, "CreationTime": 1435966058019 },
  { "Name": "tildeset", "TotalEntities": 4, "CreationTime": 1436951248448 },
  { "Name": "onld", "TotalEntities": 2106, "CreationTime": 1436954427867 },
  { "Name": "geopolitical", "TotalEntities": 309, "CreationTime": 1436955544109 }
]

@nilesh-c can provide more information.
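For anyone who wants to inspect this listing programmatically, here is a small sketch that parses the response quoted above. It assumes that CreationTime is a Unix timestamp in milliseconds, which the values suggest but which is not confirmed here; the function name `summarize` is made up for the example.

```python
import json
from datetime import datetime, timezone

# The dataset listing exactly as quoted in this comment.
LISTING = """[
  {"Name": "dbpedia", "TotalEntities": 7000000, "CreationTime": 1435966058019},
  {"Name": "tildeset", "TotalEntities": 4, "CreationTime": 1436951248448},
  {"Name": "onld", "TotalEntities": 2106, "CreationTime": 1436954427867},
  {"Name": "geopolitical", "TotalEntities": 309, "CreationTime": 1436955544109}
]"""


def summarize(listing_json):
    """Return (name, total entities, creation date) tuples for each dataset."""
    rows = []
    for entry in json.loads(listing_json):
        # Assumption: CreationTime is milliseconds since the Unix epoch (UTC).
        created = datetime.fromtimestamp(entry["CreationTime"] / 1000, tz=timezone.utc)
        rows.append((entry["Name"], entry["TotalEntities"], created.date().isoformat()))
    return rows


for name, total, created in summarize(LISTING):
    print(f"{name:<14}{total:>9}  {created}")
```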

m1ci commented 9 years ago

b) Datasets to be used by FREME-NER. These need to be in the SKOS format.

Having the converted datasets in SKOS is a recommendation. However, many datasets do not use SKOS properties (but FOAF, RDFS, etc.), and transforming those properties to SKOS is time-consuming and inefficient. For this reason, we introduced an optional properties parameter in FREME NER, which can be used to identify the properties that convey the entities' names. See the parameters section and the description of the properties parameter. By default, skos:prefLabel, skos:altLabel and rdfs:label are considered.
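As a rough illustration of what such a properties parameter means for a dataset, the following Python sketch collects entity names from a list of triples. It is not the FREME NER implementation: the function `collect_names`, the triple representation and the example entity are assumptions. Only the three default properties are taken from the description above.

```python
from collections import defaultdict

# Defaults named above: skos:prefLabel, skos:altLabel and rdfs:label.
DEFAULT_LABEL_PROPERTIES = (
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#altLabel",
    "http://www.w3.org/2000/01/rdf-schema#label",
)
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


def collect_names(triples, properties=DEFAULT_LABEL_PROPERTIES):
    """Map each subject to the literals found under the given label properties."""
    wanted = set(properties)
    names = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate in wanted:
            names[subject].append(obj)
    return dict(names)


# Made-up dataset that only uses foaf:name.
triples = [
    ("http://example.org/person/1", FOAF_NAME, "Ada Lovelace"),
    ("http://example.org/person/1", "http://xmlns.com/foaf/0.1/age", "36"),
]

print(collect_names(triples))                          # defaults find no names
print(collect_names(triples, properties=[FOAF_NAME]))  # explicit property finds one
```

With the defaults, a FOAF-only dataset yields no names at all, which is the situation the optional parameter is meant to avoid without converting the data to SKOS first.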

My question: Is it possible to create the datasets in a way that they are compatible to both a) and b)? I think that any type of linked data (use case a)) is compatible with b) as long as it contains the skos:prefLabel annotation.

  • 1) E-Link consumes RDF datasets (described using different vocabularies: RDFS, SKOS, FOAF, etc.) through a SPARQL endpoint. So any dataset that is offered through a SPARQL endpoint can be used for enrichment in the e-Link service.
  • 2) E-Entity also uses RDF datasets (described using different vocabularies) to perform entity linking against them. The information in these datasets can be described using the SKOS vocabulary, but also FOAF, RDFS, etc., or proprietary properties such as dbpedia:name.

So the answer is: any RDF can be consumed by e-Link and e-Entity. The only requirement is that e-Link consumes data available via SPARQL endpoints, while e-Entity (at the moment) consumes data available via dumps. Enabling e-Entity to consume data from a SPARQL endpoint is on the TODO list for FREME 0.3.
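To sketch what "available via a SPARQL endpoint" implies in practice, the snippet below builds a query that fetches entity labels and encodes it as a GET request in the style of the SPARQL 1.1 Protocol (the `query` parameter). The endpoint URL is a placeholder and the helper names are hypothetical; nothing here reflects how e-Link builds its own queries.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the SPARQL endpoint that hosts the dataset.
ENDPOINT = "http://example.org/sparql"

LABEL_PROPERTIES = [
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2000/01/rdf-schema#label",
]


def build_label_query(properties, limit=10):
    """Build a SELECT query returning entities and their labels."""
    values = " ".join(f"<{p}>" for p in properties)
    return (
        "SELECT ?entity ?label WHERE { "
        f"VALUES ?property {{ {values} }} "
        "?entity ?property ?label . "
        f"}} LIMIT {limit}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request URL (nothing is sent)."""
    return endpoint + "?" + urlencode({"query": query})


query = build_label_query(LABEL_PROPERTIES)
print(query)
print(build_request_url(ENDPOINT, query))
```

The same label properties apply to a dump consumed by e-Entity; only the access path differs.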

jnehring commented 9 years ago

Enabling e-Entity to consume data from a SPARQL endpoint is on the TODO list for FREME 0.3

We decided to add only a subset of the features on this list to FREME 0.3, and this feature is not among them. We are going to add it in a future version.