freme-project / technical-discussion

This repository is used for technical discussions.

Specification of e-Entity v2.0 #24

Closed m1ci closed 9 years ago

m1ci commented 9 years ago

E-Entity spec for the 2nd prototype

A summary of the core features for the 2nd prototype.

1. Support for additional languages

English, German, French, and Spanish; if time permits, we will also include Dutch and Italian. The language parameter will be optional, with the default value set to English (en). The possible values of the parameter will be: en, de, fr, es.
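For illustration, an annotation request with a non-default language could look like this (the endpoint path reuses the our-ner scheme from the examples below; both the exact path and the data line are assumptions, not part of the spec):

POST /e-entity/our-ner/?language=de
data: plain text or NIF document to annotate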

2. Support for integration of external knowledge bases/datasets

This includes the following features:

Feat 1: Upload of a dataset: a user could upload a SKOS dataset containing entity labels.

Request:

POST /e-entity/datasets
data: RDF SKOS document

Response:

HTTP/1.1 200 OK
Location: http://api.freme-project.com/e-entity/our-ner/datasets/{dataset-id}
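For illustration, a minimal SKOS fragment that such an uploaded dataset could contain (the ex: namespace and the labels are invented for this example):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/entities/> .

ex:FremeProject a skos:Concept ;
    skos:prefLabel "FREME Project"@en ;
    skos:altLabel "FREME"@en .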
Feat 2: Linking against a dataset: a user could link entities against a previously uploaded dataset.

Request:

POST /e-entity/our-ner/?dataset={dataset-id}
data: RDF (or as part of the input param)

Response:

RDF/NIF - same as before

3. Support for additional formats

The following formats will be supported: N3, N-Triples, and RDF/XML.
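For example, the output serialization could be selected via a request parameter. A sketch, assuming a parameter named outformat (the parameter name is an assumption, not fixed by this spec):

POST /e-entity/our-ner/?outformat=n-triples
data: NIF document to annotate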

4. Integration of additional NER

We will integrate Entityclassifier.eu (supports English, German, Dutch). _Important: it can be used for non-commercial purposes only!_

jnehring commented 9 years ago

Good specification. Some comments / suggestions:

1. Design of a dataset

I suggest a dataset in FREME 0.2 consists of the following fields. In future versions more fields will be added, at least when we have user management.

- id: integer
- content (the dataset itself): string (potentially big)
- creationDate: date
- description: string (free text)
- expiration date (see below)

So I suggest that we use JSON to encode / decode datasets. This means we submit JSON to the REST API to create / update datasets and we retrieve datasets in JSON also.
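A sketch of what such a JSON representation could look like, using the fields listed above (all values are invented):

{
  "id": 1,
  "description": "Entity labels for the pilot",
  "creationDate": "2015-07-01",
  "expiration": 24,
  "content": "<the SKOS document, serialized as a string>"
}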

2. API endpoints for SKOS datasets

- POST /e-entity/datasets upload a dataset in JSON. The endpoint returns the newly created dataset in JSON.
- GET /e-entity/datasets retrieve all datasets in JSON (only IDs / metadata, not the template string so the response is not too large).
- GET /e-entity/datasets/{id} retrieves one complete dataset in JSON.
- REMOVE /e-entity/datasets/{id} delete a dataset. The endpoint returns the complete removed dataset in JSON.
- POST /e-entity/datasets/{id} update a dataset. The endpoint returns the complete updated dataset in JSON.

We will introduce user management in FREME 0.3 (or later). Until then any user can read / write any dataset.

3. Expiration time for datasets

I expect users to upload many datasets while playing around with the API, and I expect them not to delete these datasets. Datasets can be potentially large. So I suggest we introduce a parameter expiration that defaults to 24 (hours), so datasets will be deleted after one day. expiration=-1 makes a dataset permanent.
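For illustration, uploading a permanent dataset would then look like this (reusing the upload request from the spec above):

POST /e-entity/datasets?expiration=-1
data: RDF SKOS document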

4. Support for additional NIF formats

This is not part of e-Entity but part of the common codes. When you use the common codes that I wrote for e-Translation, you don't have to change a single line of code to support additional formats.

5. Question

In "Feat 2" you mention a parameter "our-ner". What does that mean? Do you want to introduce separate endpoints for dbpedia spotlight and entityclassifier.eu?

6. Question

Can we make a reasonable estimation about the size of such a SKOS dataset?

m1ci commented 9 years ago

> I suggest a dataset in FREME 0.2 consists of the following fields. In future versions more fields will be added, at least when we have user management.

> - id: integer
> - content (the dataset itself): string (potentially big)
> - creationDate: date
> - description: string (free text)
> - expiration date (see below)

SKOS means usage of RDF, so I don't understand your suggestions.

> 2. API endpoints for SKOS datasets

> POST /e-entity/datasets upload a dataset in JSON. The endpoint returns the newly created dataset in JSON.

The size of a dataset can be 1, 10, 50, 100 MB or more, so it makes no sense to return the whole dataset - just confirm its creation.

> GET /e-entity/datasets retrieve all datasets in JSON (only IDs / metadata, not the template string so the response is not too large).

If it returns only metadata, then the correct way of modelling this would be:

GET /e-entity/datasets/metadata

Also, why JSON? We work with RDF, so the metadata of the datasets can be described using the DataID ontology. See an example here: http://aksw.org/dataid.ttl#kore50-nif.ttl. So the output will be RDF.
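A minimal sketch of such a metadata record in Turtle (the DataID namespace is the one used by DBpedia; the concrete properties and values are illustrative, not part of the spec):

@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .
@prefix dct: <http://purl.org/dc/terms/> .

<http://api.freme-project.com/e-entity/datasets/1>
    a dataid:Dataset ;
    dct:description "Entity labels for the pilot" ;
    dct:issued "2015-07-01"^^<http://www.w3.org/2001/XMLSchema#date> .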

> GET /e-entity/datasets/{id} retrieves one complete dataset in JSON.

Again, the dataset size can vary, so I am not sure if this is something we want.

> REMOVE /e-entity/datasets/{id} delete a dataset. The endpoint returns the complete removed dataset in JSON.

I guess you meant DELETE. I agree with deletion, but I don't think any data should be returned. It breaks a REST principle: the idempotence of the DELETE method.

> POST /e-entity/datasets/{id} update a dataset. The endpoint returns the complete updated dataset in JSON.

For updates, PUT should be used. Again, I don't think it makes sense to return a 100 MB dataset.

> 3. Expiration time for datasets

> I expect users to upload many datasets while playing around with the API, and I expect them not to delete these datasets. Datasets can be potentially large. So I suggest we introduce a parameter expiration that defaults to 24 (hours), so datasets will be deleted after one day. expiration=-1 makes a dataset permanent.

We have enough on the TODO list for e-Entity. Let's leave this for the following prototypes.

> 5. Question

In "Feat 2" you mention a parameter "our-ner". What does that mean? Do you want to introduce separate endpoints for dbpedia spotlight and entityclassifier.eu?

Neither DBpedia Spotlight nor Entityclassifier.eu provides such a feature: integration of external KBs. This means we have to implement our own NER system with this feature. This is something we are currently working on, and it is our highest priority, along with the support for additional languages.

> 6. Question

> Can we make a reasonable estimation about the size of such a SKOS dataset?

No, it's impossible. It can be a few MB but also a few GB, depending on the size of the dataset the BC will want to use.

jnehring commented 9 years ago

dataset metadata

We can also encode the metadata in the DataID ontology format instead of a native JSON format. The use of a SKOS file is for advanced users who know about RDF, so this approach is fine.

We should not store the metadata and the rest of the dataset in one big SKOS file. There are some operations that work only on the metadata, e.g. "return all templates created by a specific user". It is not efficient to perform these operations by reading all SKOS files from disk. I suggest using a MySQL database for the metadata and a file on hard disk for the SKOS file.
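A sketch of what such a metadata table could look like (the table and column names are invented for this example):

CREATE TABLE dataset_metadata (
    id            INT AUTO_INCREMENT PRIMARY KEY,
    description   TEXT,
    creation_date DATETIME NOT NULL,
    expiration    INT DEFAULT 24,        -- hours; -1 = permanent
    skos_path     VARCHAR(255) NOT NULL  -- location of the SKOS file on disk
);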

Design of API endpoints

You are right: we should return the whole dataset only when it is necessary, and not upon every POST / PUT / DELETE request. And I agree with you one more time: when I wrote REMOVE and POST I meant DELETE and PUT.

I understand that for the novel NER system (by the way, do you have a name for that thing?) the "dataset" parameter is optional? It should be possible to use e-Entity for other languages without providing a SKOS dataset.