freme-project / freme-ner

Apache License 2.0
6 stars, 1 fork

Make datasets available #114

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

If we want to make the FREME NER datasets available to others, what do we need to do? By making them available I mean that other FREME users, like the ADAPT research centre, can easily upload our datasets into their own FREME NER installation.

I guess we need to provide the SKOS files for these datasets. It would also be nice to have an installer script.

sandroacoelho commented 8 years ago

Hi @jnehring ,

For the data, we have two options:

What about delivering the whole FREME solution through Docker Compose (Broker, Link, Publish, NER, Solr, etc.)?
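A Compose file along these lines could bundle those services. This is a hypothetical sketch only: the image names, ports, versions and volume paths below are assumptions for illustration, not the actual FREME artifacts.

```yaml
# Hypothetical docker-compose.yml sketch for a bundled FREME deployment.
# All image names, ports and paths are assumptions, not real FREME images.
version: "2"
services:
  solr:
    image: solr:5.5                # assumed Solr version
    ports:
      - "8983:8983"
    volumes:
      - ./solr-data:/opt/solr/server/solr  # persist the NER index
  freme-ner:
    image: freme/freme-ner         # hypothetical image name
    depends_on:
      - solr
    ports:
      - "8080:8080"
    environment:
      - SOLR_URL=http://solr:8983/solr
  broker:
    image: freme/broker            # hypothetical image name
    depends_on:
      - freme-ner
    ports:
      - "8000:8000"
```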

m1ci commented 8 years ago

I like the idea of providing FREME as a Docker image.

Prepare a small script that reads our index and generates a dump to be imported, or make our index available for download in binary format (Solr/Lucene format)

Isn't it better to have a script per dataset which loads the data into Solr via FREME NER?
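Such a per-dataset loader could be sketched as below. The endpoint path, query parameters and content type are assumptions modelled on the public FREME API documentation pattern; check your local installation before relying on them.

```python
"""Sketch of a per-dataset loader that pushes one RDF dump into FREME NER,
which then indexes it in Solr. The endpoint path and parameters are
assumptions, not a confirmed API."""
import urllib.parse
import urllib.request

API_BASE = "http://localhost:8080"  # assumed local FREME NER instance


def build_upload_url(name: str, language: str) -> str:
    """Build the (assumed) dataset-creation URL for one dataset."""
    query = urllib.parse.urlencode({"name": name, "language": language})
    return f"{API_BASE}/e-entity/freme-ner/datasets?{query}"


def upload_dataset(name: str, language: str, ttl_path: str) -> int:
    """POST one Turtle dump to FREME NER; returns the HTTP status code."""
    with open(ttl_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        build_upload_url(name, language),
        data=body,
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Usage, one call (or one such script) per dataset, against a running
# instance:
#   upload_dataset("orcid", "en", "orcid.ttl")
```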

sandroacoelho commented 8 years ago

Hi @m1ci , @jnehring

We can use a simple shell script to do this dirty job for us. It is also possible to split the dump by dataset, as Milan suggested.

Please take a look at this snippet.

m1ci commented 8 years ago

Thanks @sandroacoelho for this. Sharing dumps is fine; however, the source dataset can evolve over time and we might want to integrate it from scratch. My idea was to have a script per dataset, such as https://github.com/freme-project/freme-ner/blob/master/index-loc-authors.py, which loads RDF into Solr via FREME NER. The execution of all these scripts can then be triggered by one single bash script.
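The "one script triggers all per-dataset scripts" idea could look like the following sketch. The `load-*.sh` naming convention is an assumption for illustration; any stable convention would do.

```python
"""Sketch of a single driver that triggers every per-dataset loader
script. Assumes the loaders follow a load-<dataset>.sh naming
convention (hypothetical)."""
import glob
import os
import subprocess


def find_loaders(directory="."):
    """Return the per-dataset loader scripts, in a stable order."""
    return sorted(glob.glob(os.path.join(directory, "load-*.sh")))


def run_all(directory=".", dry_run=True):
    """Run (or, in dry-run mode, just list) every loader script."""
    loaders = find_loaders(directory)
    for script in loaders:
        if not dry_run:
            subprocess.run(["bash", script], check=True)
    return loaders
```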

jnehring commented 8 years ago

In the last developers call we agreed on creating a shell script for each dataset.

sandroacoelho commented 8 years ago

Hi @jnehring, @m1ci

We have two shell scripts to do this job for us, as follows:

I can say that freme-ner-dump-to-dataset.sh is in beta. I have been facing some problems dealing with single quotes, double quotes and some other special characters in label fields, which I hope to solve ASAP.
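One way around such quoting problems is to avoid splicing label strings into shell commands at all and instead serialize the dump as JSON, which escapes quotes and other special characters automatically. A minimal sketch, with illustrative field names rather than the actual FREME NER schema:

```python
"""Sketch: emit (uri, label) pairs as a JSON payload so that single
quotes, double quotes and other special characters in labels need no
manual escaping. Field names are illustrative, not the real schema."""
import json


def to_json_dump(docs):
    """Serialize (uri, label) pairs as a JSON array of documents."""
    return json.dumps(
        [{"uri": uri, "label": label} for uri, label in docs],
        ensure_ascii=False,
    )


payload = to_json_dump([
    ("http://example.org/1", "O'Neill"),          # single quote
    ("http://example.org/2", 'The "Best" Band'),  # double quotes
])
print(payload)
```

Round-tripping the payload through a JSON parser recovers the labels byte for byte, quotes included.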

Our first extraction is available here.

Best,

jnehring commented 8 years ago

Thank you! I think the script reads the datasets that are currently in FREME NER and then creates dumps of these datasets. Am I right?

One more question: Why did you exclude DBPedia?

I think the next steps are

sandroacoelho commented 8 years ago

Hi @jnehring

Thank you! I think the script reads the datasets that are currently in FREME NER and then creates dumps of these datasets. Am I right?

Yes.

One more question: Why did you exclude DBPedia?

We can download the DBpedia labels dataset from here, but if you think it is important to include this data in our dump, please let me know. I will just remove a parameter in the query.

I think the next steps are

- Upload these datasets to our FREME server (I will do this as soon as all datasets are ready).
- Create a script that uploads all datasets together, maybe using this script.
- Write an article about how to upload a dataset. This article will also cover #125 later.

Great. Before you start, give me a chance to finish a Docker setup for our Solr instance. With that we will have a tested solution for issue #91.

Best,

jnehring commented 8 years ago

The datasets will be uploaded in #130.

jnehring commented 8 years ago

One question: The datasets you created, are they any different from the datasets already available at http://api.freme-project.eu/doc/current/api-doc/list-datasets.html ?

m1ci commented 8 years ago

What do you mean by different? By the way, there is also:

jnehring commented 8 years ago

What do you mean by different?

I was afraid that the work done in this task was unnecessary. But now I see that there are some datasets that would not be available without this work.