Find dataset for big data performance analysis

jnehring commented 9 years ago

Reasonable size, e.g. 1000 documents (?). A dataset too small does not produce reliable results, a dataset too big is not practical
For start all documents are in English. Would be nice to have to be able to measure and compare speed in other languages also
statistics about the dataset
- Amount of words in total / each document
- Size of dataset in MB in total / each document

Minutes: Towards processing 100k documents per hour with FREME NER

jnehring commented 9 years ago

@nilesh-c can you please do this? It is a requirement for the speed measuring.

m1ci commented 9 years ago

why not reusing GERBIL datasets? there are 14 corpora that we can re-use, so if we evaluate on these corpora then we will comparable results with the other systems. For this reason, I strongly recommend to use the datasets available via GERBIL.

jnehring commented 9 years ago

First we have to decide whether to use Gerbil or FREME NER (see https://github.com/freme-project/technical-discussion/issues/64).

For load testing any corpus will do that has the right size and meets the other requirements.

m1ci commented 9 years ago

First we have to decide whether to use Gerbil or FREME NER (see freme-project/technical-discussion#64).

FREME NER or GERBIL? I dont understand.

For load testing any corpus will do that has the right size and meets the other requirements.

Where are the requirements documented?

jnehring commented 9 years ago

FREME NER or GERBIL? I dont understand.

Sorry I mean Apache JMeter or GERBIL.

Where are the requirements documented?

In this issue in the first post from 13 days ago.

m1ci commented 9 years ago

In this issue in the first post from 13 days ago. Oh sorry I somehow missed it.

So here are my comments on the requirements and the options:

Reasonable size, e.g. 1000 documents (?). A dataset too small does not produce reliable results, a dataset too big is not practical

We can use subset from the wiki links corpus - provided by Google. http://wiki-link.nlp2rdf.org/ It contains 40 million mentions and it is over 180GB in size. But first we need to look in the size of the datasets provided by GERBIL. I think they match this requirement - 1000 docs.

For start all documents are in English. Would be nice to have to be able to measure and compare speed in other languages also

Oh, this will be difficult. We are aware about a dataset for German NER which we can use https://raw.githubusercontent.com/AKSW/n3-collection/master/News-100.nt ... but for other languages such datasets are practically not existent - or we don't know about them. We need to do some research on this.

Amount of words in total / each document

This is available, or it can be easily computed. Most of the datasets available via GERBIL are documented in papers. E.g. the N3 corpus is documented here http://svn.aksw.org/papers/2014/LREC_N3NIFNERNED/public.pdf with all required stats.

Size of dataset in MB in total / each document

The same - available or can be easily computed.

m1ci commented 8 years ago

To summarize, for big data performance analysis one can re-use following datasets:

N3 corpus dataset: http://aksw.org/Projects/N3NERNEDNIF.html
DBpedia Spotlight dataset: http://www.yovisto.com/labs/ner-benchmarks/data/dbpedia-spotlight-nif.ttl
KORE 50 dataset: http://www.yovisto.com/labs/ner-benchmarks/data/kore50-nif-2014-11-03.ttl
Datasets from the OKE challenge: https://github.com/anuzzolese/oke-challenge/tree/master/evaluation-data
MSNBC corpus: https://github.com/AKSW/gerbil/blob/e5e471b42ba72775c6e1883415b487cec73830ed/src/main/resources/dataId/corpora/MSNBC.ttl, or
subset from the wiki links corpus (provided by Google). http://wiki-link.nlp2rdf.org/. It contains 40 million mentions and it is over 180GB in size.

Can we close this issue?

jnehring commented 8 years ago

Thank you :)

freme-project / freme-ner

Find dataset for big data performance analysis #24