**Closed** · jnehring closed this issue 8 years ago
@nilesh-c can you please do this? It is a requirement for the speed measuring.
Why not reuse the GERBIL datasets? There are 14 corpora that we can reuse, so if we evaluate on these corpora we will have results comparable with the other systems. For this reason, I strongly recommend using the datasets available via GERBIL.
First we have to decide whether to use Gerbil or FREME NER (see https://github.com/freme-project/technical-discussion/issues/64).
For load testing, any corpus will do that has the right size and meets the other requirements.
> First we have to decide whether to use Gerbil or FREME NER (see freme-project/technical-discussion#64).
FREME NER or GERBIL? I don't understand.
> For load testing, any corpus will do that has the right size and meets the other requirements.
Where are the requirements documented?
> FREME NER or GERBIL? I don't understand.
Sorry, I meant Apache JMeter or GERBIL.
> Where are the requirements documented?
In this issue, in the first post from 13 days ago.
> In this issue, in the first post from 13 days ago.

Oh sorry, I somehow missed it.
So here are my comments on the requirements and the options:
> Reasonable size, e.g. 1000 documents (?). A dataset that is too small does not produce reliable results; a dataset that is too big is not practical.
We can use a subset of the wiki-links corpus provided by Google: http://wiki-link.nlp2rdf.org/ It contains 40 million mentions and is over 180 GB in size. But first we need to look into the size of the datasets provided by GERBIL. I think they already match this requirement of 1000 docs.
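If we do subset the wiki-links corpus, drawing the 1000 documents uniformly at random matters for representative speed measurements. A minimal sketch using reservoir sampling, which works on a stream too large to hold in memory (the function name and the toy `doc-N` identifiers are illustrative, not from the corpus):

```python
import random

def sample_documents(doc_iter, k=1000, seed=42):
    """Reservoir sampling: draw k documents uniformly at random
    from a (possibly huge) stream, in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, doc in enumerate(doc_iter):
        if i < k:
            # Fill the reservoir with the first k documents.
            reservoir.append(doc)
        else:
            # Replace an existing entry with probability k/(i+1),
            # which keeps every document equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = doc
    return reservoir

# Toy usage: sample 5 "documents" from a stream of 10,000 IDs.
subset = sample_documents((f"doc-{n}" for n in range(10_000)), k=5)
print(len(subset))  # 5
```

The single-pass property is the point here: a 180 GB corpus never needs to be loaded or even counted in advance.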
> To start, all documents are in English. It would be nice to be able to measure and compare speed in other languages as well.
Oh, this will be difficult. We are aware of a dataset for German NER which we can use, https://raw.githubusercontent.com/AKSW/n3-collection/master/News-100.nt ... but for other languages such datasets are practically nonexistent, or we don't know about them. We need to do some research on this.
> Amount of words in total / in each document
This is available, or it can easily be computed. Most of the datasets available via GERBIL are documented in papers; e.g. the N3 corpus is documented, with all the required stats, here: http://svn.aksw.org/papers/2014/LREC_N3NIFNERNED/public.pdf
> Size of dataset in MB in total / each document
The same: available, or easily computed.
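"Easily computed" for both stats could look like the following sketch, assuming the corpus is a hypothetical local directory with one plain-text document per `.txt` file (the function and directory layout are assumptions, not something GERBIL provides):

```python
from pathlib import Path

def corpus_stats(corpus_dir):
    """Per-document and total word counts and sizes (in MB)
    for a directory of plain-text documents."""
    stats = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        stats.append({
            "doc": path.name,
            "words": len(text.split()),      # whitespace tokenization
            "mb": path.stat().st_size / 1e6, # size on disk in MB
        })
    total_words = sum(s["words"] for s in stats)
    total_mb = sum(s["mb"] for s in stats)
    return stats, total_words, total_mb
```

For NIF/N-Triples corpora like News-100.nt, the documents would first have to be extracted from the RDF contexts, but the counting step stays the same.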
To summarize, for big-data performance analysis one can reuse the following datasets:
Can we close this issue?
Thank you :)
Minutes: Towards processing 100k documents per hour with FREME NER