dasch-swiss / dsp-api

DaSCH Service Platform API

Test performance of standoff searches with large texts #1112

Open benjamingeer opened 5 years ago

benjamingeer commented 5 years ago

This needs large texts containing lots of markup.

benjamingeer commented 5 years ago

@tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

tobiasschweizer commented 5 years ago

I think it won't be hard if the standoff is created from XML. We could just copy and paste an existing text several times inside the same XML doc, reusing the same structure and mapping.
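For reference, a minimal sketch of how such a combined document could be produced with Python's standard library; the element names (`document`, `text`) and the file layout are assumptions, since the real BEOL structure and mapping are project-specific:

```python
# Sketch: concatenate the bodies of several XML transcriptions into one
# large document that still conforms to the same structure and mapping.
# The element names "document" and "text" are placeholders.
import xml.etree.ElementTree as ET
from pathlib import Path

combined_root = ET.Element("document")

for path in sorted(Path("texts").glob("*.xml")):
    tree = ET.parse(path)
    # Append each source document's text elements to the combined document.
    for text_elem in tree.getroot().findall("text"):
        combined_root.append(text_elem)

ET.ElementTree(combined_root).write(
    "combined.xml", encoding="utf-8", xml_declaration=True
)
```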

benjamingeer commented 5 years ago

We could just copy and paste an existing text several times

I think that for a realistic performance test, it would be better not to repeat the same content.

tobiasschweizer commented 5 years ago

OK, then let's create a huge text by combining existing texts that use the same mapping.

mrivoal commented 5 years ago

A digital edition corpus might not be the most demanding kind of text for performance testing. As discussed with @tobiasschweizer and two researchers from a new SNF project, corpora from linguistics are probably more intensively tagged.

We (meaning @loicjaouen) were also thinking about creating a fake corpus tagged with NLP libraries, such as the one provided by Stanford, which, according to one of the researchers, produces 8 different tags. But we were planning this for the end of July.
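As an illustration of what such a simulated corpus could look like, here is a minimal sketch that wraps every token in an XML element carrying its part-of-speech tag. It uses NLTK purely for convenience (the project discussed the Stanford tools, which produce a richer annotation set), and the `text`/`w` element names are made up:

```python
# Sketch: simulate an intensively tagged linguistic corpus by wrapping
# every token in an XML element carrying its part-of-speech tag.
# NLTK is used here for illustration only; the element names are placeholders.
import xml.sax.saxutils
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def tag_text_as_xml(text: str) -> str:
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # e.g. [("Time", "NNP"), ("flies", "VBZ"), ...]
    words = [
        f'<w pos="{pos}">{xml.sax.saxutils.escape(word)}</w>'
        for word, pos in tagged
    ]
    return "<text>{}</text>".format(" ".join(words))

print(tag_text_as_xml("Time flies like an arrow."))
```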

benjamingeer commented 5 years ago

we were planning this for the end of July.

That would be great. No reason we can't test with both kinds of texts.

mrivoal commented 5 years ago

Yes, exactly!

SepidehAlassi commented 5 years ago

@tobiasschweizer @SepidehAlassi To make a large text containing lots of markup, could we combine a lot of BEOL texts into one text? How hard do you think that would be?

@benjamingeer It's easy to do; I can combine all of the Euler correspondence texts, which are full of markup. When do you need it?

benjamingeer commented 5 years ago

I downloaded 50 large books from Project Gutenberg. Each is at least 500 KB, and many are over 1 MB.

The current plan is to use knora-py to create a simple ontology for these books, and to add markup (using the standard mapping; see the sketch after this list):

  1. On each word.
  2. On each sequence of 10 words (simulating sentences).
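A minimal sketch of how that markup could be generated, assuming plain-text input and placeholder element names (`word`, `sentence`, `text`); the real script would use elements defined in the standard mapping:

```python
# Sketch: wrap every word in a tag and every run of 10 words in an
# enclosing tag, to simulate densely marked-up text. The tag names
# are placeholders for elements from the standard mapping.
import xml.sax.saxutils

def add_markup(plain_text: str, words_per_sentence: int = 10) -> str:
    words = plain_text.split()
    sentences = []
    for i in range(0, len(words), words_per_sentence):
        chunk = words[i : i + words_per_sentence]
        wrapped = " ".join(
            f"<word>{xml.sax.saxutils.escape(w)}</word>" for w in chunk
        )
        sentences.append(f"<sentence>{wrapped}</sentence>")
    return "<text>{}</text>".format("\n".join(sentences))
```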

Then test:

  1. Retrieving a book without markup.
  2. Retrieving a book with markup.
  3. Searching for a book using full-text search.
  4. Searching for a book using standoff in Gravsearch (sketched below).
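For the last test, a rough sketch of the kind of request that could be posted to DSP-API's `/v2/searchextended` endpoint. The `books` ontology and its property names are hypothetical (they would come from the ontology created with knora-py), and the content type is an assumption, so the DSP-API docs should be checked for the exact request format:

```python
# Sketch: POST a standoff Gravsearch query to a local DSP-API instance.
# The "books" ontology names are hypothetical; the standoff class shown
# is just one plausible choice from the standard mapping.
import requests

GRAVSEARCH_QUERY = """
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX standoff: <http://api.knora.org/ontology/standoff/simple/v2#>
PREFIX books: <http://0.0.0.0:3333/ontology/0001/books/simple/v2#>

CONSTRUCT {
    ?book knora-api:isMainResource true .
    ?book books:hasText ?text .
} WHERE {
    ?book a books:Book .
    ?book books:hasText ?text .
    ?text knora-api:textValueHasStandoff ?standoffTag .
    ?standoffTag a standoff:StandoffParagraphTag .
}
"""

response = requests.post(
    "http://0.0.0.0:3333/v2/searchextended",
    data=GRAVSEARCH_QUERY.encode("utf-8"),
    headers={"Content-Type": "application/sparql-query"},
)
response.raise_for_status()
print(response.json())
```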

The goal is to provide some guidelines about how and when to split up large texts into multiple text values.

mrivoal commented 5 years ago

We initially hoped that we could help you test this. However, the project I mentioned earlier won't use Knora as a research tool during the project, but rather as a data curation tool at the end of it.

So, we are not going to simulate a corpus tagged with NLP libraries right now.