dkt-projekt / DocumentStorage


Add Mendelsohn archive to the DocumentStorage #11

Open jnehring opened 8 years ago

jnehring commented 8 years ago

I created an example of the Mendelsohn collection. I extracted all letters from the "handschrift" table, which resulted in 2800 files, packed them into a 3 MB ZIP archive, and uploaded it to the DocumentStorage.

Error messages

It processed all the files within seconds. That is suspiciously quick, so there may be a problem. 135 files failed with errors like this one:

{
  "exception": "eu.freme.common.exception.BadRequestException",
  "path": "/e-sesame/storeData",
  "message": "Unable to generate directory: /opt/storage/sesameStorage/mendelsohn",
  "error": "Bad Request",
  "status": 400,
  "timestamp": 1472045365668
}

Examining the e-Sesame

Counting all triples in e-Sesame returns 2681 triples. This is not enough. I would expect something around 10,000, even if there are no annotations.
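For reference, a count like this can be obtained with a standard SPARQL 1.1 aggregate query. The following is only a minimal sketch of building such a request, assuming e-Sesame is reachable through a SPARQL-protocol endpoint. The endpoint URL and the helper name `build_query_url` are hypothetical and do not come from this thread.

```python
from urllib.parse import urlencode

# Standard SPARQL 1.1 aggregate query: counts every triple in the default graph.
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"


def build_query_url(endpoint: str, query: str) -> str:
    """Return a SPARQL-protocol GET URL with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint, for illustration only.
url = build_query_url("http://localhost:8080/sparql", COUNT_TRIPLES)
```

The resulting URL can be fetched with any HTTP client. Only the query text is standard; everything else depends on how the store is deployed.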

Retrieving all NIF contexts returns:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos/> .
@prefix time: <http://www.w3.org/2006/time#> .

<http://dkt.dfki.de/documents/#char=0,410> a nif:Context .

<http://dkt.dfki.de/documents/#char=0,411> a nif:Context .

<http://dkt.dfki.de/documents/#char=0,412> a nif:Context .

<http://dkt.dfki.de/documents/#char=0,409> a nif:Context .

Which is wrong for two reasons:

jnehring commented 8 years ago

I fixed a bug, and the problems I reported above about examining e-Sesame are now resolved. Examining e-Sesame again, I get a lot of output like this:

        <result>
            <binding name='p'>
                <uri>http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#isString</uri>
            </binding>
            <binding name='s'>
                <uri>http://dkt.dfki.de/documents/#char=0,764</uri>
            </binding>
            <binding name='o'>
                <literal>@prefix xsd:   &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix nif:   &lt;http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#&gt; .

&lt;http://digitale-kuratierung.de/ns/100.txt#char=0,328&gt;
        a               nif:RFC5147String , nif:Context , nif:String ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "328"^^xsd:nonNegativeInteger ;
        nif:isString    "[Kartenbrief Anschrift/Absender]\r\nAn\r\nFr��ulein\r\nLuise Maas\r\nin Rottach b/Tegernsee\r\nWohnung Adr. Maler A. Weilhammer\r\nAdresse des Absenders:  M��nchen\r\nAgnesstr. 52.II.1.\r\n\r\nM��nchen 13.VIII.12\r\nBin 10.51 Uhr morgen\r\n-  Mittwoch - Vormittag\r\nin Tegernsee. \r\nSeien die G��tter uns \r\ngn��diger als heute.\r\n \r\nVon ganzem Herzen.\r\nErich" .
</literal>
            </binding>
        </result>

There are two problems:

  1. Why is the whole NIF document the object of a triple?
  2. What happens to the special characters?
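
Both symptoms can be checked for mechanically across the whole result set. The sketch below assumes the full response is a standard SPARQL XML results document (W3C namespace `http://www.w3.org/2005/sparql-results#`) with the variables `s`, `p`, and `o` as in the output above, and that the damaged characters arrive as U+FFFD replacement characters. The function name and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format (assumed here).
SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"

# Shortened, made-up sample modelled on the output above.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="s"><uri>http://dkt.dfki.de/documents/#char=0,764</uri></binding>
      <binding name="o"><literal>@prefix xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt; .
nif:isString "Fr\ufffd\ufffdulein" .</literal></binding>
    </result>
    <result>
      <binding name="s"><uri>http://example.org/ok</uri></binding>
      <binding name="o"><literal>plain text</literal></binding>
    </result>
  </results>
</sparql>"""


def find_suspicious_objects(xml_text):
    """Return (subject, reasons) pairs for results whose ?o literal looks wrong."""
    root = ET.fromstring(xml_text)
    findings = []
    for result in root.iter(SPARQL_NS + "result"):
        values = {}
        for binding in result.findall(SPARQL_NS + "binding"):
            if len(binding) == 0:
                continue
            child = binding[0]  # <uri>, <literal> or <bnode>
            kind = child.tag.split("}")[-1]
            values[binding.get("name")] = (kind, child.text or "")
        kind, text = values.get("o", ("", ""))
        if kind != "literal":
            continue
        reasons = []
        # Problem 1: the literal is itself a serialized Turtle/NIF document.
        if text.lstrip().startswith("@prefix"):
            reasons.append("embedded Turtle document")
        # Problem 2: characters were lost and replaced during decoding.
        if "\ufffd" in text:
            reasons.append("U+FFFD replacement character")
        if reasons:
            findings.append((values.get("s", ("", ""))[1], reasons))
    return findings


findings = find_suspicious_objects(SAMPLE)
```

Counting the entries in `findings` per reason would show whether every document is affected or only a subset.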

jnehring commented 8 years ago

Problem 1 could be solved by changing the pipeline configuration. For problem 2 I raised https://github.com/dkt-projekt/e-Sesame/issues/11

jnehring commented 8 years ago

After some updates, it now processes the Mendelsohn collection in about 5 minutes.

Next step: find out why 30 documents got stuck in CURRENTLY_PROCESSING.