elastic / elasticsearch-mapper-attachments

Mapper Attachments Type plugin for Elasticsearch
https://www.elastic.co
Apache License 2.0

Failed to index metadata for epub #30

Closed cgravier closed 11 years ago

cgravier commented 11 years ago

Dear all,

I suspect there is a bug in the interaction with apache tika-app when dealing with epub.

Here are the steps to reproduce the suspected bug. Please correct me if I missed something. If this bug report turns out to be a false positive, it can still serve as a how-to for others. I simplified the example as much as possible (no analyzer, on purpose).

Given the following free epub : https://github.com/downloads/ieure/sicp/sicp.epub

To extract its metadata with Apache tika-app, I run the following command :

java -jar tika-app-1.3.jar -m sicp.epub 

That outputs :

Author: Harold Abelson and Gerald Jay Sussman with Julie Sussman
Content-Length: 1211674
Content-Type: application/epub+zip
creator: Harold Abelson and Gerald Jay Sussman with Julie Sussman
dc:creator: Harold Abelson and Gerald Jay Sussman with Julie Sussman
dc:identifier: 0-262-01153-0
dc:language: en-US
dc:publisher: MIT Press
dc:rights: Creative Commons Attribution-Noncommercial 3.0 Unported License.
dc:title: Structure and Interpretation of Computer Programs
identifier: 0-262-01153-0
language: en-US
meta:author: Harold Abelson and Gerald Jay Sussman with Julie Sussman
publisher: MIT Press
resourceName: sicp.epub
rights: Creative Commons Attribution-Noncommercial 3.0 Unported License.
title: Structure and Interpretation of Computer Programs

This is good.

Now let us index it in Elasticsearch.

First, create the mapping. Here is what I am using inside a bash script :

host=localhost:9200
curl -X DELETE "${host}/myepubcontents"
curl -X GET "${host}/_cluster/health?wait_for_status=green&pretty=1&timeout=5s"
curl -X PUT "${host}/myepubcontents" -d '{
  "settings" : {
    "index" : { "number_of_shards" : 5, "number_of_replicas" : 0 }
  }
}'

curl -X PUT "${host}/myepubcontents/books/_mapping" -d '{
  "books" : {
    "_all" : {"enabled" : false},
    "properties" : {
      "file" : {
          "type" : "attachment",
          "fields" : {
            "date": { "store": "yes" },
            "author": { "store": "yes" },
            "title" : { "store" : "yes" },
            "file" : { "store":"yes" }
          }
      }
    }
  }
}'

Now that Elasticsearch has the mapping, I push the epub to it. To do so, the file must be base64-encoded and sent with curl. Given curl's limitations, the base64-encoded content must first be written to a file on the filesystem, which is then passed to curl as an argument. Here is how I proceed (for the sake of the example, I ran the commands in an interactive bash shell, outside any batch script) :

coded=`cat sicp.epub | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
echo "{\"file\":\"${coded}\"}" > sicp-b64.json
curl -X POST "localhost:9200/myepubcontents/books" -d @sicp-b64.json
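
Note that the Perl one-liner above runs `encode_base64` once per input line because of `-n`. A possible alternative, shown here only as a sketch, encodes the whole file in a single pass and strips the line wraps so that the value fits in one JSON string. The helper name `make_attachment_json` is hypothetical, and the sketch assumes a `base64` command-line tool is installed.

```shell
#!/usr/bin/env bash
# Sketch: stream a file into a {"file": "<base64>"} JSON body.
# make_attachment_json is a hypothetical helper name, not part of the plugin.
make_attachment_json() {
  local path="$1"
  printf '{"file":"'
  # Encode the whole file in one pass, then remove the line wraps
  # that base64 inserts, so the JSON string stays on a single line.
  base64 < "$path" | tr -d '\n'
  printf '"}\n'
}

# Usage (needs a running node, so it is left commented out):
# make_attachment_json sicp.epub > sicp-b64.json
# curl -X POST "localhost:9200/myepubcontents/books" -d @sicp-b64.json
```

Streaming the output avoids holding the encoded book in a shell variable, which keeps the helper usable for large files.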

I now want to query Elasticsearch for this content. Here is my query :

curl -XGET 'http://localhost:9200/myepubcontents/books/_search?pretty=true&explain=false' -d '{ "fields" : [ "title", "author", "date" ],
   "query": {
      "query_string" : {
         "fields" : ["file", "title", "author" ],
         "query" : "a*",
         "use_dis_max" : true }
   },
   "highlight" : {
      "fields" : {
         "file" : {}
      }
    }
}'

It outputs :

{
  "took" : 118,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "myepubcontents",
      "_type" : "books",
      "_id" : "C0eqQD7iRLeb5HBhYoFeeA",
      "_score" : 1.0,
      "highlight" : {
        "file" : [ " numbers.\n<em>Also</em>, in real computers, <em>arithmetic</em> operations are <em>almost</em> <em>always</em>\nperformed with limited", " we are <em>asked</em> to <em>apply</em> the procedure to some\n<em>argument</em>, we first look to see if the value is <em>already</em>", " <em>advantages</em>, however.  One of\nthem is that it can <em>accommodate</em> procedures that may take an <em>arbitrary</em>", "\n\n\n\n\n\n\nNo <em>ambiguity</em> can <em>arise</em>, because the operator is <em>always</em> the leftmost\nelement and the entire", " <em>appear</em> in\n<em>any</em> powerful programming language:\n\n\n\n\n\n\n\n\n\t\tNumbers and <em>arithmetic</em> operations are\nprimitive" ]
      }
    } ]
  }
}

Basically, I expected metadata such as author, date and title to appear as attributes under "fields" in the JSON response above.

I tried many epub files and found none whose metadata was indexed correctly.

Thanks in advance, @cgravier

cgravier commented 11 years ago

I can add that the mapping and query seem OK: when I add a second document (a PDF in this case) using the same method, i.e.:

coded=`cat download/sherlock_holmes.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
echo "{\"file\":\"${coded}\"}" > sk-b64.json
curl -X POST "localhost:9200/myepubcontents/books" -d @sk-b64.json

When I query Elasticsearch again, both documents are hits, but the PDF is returned with its date, author, and title, while the epub is not :

curl -XGET 'http://localhost:9200/myepubcontents/books/_search?pretty=true&explain=false' -d '{ "fields" : [ "title", "author", "date" ],
   "query": {
      "query_string" : {
         "fields" : [ "file" ],
         "query" : "a*",
         "use_dis_max" : true }
   },
   "highlight" : {
      "fields" : {
         "file" : {}
      }
    }
}'

{
  "took" : 79,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "myepubcontents",
      "_type" : "books",
      "_id" : "dcHGZRSPRZW_JjpHJ4S7QQ",
      "_score" : 1.0,
      "fields" : {
        "file.date" : "2011-02-22T02:08:19.000Z",
        "file.author" : "Arthur Conan Doyle",
        "file.title" : "The Adventures of Sherlock Holmes"
      },
      "highlight" : {
        "file" : [ " of \n\nit. \n\n\"Eight weeks passed <em>away</em> like this, and I had written <em>about</em> <em>Abbots</em> and <em>Archery</em> and \n\n<em>Armour</em>", " and predominates the whole of her \n\nsex. It was not that he felt <em>any</em> emotion <em>akin</em> to love for Irene <em>Adler</em>. <em>All</em>", ". They were <em>all</em> three standing in a knot in front of the \n\n<em>altar</em>. I lounged up the side <em>aisle</em> like <em>any</em>", " the <em>acquaintance</em> of the well-known <em>adventuress</em>, Irene <em>Adler</em>. The name is no \n\ndoubt familiar to you.\" \n\n\"Kindly", " to.\" \n\n\"And what of Irene <em>Adler</em>?\" I <em>asked</em>. \n\n\"Oh, she has turned <em>all</em> the men's heads down in that part. She" ]
      }
    }, {
      "_index" : "myepubcontents",
      "_type" : "books",
      "_id" : "-oggjyFPQFqrMh3kN9giDw",
      "_score" : 1.0,
      "highlight" : {
        "file" : [ " numbers.\n<em>Also</em>, in real computers, <em>arithmetic</em> operations are <em>almost</em> <em>always</em>\nperformed with limited", " we are <em>asked</em> to <em>apply</em> the procedure to some\n<em>argument</em>, we first look to see if the value is <em>already</em>", " <em>advantages</em>, however.  One of\nthem is that it can <em>accommodate</em> procedures that may take an <em>arbitrary</em>", "\n\n\n\n\n\n\nNo <em>ambiguity</em> can <em>arise</em>, because the operator is <em>always</em> the leftmost\nelement and the entire", " <em>appear</em> in\n<em>any</em> powerful programming language:\n\n\n\n\n\n\n\n\n\t\tNumbers and <em>arithmetic</em> operations are\nprimitive" ]
      }
    } ]
  }
}

spinscale commented 11 years ago

The Tika version used in the plugin seems to be 1.2. Maybe epub support was added later?

dadoonet commented 11 years ago

I tested it with the current version in master (with Tika 1.4) and got it working. The issue is not the Tika version: you hit the maximum content length (100000 by default).

You need to increase that value to get all the content, including the metadata.

For example, I set "index.mapping.attachment.indexed_chars":-1 and got all the expected metadata.
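
As a minimal sketch of where this setting could go, the index creation request from the original report can carry it alongside the shard settings. This assumes the plugin reads the value from the index settings at creation time; the helper names `index_settings_body` and `create_index` are hypothetical, and the index name and host are taken from the report.

```shell
#!/usr/bin/env bash
host=localhost:9200

# Hypothetical helper: prints an index creation body that disables the
# extraction limit (-1), using the setting name quoted in this thread.
index_settings_body() {
  cat <<'JSON'
{
  "settings" : {
    "index.number_of_shards" : 5,
    "index.number_of_replicas" : 0,
    "index.mapping.attachment.indexed_chars" : -1
  }
}
JSON
}

# Hypothetical helper: sends the body to a running node.
# It is defined here but not called.
create_index() {
  curl -X PUT "${host}/myepubcontents" -d "$(index_settings_body)"
}
```

After running `create_index` against a live node, the mapping and indexing steps from the report would be repeated unchanged.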

cgravier commented 11 years ago

Thanks for the tip !