Closed cgravier closed 11 years ago
I can add that the mapping and query seem OK, since when I add a second document (a PDF in this case) using the same method as before, i.e.:
coded=`cat download/sherlock_holmes.pdf | perl -MMIME::Base64 -ne 'print encode_base64($_)'`
echo "{\"file\":\"${coded}\"}" > sk-b64.json
curl -X POST "localhost:9200/myepubcontents/books" -d @sk-b64.json
I can query Elasticsearch again and both documents are hits, but the PDF is returned along with its date, author, and title, while the EPUB is not:
curl -XGET 'http://localhost:9200/myepubcontents/books/_search?pretty=true&explain=false' -d '{
  "fields" : [ "title", "author", "date" ],
  "query" : {
    "query_string" : {
      "fields" : [ "file" ],
      "query" : "a*",
      "use_dis_max" : true
    }
  },
  "highlight" : {
    "fields" : {
      "file" : {}
    }
  }
}'
{
"took" : 79,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "myepubcontents",
"_type" : "books",
"_id" : "dcHGZRSPRZW_JjpHJ4S7QQ",
"_score" : 1.0,
"fields" : {
"file.date" : "2011-02-22T02:08:19.000Z",
"file.author" : "Arthur Conan Doyle",
"file.title" : "The Adventures of Sherlock Holmes"
},
"highlight" : {
"file" : [ " of \n\nit. \n\n\"Eight weeks passed <em>away</em> like this, and I had written <em>about</em> <em>Abbots</em> and <em>Archery</em> and \n\n<em>Armour</em>", " and predominates the whole of her \n\nsex. It was not that he felt <em>any</em> emotion <em>akin</em> to love for Irene <em>Adler</em>. <em>All</em>", ". They were <em>all</em> three standing in a knot in front of the \n\n<em>altar</em>. I lounged up the side <em>aisle</em> like <em>any</em>", " the <em>acquaintance</em> of the well-known <em>adventuress</em>, Irene <em>Adler</em>. The name is no \n\ndoubt familiar to you.\" \n\n\"Kindly", " to.\" \n\n\"And what of Irene <em>Adler</em>?\" I <em>asked</em>. \n\n\"Oh, she has turned <em>all</em> the men's heads down in that part. She" ]
}
}, {
"_index" : "myepubcontents",
"_type" : "books",
"_id" : "-oggjyFPQFqrMh3kN9giDw",
"_score" : 1.0,
"highlight" : {
"file" : [ " numbers.\n<em>Also</em>, in real computers, <em>arithmetic</em> operations are <em>almost</em> <em>always</em>\nperformed with limited", " we are <em>asked</em> to <em>apply</em> the procedure to some\n<em>argument</em>, we first look to see if the value is <em>already</em>", " <em>advantages</em>, however. One of\nthem is that it can <em>accommodate</em> procedures that may take an <em>arbitrary</em>", "\n\n\n\n\n\n\nNo <em>ambiguity</em> can <em>arise</em>, because the operator is <em>always</em> the leftmost\nelement and the entire", " <em>appear</em> in\n<em>any</em> powerful programming language:\n\n\n\n\n\n\n\n\n\t\tNumbers and <em>arithmetic</em> operations are\nprimitive" ]
}
} ]
}
}
The Tika version used in the plugin seems to be 1.2. Maybe support for EPUB was added later?
I tested it with the current version in master (with Tika 1.4) and got it working.
The issue is not the Tika version: you hit the maximum length (100000 by default).
You need to increase that value in order to get all the content, including the metadata.
For example, I set "index.mapping.attachment.indexed_chars":-1
and got all the expected metadata.
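A minimal sketch of one way to apply that value, assuming it is passed as an index setting when the index is created (the comment above does not say where it was set, so this placement is an assumption). The helper name `indexed_chars_settings` and the file name `settings.json` are hypothetical; the index name is the one used earlier in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print an index-creation body carrying the
# indexed_chars limit. Defaults to -1, the value quoted in the comment above.
indexed_chars_settings() {
  local limit="${1:--1}"
  printf '{"settings":{"index.mapping.attachment.indexed_chars":%s}}\n' "$limit"
}

# Usage against a local node (assumption: the index does not exist yet, and
# the mapping and documents are posted again afterwards):
#   indexed_chars_settings > settings.json
#   curl -X PUT "localhost:9200/myepubcontents" -d @settings.json
```

The curl call is left commented out so the snippet can be read and run without a cluster; uncomment it once the body looks right.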
Thanks for the tip!
Dear all,
I suspect there is a bug in the interaction with Apache tika-app when dealing with EPUB files.
Here are the steps to reproduce the suspected bug. Please correct me if I missed something. If this bug report turns out to be a false positive, it could still serve as a how-to for others. I simplified the example as much as possible (no analyzer, on purpose).
Given the following free EPUB: https://github.com/downloads/ieure/sicp/sicp.epub
To extract its metadata with Apache tika-app, I run the following command:
That outputs:
This is good.
Now let us index it in Elasticsearch.
First, create the mapping. Here is what I am using inside a bash script:
Now that Elasticsearch has the mapping, I will push the EPUB to it. For this, we need to encode the file in base64 and send it with curl. Given curl's limitations, we need to write the base64-encoded content to a file on the filesystem, then pass this file as an argument to curl. Here is how I proceed (for the sake of the example, I ran the commands in an interactive bash shell, outside any script):
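A minimal sketch of this encode-then-post step, assuming a `base64` command is available on the PATH. The helper name `make_payload` and the paths `download/sicp.epub` and `sicp-b64.json` are illustrative assumptions, not the commands actually run for this report; the `file` field and the index URL match the ones used elsewhere in this thread.

```shell
#!/usr/bin/env bash

# Hypothetical helper: wrap the base64 encoding of a file in a
# {"file":"..."} JSON body and write it to disk.
# Usage: make_payload <input-file> <output-json>
make_payload() {
  local src="$1" out="$2"
  {
    printf '{"file":"'
    # Encode the whole file in one pass and strip the line wraps that
    # base64 inserts, so the JSON string contains no raw newlines.
    base64 < "$src" | tr -d '\n'
    printf '"}\n'
  } > "$out"
}

# Usage against a local node (file names are assumptions):
#   make_payload download/sicp.epub sicp-b64.json
#   curl -X POST "localhost:9200/myepubcontents/books" -d @sicp-b64.json
```

Streaming the encoder output straight into the file avoids holding a large book in a shell variable.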
I now want to query Elasticsearch for this content. Here is my query:
It outputs:
Basically, I expected metadata such as author, date, and title to appear as attributes under "fields" in the JSON snippet above.
I tried with a lot of EPUB files and found none whose metadata was correctly indexed.
Thanks in advance, @cgravier