codelibs / elasticsearch-river-web

Web Crawler for Elasticsearch
Apache License 2.0

How to index text inside <div> tags #20

Open srinivasv2 opened 10 years ago

srinivasv2 commented 10 years ago

Hi,

Can anyone help me index the text between particular <div> tags, something like: <div data-canvas-width="125.304" data-font-name="g_font_580_0" data-angle="0" style="font-size: 24px; font-family: sans-serif; left: 64px; top: 172px; transform: rotate(0deg) scale(1.00243, 1); transform-origin: 0% 0% 0px;" dir="ltr">Automotive</div>

This is to index some content in pdf files as per my requirement.

Thanks in advance, Srinivas

marevol commented 10 years ago

> This is to index some content in pdf files as per my requirement.

Is the div tag in a PDF file?

srinivasv2 commented 10 years ago

Yes, this div tag is in a PDF file. I need to index all PDF data of this kind for my requirement.

marevol commented 10 years ago

Hmm, extracting content with a CSS query is supported for HTML only, so that is difficult to do.
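
To illustrate why this kind of extraction is tied to HTML, here is a minimal sketch in Python using only the standard library. It is not River Web's implementation; the names `DivTextExtractor` and `extract_div_text` are hypothetical, and matching on the `data-font-name` attribute is just an assumption based on the sample markup in this issue. It collects the text of every <div> that carries that attribute, which only works when the input really is HTML markup.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text inside <div> elements that carry a given attribute."""

    def __init__(self, required_attr):
        super().__init__()
        self.required_attr = required_attr
        self.depth = 0      # nesting depth inside a matching <div>
        self.chunks = []    # text fragments of the div currently being read
        self.results = []   # finished text, one entry per matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            # Nested <div> inside a matching one: track it so we close correctly.
            self.depth += 1
        elif self.required_attr in dict(attrs):
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_div_text(html, required_attr="data-font-name"):
    """Return the text of each <div> that has `required_attr` (hypothetical helper)."""
    parser = DivTextExtractor(required_attr)
    parser.feed(html)
    parser.close()
    return parser.results


sample = (
    '<div data-canvas-width="125.304" data-font-name="g_font_580_0" '
    'dir="ltr">Automotive </div><div>unrelated</div>'
)
print(extract_div_text(sample))
```

Raw PDF bytes contain no such elements, so the same call on PDF content finds nothing to select. The <div> markup in the question would only be available if the PDF had already been rendered to HTML by some viewer.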

srinivasv2 commented 10 years ago

Okay, thanks for your response. My actual intention is to extract some data from PDF files to display as the title and description on the search page, just as we do for normal HTML pages. For PDFs I get an empty field when I try to index "title" in the crawl pattern.

Search result should be like below:

[PDF] Automotive Tote Labeling ... Printers & Media Application Brief Automotive Manufacturing Labeling Industry Need Public Safety and 24/7 production ...

Please let me know of any alternative way to index and fetch particular data from PDF files, which we are able to do in our current search application. As of now I am only able to index the URL and body fields for PDFs in ES, and most of the body content is in binary format.

Thanks, Srinivas

marevol commented 10 years ago

An attachment type might work... Please see "Use attachment type".
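
As a rough sketch of what an attachment type involves, assuming the Elasticsearch mapper-attachments plugin of that era: the field is mapped as `attachment`, the file is sent as base64-encoded bytes, and the plugin extracts text and metadata such as the title on the server side. The field name `my_attachment`, the type name `my_web`, and both helper functions are hypothetical. This only builds the request bodies and does not show how River Web itself is configured to fill such a field.

```python
import base64
import json

# Hypothetical field name; any name can be used as long as the mapping matches.
ATTACHMENT_FIELD = "my_attachment"


def build_attachment_mapping(doc_type):
    """Mapping body that declares an attachment field with a stored title."""
    return {
        doc_type: {
            "properties": {
                ATTACHMENT_FIELD: {
                    "type": "attachment",
                    # Store the extracted title so it can be returned in results.
                    "fields": {"title": {"store": "yes"}},
                }
            }
        }
    }


def build_attachment_document(url, pdf_bytes):
    """Document body carrying the raw file as a base64 string."""
    return {
        "url": url,
        ATTACHMENT_FIELD: base64.b64encode(pdf_bytes).decode("ascii"),
    }


mapping = build_attachment_mapping("my_web")
document = build_attachment_document("http://example.com/brief.pdf", b"%PDF-1.4 sample")
print(json.dumps(mapping, indent=2))
print(json.dumps(document, indent=2))
```

With a mapping like this, the binary body no longer has to be searched as-is, since the extracted text and title become separate indexed fields. Whether the title is populated depends on the metadata present in each PDF.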