dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Ability to split documents per page so one elasticsearch entry per page #767

Open jawiz opened 5 years ago

jawiz commented 5 years ago

When indexing large documents you may hit limits not only on the indexing part, but also when doing searches.

Splitting documents into one entry per page slices large documents into bite-size chunks and helps the performance of both indexing and searching.

dadoonet commented 5 years ago

Sadly Tika does not offer this AFAIK.

dadoonet commented 5 years ago

Actually @tballison wrote recently on discuss:

It is simpler than that. Just use the ToXMLContentHandler to get an XML String, and then run a SAXParser (or JSoup in case we're not getting our tags right :D) against that xml, and parse the content per page. No need to send anything back to Tika. I can demo it for you pretty easily...

So that should be doable. I need to play a bit around it. :)
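tballison's suggestion can be sketched with the Python standard library alone: take the XHTML string that Tika's ToXMLContentHandler would produce and run a streaming parser over it, cutting the content at each `<div class="page">` boundary. A minimal sketch, assuming hand-written sample XHTML as a stand-in for real Tika output (real output carries more markup and metadata):

```python
from html.parser import HTMLParser

# Hand-written stand-in for the XHTML Tika's ToXMLContentHandler
# emits for a two-page PDF.
SAMPLE_XHTML = """
<html><body>
<div class="page"><p>Text of page one.</p></div>
<div class="page"><p>Text of page two.</p></div>
</body></html>
"""

class PageSplitter(HTMLParser):
    """Collect the text of each <div class="page"> into a separate string."""

    def __init__(self):
        super().__init__()
        self.pages = []    # finished pages
        self._buf = None   # text buffer for the page being read, or None
        self._depth = 0    # nesting depth of <div>s inside the current page

    def handle_starttag(self, tag, attrs):
        if self._buf is not None:
            if tag == "div":
                self._depth += 1
        elif tag == "div" and ("class", "page") in attrs:
            self._buf = []   # entering a new page div
            self._depth = 0

    def handle_endtag(self, tag):
        if self._buf is not None and tag == "div":
            if self._depth == 0:
                # closing the page div itself: flush the buffered text
                self.pages.append("".join(self._buf).strip())
                self._buf = None
            else:
                self._depth -= 1

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

splitter = PageSplitter()
splitter.feed(SAMPLE_XHTML)
print(splitter.pages)  # ['Text of page one.', 'Text of page two.']
```

The same per-page list could then be indexed as one Elasticsearch document per page.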

jawiz commented 5 years ago

Yeah, it's totally doable. I wrote a small program in Python that covers my needs. It only works on .pdf, not .docx. Basically, Tika parses the document to HTML, and each page is a div with class page.

In Python I wrote it using BeautifulSoup 4 html parser to parse the HTML.

from bs4 import BeautifulSoup

def pageSplit(rawContent):
    """Split Tika's HTML output into a list of per-page text blocks."""
    content = rawContent["content"]
    soup = BeautifulSoup(content, "html.parser")
    pages = []
    # Tika wraps each page of a PDF in <div class="page">
    for page in soup.find_all('div', {"class": "page"}):
        pages.append(page.text)
    return pages

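Once the per-page text is available, each page can become its own Elasticsearch entry. A minimal sketch, assuming a made-up index name (`my_docs`) and document shape, neither of which is part of FSCrawler's real mapping; the helper only builds elasticsearch-py bulk actions, and the actual indexing call is left commented since it needs a running cluster:

```python
def page_actions(path, pages, index="my_docs"):
    """Yield one bulk-index action per page of a document.

    `path`, `index`, and the _source fields are illustrative only.
    """
    for number, text in enumerate(pages, start=1):
        yield {
            "_index": index,
            "_id": f"{path}#page={number}",  # stable id: file path + page number
            "_source": {"path": path, "page": number, "content": text},
        }

actions = list(page_actions("report.pdf", ["page one text", "page two text"]))
print(actions[0]["_id"])  # report.pdf#page=1

# To actually index (assumes a reachable cluster):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, page_actions("report.pdf", pages))
```

Using a deterministic `_id` built from the file path and page number means re-crawling the same file updates the existing per-page documents instead of duplicating them.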
mchari commented 4 years ago

Do you have an estimated date of availability, so I can decide whether to interface with ES myself or wait for this functionality? Thanks.

dadoonet commented 4 years ago

Absolutely no idea. I believe it won't happen before 2020 unless someone wants to add it to the project.

mchari commented 4 years ago

Hi David, I am able to add pages encoded in base64 using the instructions in https://kb.objectrocket.com/elasticsearch/how-to-index-a-pdf-file-as-an-elasticsearch-index-267. I used es.index() to add the encoded pages into ES, without specifying any document id, and confirmed that there is a document in my ES index. But when I try to query for content, such as:

qres = es.search(
    index="prestotest",
    body={"query": {"bool": {"must": [{"match": {"content": "Informed consent"}}]}}},
)
print(qres['hits']['total'])

I don't see any hits. Any idea how I could make it work?
Thanks