Solr and rich documents

GoogleCodeExporter commented 9 years ago

Just a quick question: is solr expected to crawl rich documents and extract all 
data from them? I read somewhere that it should work like that. For example, if 
I have the word "example" in my file.doc and search for "example", file.doc 
should show up, right? 

Can you please explain further what is expected when using solr with rich 
documents: crawling, or only file indexing?

Regards,
Nikola

Original issue reported on code.google.com by ngara...@gmail.com on 5 Nov 2010 at 10:55

GoogleCodeExporter commented 9 years ago

Files whose contents can be crawled by solr and whose size is not larger than 
20 MB are posted to it when the file is indexed. So yes, your search should 
show file.doc. Keep in mind that indexing includes both the metadata 
(filename, tags, etc.) and the file contents (if crawling is possible).
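
The rule described above can be sketched as a small check. This is only an illustration, not the GSS code: the function name, the constant names and the list of crawlable extensions are assumptions; only the 20 MB limit comes from the comment itself.

```python
# Hypothetical sketch of the rule described above; not taken from the GSS source.
MAX_CONTENT_BYTES = 20 * 1024 * 1024  # the 20 MB limit mentioned in the comment

# Assumption: an illustrative set of extensions whose contents can be crawled.
CRAWLABLE_EXTENSIONS = {"doc", "pdf", "ppt", "xls", "txt"}


def should_post_contents(filename, size_bytes):
    """Return True if the file body should be posted together with the metadata."""
    # Files without a dot have no extension and are treated as not crawlable.
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return extension in CRAWLABLE_EXTENSIONS and size_bytes <= MAX_CONTENT_BYTES
```

Under these assumptions, a small file.doc would have its contents posted, while an oversized or non-crawlable file would be indexed by metadata only.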

Currently we are using solr 1.3 and we intend to upgrade to solr 1.4 as soon as 
possible. The problem is that 1.3 does not support crawling of rich docs, so we 
did it via a beta plugin (the maximum file size limit was also enforced 
because of issues with the plugin). The upgrade will solve this and various 
other issues. In fact, we intend to redesign the indexing / search 
functionality.

Original comment by fstamate...@gmail.com on 5 Nov 2010 at 11:36

GoogleCodeExporter commented 9 years ago

So, if I do not get any results when searching rich file content, something is 
wrong? As in the example above, the file containing the word does not show up. 
I am running solr from the solr example directory with the solrconfig.xml file 
you provided me months ago, when I had issues with solr.

Regards,
Nikola

Original comment by ngara...@gmail.com on 5 Nov 2010 at 12:50

GoogleCodeExporter commented 9 years ago

I use solr 1.3 with the patch for parsing rich documents, and when uploading, 
for example, a pdf file, the only thing I see in solr.log is the following:

INFO: [] webapp=/solr path=/update/rich params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf} status=0 QTime=12656

solrconfig.xml contains the line:

 <requestHandler name="/update/rich" class="solr.RichDocumentRequestHandler" startup="lazy" />

What else am I missing?

Since I am running solr standalone, I do not need to build it with ant, do I? 
I will try to run the jboss server in debug mode to check if something shows up 
in the log.
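
For reference, the request in the log line above can be rebuilt as a URL to see exactly which parameters reach the handler. This is a minimal sketch: the function names are hypothetical, the base URL assumes solr's example Jetty setup on its default port, and the comments on what each parameter means are a reading of the log, not confirmed handler documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: solr's example Jetty setup listening on its default port.
DEFAULT_BASE_URL = "http://localhost:8983/solr"


def build_rich_update_url(doc_id, name, stream_type, base_url=DEFAULT_BASE_URL):
    """Build a /update/rich URL with the parameters seen in the log line."""
    params = {
        "id": doc_id,
        "name": name,
        "fieldnames": "id,name",     # presumably the metadata fields sent along
        "stream.type": stream_type,  # e.g. "pdf"
        "stream.fieldname": "body",  # presumably the field for the extracted text
        "commit": "true",
    }
    return f"{base_url}/update/rich?{urlencode(params)}"


def query_params(url):
    """Decode the query string back into a dict for inspection."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}
```

Decoding the URL again with query_params makes it easy to compare what the client sends against what shows up in solr.log.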

Original comment by ngara...@gmail.com on 11 Nov 2010 at 5:23

GoogleCodeExporter commented 9 years ago

Just a quick note. I believe we will have the new, re-designed and solr 1.4 
based version in a week or so.

Original comment by fstamate...@gmail.com on 11 Nov 2010 at 9:14

GoogleCodeExporter commented 9 years ago

So, does that mean that content searching (rich document parsing) currently 
does not work with solr 1.3?

Original comment by ngara...@gmail.com on 12 Nov 2010 at 8:06

GoogleCodeExporter commented 9 years ago

Migration to solr 1.4 has been completed. Changeset fe8bddad316f (branch 
solr1.4) will be merged to the default branch when testing is finished.

Original comment by chstath on 19 Nov 2010 at 2:51

Changed state: Started

GoogleCodeExporter commented 9 years ago

Could you attach the solrconfig.xml you use? The example file has the field 
names from the example schema.xml file, not the required field names.

Regards,
Nikola

Original comment by ngara...@gmail.com on 22 Nov 2010 at 10:11

GoogleCodeExporter commented 9 years ago

If you have checked out the solr1.4 branch then you already have the 
solrconfig.xml and schema.xml in the gss source tree (solr/conf folder). The 
ant build script contains some new targets, like run-solr, that will copy the 
solr conf files to the right place and run solr.

Original comment by chstath on 22 Nov 2010 at 10:41

GoogleCodeExporter commented 9 years ago

But the files look like the default ones, at least solrconfig.xml. For example, 
you commented out price and cat and other things not used, but you left 
everything in solr search ...

I am talking about this:

<requestHandler name="dismax" class="solr.SearchHandler" >
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
        text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
     </str>
     <str name="pf">
        text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
     </str>
     <str name="bf">
        popularity^0.5 recip(price,1,1000,1000)^0.3
     </str>
     <str name="fl">
        id,name,price,score
     </str>
     <str name="mm">
        2<-1 5<-2 6<90%
     </str>

Regards,
Nikola
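
One way to spot leftover example fields like the ones quoted in this comment is to parse the boosted field lists (qf, pf) and compare them with the fields the schema actually defines. A minimal sketch; the helper names and the schema field set used in the usage note are hypothetical, not taken from the GSS schema.xml.

```python
def boosted_fields(spec):
    """Parse a dismax field list such as 'text^0.5 name^1.2' into {field: boost}."""
    fields = {}
    for token in spec.split():
        name, _, boost = token.partition("^")
        # A field listed without an explicit boost is treated as boost 1.0 here.
        fields[name] = float(boost) if boost else 1.0
    return fields


def unknown_fields(spec, schema_fields):
    """Return the fields referenced in spec that the schema does not define."""
    return sorted(set(boosted_fields(spec)) - set(schema_fields))
```

For example, with an assumed schema that defines only id, name and text, checking the qf value quoted above would report cat, features, manu and sku as fields left over from the example configuration.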

Original comment by ngara...@gmail.com on 22 Nov 2010 at 11:11

GoogleCodeExporter commented 9 years ago

You are right, but we don't use the dismax handler any longer, so I didn't 
touch its configuration.

Original comment by chstath on 22 Nov 2010 at 11:19

GoogleCodeExporter commented 9 years ago

Ok, another question about rich text documents. How do I know a file is being 
parsed? I downloaded some solr rich document test files ( 
https://issues.apache.org/jira/secure/attachment/12381612/test-files.zip ), and 
uploaded the 6a.Brennan-Performance.ppt file. I can search its content. I also 
have some ppt files of my own for testing, but it seems they don't get parsed 
at all.

So, are there any rules for which files get parsed, and how do I know whether 
solr parsed a file successfully?

Check the attached file; it does not get parsed here.

Original comment by ngara...@gmail.com on 23 Nov 2010 at 11:57

Attachments:

test2.ppt

GoogleCodeExporter commented 9 years ago

> So, are there any rules for files to get parsed or not, how do I know solr 
> parsed it successfully or not.

We do not store whether the file contents were successfully indexed. During 
indexing the MDB posts the file and metadata to Solr. If Solr complains 
(exception, error code, etc.), we post the metadata again (but not the file). 
For example, the contents of a "locked" PDF will not be indexed.
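
The fallback described above can be sketched roughly as follows. This is not the MDB code: the `post` callable, the `SolrRejected` exception and the two stub posters are hypothetical stand-ins for whatever actually talks to Solr.

```python
class SolrRejected(Exception):
    """Hypothetical error raised when Solr answers with an exception or error code."""


def index_file(post, metadata, content=None):
    """Post contents plus metadata; on rejection, fall back to metadata only."""
    if content is not None:
        try:
            post(metadata, content)
            return "contents+metadata"
        except SolrRejected:
            pass  # e.g. a "locked" PDF: retry below without the file body
    post(metadata, None)
    return "metadata-only"


def accepting_post(metadata, content):
    """Stub poster for illustration: accepts every request."""


def rejecting_post(metadata, content):
    """Stub poster for illustration: rejects any request that carries a file body."""
    if content is not None:
        raise SolrRejected("could not extract contents")
```

Because the outcome is returned rather than stored, a caller could record it somewhere if an "indexed contents" indicator is ever wanted; the current implementation, as noted, does not keep this information.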

I tried uploading your test2.ppt file and it seems ok. I searched with 
"Storage" and "System" and the file came up in the search results. I'm testing 
on the Solr1.4 branch. We are currently testing the new implementation and we 
will soon merge it with HEAD. You will have to re-index though (if you have an 
installation with actual files).

Also, as for whether you can see in the UI if your files were successfully 
indexed: no, not in the current implementation. I would personally like to have 
some indication in the file properties or even in the file list; however, we 
have not decided on this yet.

Original comment by fstamate...@gmail.com on 23 Nov 2010 at 1:25

GoogleCodeExporter commented 9 years ago

Oh, that might be the case. I am still using the old solr 1.3 gss version; 
sorry I did not mention it.

About parsing: too bad there is no feedback visible to the user. I understand, 
but from the user's perspective, how can the user know which files are 
searchable by content and which are not...

Regards,
Nikola

Original comment by ngara...@gmail.com on 23 Nov 2010 at 1:32

GoogleCodeExporter commented 9 years ago

Original comment by chstath on 29 Nov 2010 at 10:00

GoogleCodeExporter commented 9 years ago

Original comment by chstath on 29 Nov 2010 at 11:41

Changed state: Fixed

chstath / gss

Solr and rich documents #51