File types whose contents can be crawled by Solr, and whose size is not larger
than 20 MB, are posted to it when the file is indexed. So yes, your search
should show file.doc. Keep in mind that indexing includes both the metadata
(filename, tags, etc.) and the file contents (if crawling is possible).
Currently we are using Solr 1.3 and we intend to upgrade to Solr 1.4 as soon as
possible. The problem is that 1.3 does not support crawling of rich docs, so we
did it via a beta plugin (the maximum file size limit was also enforced
because of issues with the plugin). The upgrade will solve this and various
other issues. We actually intend to redesign the indexing / search
functionality.
Original comment by fstamate...@gmail.com
on 5 Nov 2010 at 11:36
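The rule described in the comment above (contents are posted only for crawlable file types no larger than 20 MB, while metadata is always indexed) can be sketched roughly as follows. This is a minimal Python illustration, not the GSS code: the function name, the extension list and the exact byte value of the limit are assumptions.

```python
import os

# Assumed values for illustration only; the real GSS lists and limits may differ.
MAX_CONTENT_SIZE = 20 * 1024 * 1024  # "20 MB" from the comment, taken here as MiB
CRAWLABLE_EXTENSIONS = {".doc", ".pdf", ".ppt", ".txt"}  # hypothetical list


def index_plan(filename, size_bytes):
    """Return which parts of a file would be sent to Solr."""
    ext = os.path.splitext(filename)[1].lower()
    send_contents = ext in CRAWLABLE_EXTENSIONS and size_bytes <= MAX_CONTENT_SIZE
    # Metadata (filename, tags, etc.) is always indexed.
    return {"metadata": True, "contents": send_contents}
```

Under these assumptions, a small file.doc gets both its metadata and its contents indexed, while a 21 MB PDF is findable by metadata only.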
So, if I do not get any results when searching rich file content, something is
wrong? As in the example above, I do not get the file containing the word.
I am running Solr from the Solr example directory with the solrconfig.xml file you
provided months ago, when I had issues with Solr.
Regards,
Nikola
Original comment by ngara...@gmail.com
on 5 Nov 2010 at 12:50
I use Solr 1.3 with the patch for parsing rich documents, and when uploading,
for example, a PDF file, the only thing I see in solr.log is the following:
INFO: [] webapp=/solr path=/update/rich params={id=250&stream.type=pdf&fieldnames=id,name&commit=true&stream.fieldname=body&name=iphone+user+guide+pdf+iphone_user_guide.pdf} status=0 QTime=12656
solrconfig.xml contains the line:
<requestHandler name="/update/rich" class="solr.RichDocumentRequestHandler" startup="lazy" />
What else am I missing?
Since I am running Solr standalone, I do not need to build it with ant, do I?
I will try to run the JBoss server in debug mode to check whether something
shows up in the log.
Original comment by ngara...@gmail.com
on 11 Nov 2010 at 5:23
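To see exactly what the request in that log line asked Solr to do, the params string can be decoded. The following is a small Python sketch that only parses the string quoted above; it does not contact Solr, and the variable names are illustrative.

```python
from urllib.parse import parse_qs

# The params string copied from the solr.log line quoted above.
params = (
    "id=250&stream.type=pdf&fieldnames=id,name&commit=true"
    "&stream.fieldname=body"
    "&name=iphone+user+guide+pdf+iphone_user_guide.pdf"
)

# parse_qs returns a list per key; each key occurs once here.
parsed = {key: values[0] for key, values in parse_qs(params).items()}

print(parsed["stream.type"])            # type the handler was told to parse
print(parsed["stream.fieldname"])       # field meant to receive the extracted text
print(parsed["fieldnames"].split(","))  # metadata fields sent with the request
print(parsed["name"])                   # "+" is decoded to a space
```

The status=0 in the log line means Solr reported no error for the request. Whether the text placed in the body field is actually searchable depends on the schema, which is not shown in this thread.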
Just a quick note. I believe we will have the new, redesigned, Solr 1.4-based
version in a week or so.
Original comment by fstamate...@gmail.com
on 11 Nov 2010 at 9:14
So, does that mean that content searching (rich document parsing) currently
does not work with Solr 1.3?
Original comment by ngara...@gmail.com
on 12 Nov 2010 at 8:06
Migration to Solr 1.4 has been completed. Changeset fe8bddad316f (branch
solr1.4) will be merged to the default branch when testing is finished.
Original comment by chstath
on 19 Nov 2010 at 2:51
Could you attach the solrconfig.xml you use? The example file has the field
names from the example schema.xml file, not the required field names.
Regards,
Nikola
Original comment by ngara...@gmail.com
on 22 Nov 2010 at 10:11
If you have checked out the solr1.4 branch then you already have
solrconfig.xml and schema.xml in the gss source tree (solr/conf folder). The ant
build script contains some new targets, such as run-solr, that will copy the Solr
conf files to the right place and run Solr.
Original comment by chstath
on 22 Nov 2010 at 10:41
But the files look like the default ones, at least solrconfig.xml. For example,
you commented out price and cat and other things that are not used, but you left
everything in the Solr search configuration ...
I am talking about this:
<requestHandler name="dismax" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
</str>
<str name="pf">
text^0.2 features^1.1 name^1.5 manu^1.4 manu_exact^1.9
</str>
<str name="bf">
popularity^0.5 recip(price,1,1000,1000)^0.3
</str>
<str name="fl">
id,name,price,score
</str>
<str name="mm">
2<-1 5<-2 6<90%
</str>
Regards,
Nikola
Original comment by ngara...@gmail.com
on 22 Nov 2010 at 11:11
You are right, but we don't use the dismax handler any longer, so I didn't touch
its configuration.
Original comment by chstath
on 22 Nov 2010 at 11:19
OK, another question about rich text documents. How do I know a file is being
parsed? I downloaded some Solr rich document test files (
https://issues.apache.org/jira/secure/attachment/12381612/test-files.zip ) and
uploaded the 6a.Brennan-Performance.ppt file. I can search its content. I also
have some ppt files of my own for testing, but it seems they don't get parsed at all.
So, are there any rules for which files get parsed, and how do I know whether
Solr parsed a file successfully?
Check the attached file; it does not get parsed here.
Original comment by ngara...@gmail.com
on 23 Nov 2010 at 11:57
Attachments:
> So, are there any rules for files to get parsed or not, how do I know solr
> parsed it successfully or not.
We do not store whether the file contents were successfully indexed. During
indexing the MDB posts the file and metadata to Solr. If Solr complains
(exception, error code, etc.), we post the metadata again (but not the file).
For example, the contents of a "locked" PDF will not be indexed.
I tried uploading your test2.ppt file and it seems ok. I searched with
"Storage" and "System" and the file came up in the search results. I'm testing
on the Solr1.4 branch. We are currently testing the new implementation and we
will soon merge it with HEAD. You will have to re-index though (if you have an
installation with actual files).
Also, as for whether you can see in the UI if your files were successfully
indexed: no, not in the current implementation. I would personally like to have
some indication in the file properties or even in the file list; however, we
have not decided on this yet.
Original comment by fstamate...@gmail.com
on 23 Nov 2010 at 1:25
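The fallback described in the comment above (post file and metadata; if Solr complains, post the metadata alone) might look roughly like this. It is a hedged Python sketch, not the MDB's actual code: `index_file`, `post_rich` and `post_metadata` are hypothetical names, and the two callables stand in for the HTTP posts to Solr.

```python
def index_file(file_id, metadata, contents, post_rich, post_metadata):
    """Try to index contents plus metadata; fall back to metadata only.

    post_rich and post_metadata are hypothetical callables standing in for
    the posts to Solr; they are expected to raise on any Solr error.
    """
    try:
        post_rich(file_id, metadata, contents)
        return "contents+metadata"
    except Exception:
        # Solr complained (for example a locked PDF): index the metadata
        # alone, so the file can still be found by name or tags.
        post_metadata(file_id, metadata)
        return "metadata-only"
```

The returned string is the piece of information that, according to the comment, is currently not stored anywhere; it appears here only to make the two outcomes visible.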
Oh, that might be the reason. I am still using the old Solr 1.3 gss version;
sorry I did not mention it.
About parsing, too bad there is no feedback visible to the user. I understand,
but from the user's perspective, how can the user know which files are
searchable by content and which are not...
Regards,
Nikola
Original comment by ngara...@gmail.com
on 23 Nov 2010 at 1:32
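Since the UI gives no feedback, one manual way to check a single file is to query Solr directly for a word known to be in the file, restricted to that file's id. The sketch below only builds such a query URL. The base URL (the default port of the Solr example server), the function name and the content field name `body` (taken from `stream.fieldname=body` in the Solr 1.3 log line earlier in the thread) are assumptions and may not match the solr1.4 branch schema.

```python
from urllib.parse import urlencode


def content_check_url(doc_id, word,
                      base="http://localhost:8983/solr",  # assumed Solr location
                      field="body"):                      # assumed content field
    """Build a select URL that matches only if `word` was indexed for doc_id."""
    query = {
        "q": f"{field}:{word}",  # search the content field only
        "fq": f"id:{doc_id}",    # restrict the result to one document
        "fl": "id,name",         # return just the identifying fields
    }
    return f"{base}/select?{urlencode(query)}"


print(content_check_url(250, "Storage"))
```

If the response for such a URL contains no documents, the word was not indexed in that field for that file, which suggests the contents were not parsed.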
Original comment by chstath
on 29 Nov 2010 at 10:00
Original comment by chstath
on 29 Nov 2010 at 11:41
Original issue reported on code.google.com by
ngara...@gmail.com
on 5 Nov 2010 at 10:55