PiRSquared17 / bpbible

Automatically exported from code.google.com/p/bpbible

Indexed Search for words in BPB ordinary finds more than words #112

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Making an indexed word search in BPB ordinary finds more than words:
Having WEB indexed: Searching for oil gives 199 results (including oil,
oils and oiled)
2. On the other hand, making an unindexed word search in BPB ordinary gives
exactly the words: Searching for oil gives 196 results (including oil but
not oils and oiled) 
3. On the other hand, making an unindexed or indexed word search in BPB
portable gives exactly the words: Searching for oil gives 196 results
(including oil but not oils and oiled) 
4. Making an indexed or unindexed word search in BPB ordinary or portable
with +oil gives exactly the words: 196 results (including oil but not oils
and oiled).

The documentation http://code.google.com/p/bpbible/wiki/Search explains
only how it works for indexed searches in the ordinary version:

'Stemming
Multi-word searches will perform stemming. Stemming refers to the process
of removing the ends of words, to find similar words. For example,
searching for test will now return results for tests, tested and testing as
well. If you want to search for the exact word, put a plus in front of the
word (for example, +test).'

That is, searching for test returns only the word test for unindexed
searches, and for indexed searches in the portable version.
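
The documented behaviour can be pictured with a small sketch. It is only an illustration: `toy_stem` is a crude suffix-stripper invented here as a stand-in for a real stemmer, and `matches` is a hypothetical helper, not BPBible's search code.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a few common English endings."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(term, text_word):
    """Return True if text_word satisfies the search term.

    A leading '+' asks for the exact word; otherwise both sides are
    stemmed before comparison, so related forms also match.
    """
    text_word = text_word.lower()
    if term.startswith("+"):
        return text_word == term[1:].lower()
    return toy_stem(text_word) == toy_stem(term.lower())


words = ["oil", "oils", "oiled", "soil"]
stemmed_hits = [w for w in words if matches("oil", w)]   # oil, oils, oiled
exact_hits = [w for w in words if matches("+oil", w)]    # oil only
```

With this sketch, a plain `oil` matches oil, oils and oiled, while `+oil` matches only oil, which mirrors the difference between the 199 and 196 results reported above.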

What is the expected output? What do you see instead?

-> Make the indexed/unindexed search tool the same for the ordinary BPBible:
I would prefer to find single words (as it does in the portable version)
and keep the + tool (+ preceding words) for multi-word searches.

What version of the product are you using? On what operating system?
BPB 0.4.5 ordinary, and BPB 0.4.5 portable

If the problem happens with one particular module (Bible, commentary,
dictionary, or book), what module is it?

Please provide any additional information below.

Original issue reported on code.google.com by wolfgang...@gmx.de on 15 Oct 2009 at 12:35

GoogleCodeExporter commented 9 years ago
BPBible Portable and BPBible ordinary use the same code base, and so should perform exactly the same. I think the most likely reason is that you are using different versions of BPBible (though I could be wrong, since stemming was included in 0.4 and that is quite a while ago). Could you please check which version of BPBible you are using for each one?

As to whether stemming should be the default or not, I find that it sometimes adds useful and relevant additional results, and sometimes doesn't. This makes it difficult to weigh benefits against costs, but I would say that it is easier to skip irrelevant results than to find results which have been missed by the search and you haven't seen. If this is true it suggests to me that stemming should be the default.

Original comment by jonmmor...@gmail.com on 16 Oct 2009 at 12:42

GoogleCodeExporter commented 9 years ago

Original comment by jonmmor...@gmail.com on 16 Oct 2009 at 12:42

GoogleCodeExporter commented 9 years ago
OK, obviously I didn't read your comment carefully enough (sorry about that). You do say you are using 0.4.5 for both.

Original comment by jonmmor...@gmail.com on 17 Oct 2009 at 5:54

GoogleCodeExporter commented 9 years ago
Yes, I am using BP version 0.4.5 for both.
What irritated me is that in
- BP portable, a simple search (e.g. for 'oil') finds exact words in both indexed and unindexed modules, whereas in
- BP normal, a simple search finds exact words for unindexed modules, but stems for indexed ones.
So in three cases words are found, and only in one case (BP normal, indexed search) stems.

I am in favour of making word search the default for simple searches.

In English, searching for stems works fine.
E.g. searching for 'see' only finds words connected with the stem 'see', and not words with the stems 'seem', 'seek' or 'seed'.
So the search for 'see' is different from the one for 'see*'.
I suppose that you have built in an additional tool to do this.

But this does not work for other languages.
E.g. in German, searching for 'neu' (= new) finds two stems: 'neu' and 'neun' (= nine).
And I can imagine that it will be the same for other languages (e.g. French).

And as the Sword library offers material in many languages, searching for stems will (probably) not work in those.

Original comment by wolfgang...@gmx.de on 17 Oct 2009 at 6:49

GoogleCodeExporter commented 9 years ago
If BPBible Portable does not use stemming for (indexed) searches, that is a bug. They both use the same code base and so they both should do stemming. I have not yet been able to reproduce this behaviour.

Stemming is done by the Snowball stemmer. It looks like it will not do stemming for languages that are not marked as supported (currently including English, Spanish, French, German, Italian and a few others).
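
A rough sketch of that fallback follows. The language set lists only the languages named above, and `stem_for_language` and its `stemmers` argument are hypothetical names used for illustration; they are not taken from BPBible or from the Snowball bindings.

```python
# Languages named in the comment above; the real supported list is longer.
SUPPORTED_LANGUAGES = {"english", "spanish", "french", "german", "italian"}


def stem_for_language(word, language, stemmers):
    """Stem word if language is supported and a stemmer is available.

    stemmers maps a language name to a callable taking and returning a str.
    Unsupported languages fall back to the unchanged word, so the search
    behaves like an exact-word search.
    """
    language = language.lower()
    if language not in SUPPORTED_LANGUAGES or language not in stemmers:
        return word
    return stemmers[language](word)


# A deliberately trivial stand-in stemmer, used only to exercise the gate.
demo_stemmers = {"english": lambda w: w[:-1] if w.endswith("s") else w}

english_result = stem_for_language("oils", "English", demo_stemmers)
# "hebrew" is simply a name that is not in the set above.
unsupported_result = stem_for_language("oils", "hebrew", demo_stemmers)
```

Here `english_result` is the stemmed form, while `unsupported_result` comes back unchanged because no stemming is attempted for that language.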

For unindexed search we use the search provided by Sword solely for those who for whatever reason do not wish to build indexes. It is not strongly supported or recommended by us, and I do not consider it important that it does not produce the same results.

Original comment by jonmmor...@gmail.com on 19 Oct 2009 at 1:51

GoogleCodeExporter commented 9 years ago
I have both BP ordinary and BP portable installed in Ubuntu/VirtualBox/WinXP (it would be great to run it directly in Ubuntu, but I have not managed to do this so far).
I cross-checked it again and BP portable does not stem searches.
Is this Google issue tracker also for BP portable, or is another one used?

About stemming, this is a helpful feature, which really should be kept.
Thanks for the information about it.

What about the following, in order to make it easier for users of
- resources in languages not supported by Snowball and
- both indexed and unindexed resources:

-> Let the
- search for a word find the word only (e.g. 'see') and
- search for a word preceded by + find the stem
The result would be that searching for a word returns the word only, regardless of whether the language is supported by the Snowball stemmer and regardless of whether a resource is indexed.

In this case, searching for a word preceded by + would work only for indexed resources and in languages which are supported by the Snowball stemmer.

Original comment by wolfgang...@gmx.de on 19 Oct 2009 at 5:47

GoogleCodeExporter commented 9 years ago
I understand your bug report, but I am still surprised that BPB Portable behaves differently. It is using exactly the same indexed search.

As for whether stemming should be the default, I still think it should be. Most users will not look at a help page, and so whatever is the default will be the most commonly used one. As said in comment 1, I think stemming is better on the balance than not stemming, and so should probably be the default (though sometimes it is certainly annoying and gets a lot of unnecessary and irrelevant results). However, we should probably consider making it clearer what exactly it has searched for, and make it easy to switch to just searching for the word (maybe something like "searched for oil, oils and oiled.  Click here to search for just oil.")
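
A minimal sketch of how such a notice could be assembled is shown here. The function name `search_summary` and the way matched forms are passed in are assumptions made for illustration; nothing like this is confirmed to exist in BPBible.

```python
def search_summary(query, matched_forms):
    """Build a note listing the word forms a stemmed search matched.

    matched_forms is any iterable of the distinct or repeated word forms
    that the search hit; duplicates are dropped before formatting.
    """
    # Shortest forms first, so the base word tends to lead the list.
    forms = sorted(set(matched_forms), key=lambda w: (len(w), w))
    if not forms or forms == [query]:
        return "Searched for %s." % query
    if len(forms) == 1:
        listed = forms[0]
    else:
        listed = ", ".join(forms[:-1]) + " and " + forms[-1]
    return "Searched for %s. Click here to search for just %s." % (listed, query)


message = search_summary("oil", ["oils", "oil", "oiled", "oil"])
```

When only the exact word was matched, the sketch drops the "click here" hint, since there is nothing to narrow down.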

Original comment by jonmmor...@gmail.com on 19 Oct 2009 at 10:09

GoogleCodeExporter commented 9 years ago
It is OK for me to make searching for stems the default. As you mentioned in your comment 7, it should be made clear what the program does. The reason for this is that many Bible programs do not default to stem searching, but to word searching (e.g. BibleTime, Xiphos for indexed resources).

-> What about building 2 options into the search dialog:
- the stem option and
- the word option
in order to be able to choose what you want to search for. The stem option could be selected by default, so if you do not choose an option, the search will look for stems.

In case stem searching is not available (e.g. in languages without stemmers, or in unindexed modules), the stem option should be greyed out and the word search activated.
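
The proposed dialog logic could look roughly like this. It is a sketch of the suggestion only: `search_options`, `SearchOptions` and the parameter names are hypothetical and do not describe any existing BPBible dialog.

```python
from collections import namedtuple

# stem_enabled: whether the stem option is selectable (False = greyed out)
# selected: which option is active, either "stem" or "word"
SearchOptions = namedtuple("SearchOptions", ["stem_enabled", "selected"])


def search_options(module_indexed, stemmer_available, requested="stem"):
    """Decide which option is selectable and which one is active.

    The stem option is offered only for indexed modules in a language with
    a stemmer; otherwise it is disabled and the word search is activated.
    """
    stem_enabled = bool(module_indexed and stemmer_available)
    if requested == "stem" and stem_enabled:
        selected = "stem"
    else:
        selected = "word"
    return SearchOptions(stem_enabled, selected)


default_choice = search_options(module_indexed=True, stemmer_available=True)
unindexed_choice = search_options(module_indexed=False, stemmer_available=True)
```

Stem search stays the default whenever it can work, and the word search takes over silently when it cannot.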

Original comment by wolfgang...@gmx.de on 23 Oct 2009 at 11:41

GoogleCodeExporter commented 9 years ago
Just got BP installed in Linux, and the ordinary search (e.g. for see) finds words only (ESV: 663) and not stems.

Original comment by wolfgang...@gmx.de on 23 Oct 2009 at 11:43

GoogleCodeExporter commented 9 years ago
I have BPBible installed under Ubuntu (Karmic Koala), and for the ESV indexed search I get:
Search "see": 770 references, 825 hits.
Search "+see": 663 references, 708 hits.

Unindexed search gets 663 references as expected.

Comment 8 sounds fine, but it's very unlikely to be in until the 0.5 series, which will almost certainly not be this year. I will still try to find out why stemming isn't working for BPBible Portable and perhaps some Linux installations.

Thanks again.

Original comment by jonmmor...@gmail.com on 23 Oct 2009 at 11:51

GoogleCodeExporter commented 9 years ago
How have you installed BPBible under Linux?  Are you running it from source?

Have you installed PyStemmer? If it is not present then BPBible will work fine, but stemming will not be supported (it should print a warning to the console, "Snowball not installed"). I installed it with "easy_install PyStemmer" and run BPBible from source and it does have stemming.
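
For readers unfamiliar with this kind of optional dependency, the pattern is commonly written along the following lines. This is a sketch, not BPBible's actual code: only the `Stemmer` module name, `Stemmer.Stemmer(...)` and `stemWord(...)` come from PyStemmer, while `stem_word` and `STEMMING_AVAILABLE` are names made up here.

```python
import sys

try:
    import Stemmer  # module provided by the PyStemmer package
except ImportError:
    Stemmer = None
    # Mirror the console warning mentioned above.
    sys.stderr.write("Snowball not installed\n")

STEMMING_AVAILABLE = Stemmer is not None


def stem_word(word, language="english"):
    """Return the Snowball stem of word, or word unchanged without PyStemmer."""
    if Stemmer is None:
        return word
    return Stemmer.Stemmer(language).stemWord(word)
```

Running `python -c "import Stemmer"` in the same environment that runs BPBible is a quick way to check whether the package is visible to it.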

Original comment by jonmmor...@gmail.com on 24 Oct 2009 at 12:14

GoogleCodeExporter commented 9 years ago
As the Linux install procedure is too difficult for me, I asked someone else to do this for me. So I will forward this question to him.
As far as I remember, we installed it from source, using the instructions found in
<http://code.google.com/p/bpbible/wiki/RunningBPBible>

Anyhow, in Ubuntu 8.04 (the last LTS version), I cannot find PyStemmer in Synaptic. Is this included in Karmic Koala?

Original comment by wolfgang...@gmx.de on 24 Oct 2009 at 2:20

GoogleCodeExporter commented 9 years ago
The running BPBible page did not list PyStemmer as a dependency - I have changed it so that it does.

PyStemmer is not included in Karmic, as far as I can tell. I will probably look at getting it packaged for the next version of Ubuntu. However, it can be installed through the Python setup tool easy_install. From memory, the steps I used were:
1. Install python-setuptools.
2. Install python-dev.
3. Run "easy_install PyStemmer".

Original comment by jonmmor...@gmail.com on 25 Oct 2009 at 9:23

GoogleCodeExporter commented 9 years ago
I tried rebuilding BPBible Portable and it stemmed correctly when using indexed search. It is possible that BPBible Portable 0.4.5 was built incorrectly, but I will test 0.4.6 (both normal and portable) before releasing to ensure that stemming occurs when expected.

Original comment by jonmmor...@gmail.com on 10 Nov 2009 at 12:12

GoogleCodeExporter commented 9 years ago
Great!

Do you also plan to make an overhaul of the search engine to make it easier to use?

Original comment by wolfgang...@gmx.de on 12 Nov 2009 at 7:26

GoogleCodeExporter commented 9 years ago
I doubt we will for the 0.4 line, I'm sorry.

Original comment by jonmmor...@gmail.com on 12 Nov 2009 at 12:04

GoogleCodeExporter commented 9 years ago
While it would be nice to have all of these issues fixed for 0.5, they are not 
critical to the 0.5 goals and so some of them may be deferred to after 0.5.

Original comment by jonmmor...@gmail.com on 17 Jul 2010 at 8:01