Letractively / dataparksearch

Automatically exported from code.google.com/p/dataparksearch
GNU General Public License v2.0

Enhancement: User defined plugins. #5

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
DPsearch has the ability to search across a very large set of documents (we
have tested with over 20M). We can search the entire document space or
parts of it based on the concept of sections and limits (like
meta-tags, last-modified-date ...). However, as with most search engines,
searches are restricted to information that has been indexed. Thus if we
have some new information about a document, or existing information that was
not used to create special section or limit indexes, then it becomes
difficult to search on it without re-indexing the collection. Additionally,
such restrictions are often best dealt with by other programs that can apply
logic that is not necessarily of the "search" type. A couple of examples
would be:

Let's assume that the documents indexed have some information about, say, the
geography associated with the document. However, when the collection was
originally indexed the geography was not considered important and no
geography section was created. It would be nice to be able to search the
document collection for the search criteria and then filter the results by
some geography restriction. Obviously the simplest solution would be to add
a definition of a geography section and re-index the collection.
However, with very large collections this is very expensive both in terms of
time and disk space. In addition we could end up with literally dozens if
not hundreds of sections.

Another situation would be where the documents found need to be restricted
on some criteria not related to a search (e.g. only show the documents
"permitted" to the user making the query). Again, in theory we could index
some combination of ownership and other restrictions, but the
information is pretty dynamic and we would need to re-index all the time.

The solution proposed is to have the notion of "filter plugins" added to
dpsearch. The dpsearch engine gathers all the search results into an array
and then, after removing duplicates, clones etc., retrieves the document
information for a pageful. In this case imagine a small user-provided
function that is called after the result list is built and cleaned but before
the document information retrieval step. The filter could then get a list
of record ids for the documents and return a modified list that may
have some records removed (or added?) based on external criteria. This will
allow fine-grained local control over the results. If such a mechanism were
available then we could solve the situations above by doing the following:

Build a new database table (in the same database as used by dpsearch or a
separate one) that tracks the record id and has columns for the other
meta-data. The plugin would then filter the results using the
database information. Clearly it will be slower than using the index natively,
but for metadata that is infrequently used but large, or for new metadata
that will eventually be indexed, this would work quite well as a transitional
mechanism.

Similarly for the second example the plugin would call an external
permissions program that could resolve the permission based on other
criteria which have nothing to do with the search engine.

Finally, we could add records to the result set if deemed necessary
(though I suspect that this is better done outside the search engine when
creating a results page).
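
The filtering step described above can be sketched in C as follows. This is only an illustration under stated assumptions: `filter_results`, `keep_fn`, `allow_list` and `in_allow_list` are hypothetical names that do not exist in dpsearch, record ids are assumed to fit in a `uint32_t`, and a sorted in-memory allow-list stands in for the external database or permissions lookup.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate type: returns nonzero if the record may stay
 * in the result list. ctx carries whatever the predicate needs. */
typedef int (*keep_fn)(uint32_t rec_id, void *ctx);

/* Compacts rec_ids in place, preserving the original order, and returns
 * the number of record ids that remain. */
size_t filter_results(uint32_t *rec_ids, size_t count, keep_fn keep, void *ctx)
{
    size_t out = 0;
    for (size_t i = 0; i < count; i++) {
        if (keep(rec_ids[i], ctx)) {
            rec_ids[out++] = rec_ids[i];
        }
    }
    return out;
}

/* Stand-in for the external lookup (assumption): a sorted list of record
 * ids that are allowed, e.g. loaded from a side table. */
struct allow_list {
    const uint32_t *ids;   /* sorted ascending */
    size_t count;
};

/* Binary search over the sorted allow-list. */
int in_allow_list(uint32_t rec_id, void *ctx)
{
    const struct allow_list *al = ctx;
    size_t lo = 0, hi = al->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (al->ids[mid] == rec_id) {
            return 1;
        }
        if (al->ids[mid] < rec_id) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return 0;
}
```

Compacting in place keeps the ranking order produced by the search and avoids allocating a second array; a real plugin would replace the allow-list with a query against the side table or a call to the permissions program.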

The changes that I see would be:

Ability to build a plugin - best would be the ability to have a shared
library that can be set up in the config file. If defined and present then
the search engine would use it, and if not it would not. The plugin API
should be very simple (at least for starters):
- Call to initialize the plugin
- Call to re-initialize the plugin (when the search engine gets a HUP signal)
- Call to terminate the plugin
- Call to process a result list (I suspect it only needs to take an array of
proposed results and the command line, and return the array of results)
- We could add additional APIs available to the plugin to access dpsearch
functions for ease of writing the plugin - e.g. functions to print messages
into the log ...
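
One possible shape for the API listed above is a table of function pointers that a shared library would export. The sketch below is an assumption, not an existing dpsearch interface: `filter_plugin`, `apply_plugin` and the toy `even_plugin` are hypothetical names, and record ids are assumed to be `uint32_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plugin interface; none of these names exist in dpsearch. */
typedef struct filter_plugin {
    int  (*init)(void **state);      /* called once when the engine starts */
    int  (*reinit)(void *state);     /* called when the engine gets a HUP signal */
    void (*terminate)(void *state);  /* called when the engine shuts down */
    /* Filters rec_ids in place and returns the new count. params is the
     * raw parameter string passed along with the query. */
    size_t (*process)(void *state, uint32_t *rec_ids, size_t count,
                      const char *params);
} filter_plugin;

/* Host side: use the plugin if one is configured, otherwise leave the
 * result list untouched. */
size_t apply_plugin(const filter_plugin *p, void *state,
                    uint32_t *rec_ids, size_t count, const char *params)
{
    if (p == NULL || p->process == NULL) {
        return count;
    }
    return p->process(state, rec_ids, count, params);
}

/* Toy plugin used only for illustration: keeps even record ids. */
static int even_init(void **state) { *state = NULL; return 0; }
static int even_reinit(void *state) { (void)state; return 0; }
static void even_terminate(void *state) { (void)state; }

static size_t even_process(void *state, uint32_t *rec_ids, size_t count,
                           const char *params)
{
    (void)state;
    (void)params;
    size_t out = 0;
    for (size_t i = 0; i < count; i++) {
        if (rec_ids[i] % 2 == 0) {
            rec_ids[out++] = rec_ids[i];
        }
    }
    return out;
}

const filter_plugin even_plugin = {
    even_init, even_reinit, even_terminate, even_process
};
```

Passing an opaque `state` pointer lets the plugin hold a database connection or cache between calls without the engine knowing its layout. Loading such a table from a shared library named in the config file is left out of the sketch.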

Changes to dpsearch:
- Additional parameters to pass information to the plugin. E.g.
&pluginparms="parm1, parm2, parm3"
- Configuration file changes to define the plugin
- Code changes to call the plugin.
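
As an illustration of how a `pluginparms` value such as the one above could be handed to a plugin, here is a minimal sketch that splits it into individual parameters. `split_plugin_parms` is a hypothetical helper, and the sketch assumes the value has already been URL-decoded and stripped of its surrounding quotes.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Splits a comma-separated parameter string in place. Leading and
 * trailing whitespace is trimmed from each token, empty tokens are
 * skipped, and at most max_parms pointers are stored in parms.
 * Returns the number of tokens stored. The input buffer is modified. */
size_t split_plugin_parms(char *value, char **parms, size_t max_parms)
{
    size_t n = 0;
    char *p = value;

    while (p != NULL && n < max_parms) {
        char *comma = strchr(p, ',');
        if (comma != NULL) {
            *comma = '\0';               /* terminate the current token */
        }
        while (isspace((unsigned char)*p)) {
            p++;                         /* trim leading whitespace */
        }
        char *end = p + strlen(p);
        while (end > p && isspace((unsigned char)end[-1])) {
            end--;                       /* trim trailing whitespace */
        }
        *end = '\0';
        if (*p != '\0') {
            parms[n++] = p;
        }
        p = (comma != NULL) ? comma + 1 : NULL;
    }
    return n;
}
```

Called on a writable copy of `parm1, parm2, parm3`, it stores pointers to the three tokens `parm1`, `parm2` and `parm3`.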

Questions:
- What happens if there are multiple plugins? In particular, how are
command-line parameters passed to each of them?
- Where is the plugin called - when the results are obtained in cache.c or
sql.c, or when they are assembled in search.c? Each has a plus/minus in
terms of having access to information (e.g. if there are multiple indexes,
calling the plugin from cache.c or sql.c means that the plugin can get
the correct database information and use it, as opposed to
search.c, where the search may be running on a machine with no access to the
actual database).
- What languages should be allowed? Clearly the application is in C, so C or
C++ is natural - however it is also easier to write plugins in some
scripting language.

Original issue reported on code.google.com by amitshar...@gmail.com on 18 Aug 2008 at 1:09

GoogleCodeExporter commented 9 years ago
In the latest snapshot of version 4.53 the Limit command has been extended so
that it can accept an SQL query which returns possible pairs of limit value
and url.rec_id.
E.g.
Limit prm:strcrc32 "SELECT label, rec_id FROM labels" pgsql://u:p@database.ext/site/

The third parameter (DBAddr) is optional; it is used to specify a connection
to another SQL database where the limit table resides.
prm - the name of the limit, and also the name of the CGI parameter used for
this limit
strcrc32 - the type of the limit; in this case the limit value is a string.

Instead of strcrc32 it is possible to use any of the following limit types:
hex8str - hex string or base-26 string similar to those used in categories;
a nested limit will be created;
int - integer value (4-byte integer).

In the search.htm and searchd.conf configuration files it's possible to
specify a reduced variant of this Limit command:
Limit prm:strcrc32

Original comment by dp.max...@gmail.com on 3 May 2009 at 12:11