apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Oracle JVM implementation for Lucene DataStore also a preliminary implementation for an Oracle Domain index using Lucene [LUCENE-724] #1799

Closed asfimport closed 13 years ago

asfimport commented 17 years ago

Here is a preliminary implementation of the Oracle JVM Directory data store, which replaces the file system with BLOB data storage. The reason to do this is:

create index it1 on t1(f2) indextype is LuceneIndex parameters('test');

Assuming that table t1 has a column f2 of type VARCHAR2, CLOB or XMLType, the Lucene inverted index can then be queried using a new Oracle operator:

select * from t1 where contains(f2, 'Marcelo') = 1;

The important point here is that this query is integrated with the execution plan of the Oracle database. In this simple example, the Oracle optimizer sees that column "f2" is indexed with the Lucene Domain index; then, through the Data Cartridge API, Java code running inside the Oracle JVM is executed to open the search, fetch all the ROWIDs that match "Marcelo", and get the rows using those pointers. Here is the output:

SELECT STATEMENT ALL_ROWS 3 1 115
  TABLE ACCESS(BY INDEX ROWID) LUCENE.T1 3 1 115
    DOMAIN INDEX LUCENE.IT1

Another benefit of using the Data Cartridge API is that when rows of table T1 are inserted, updated or deleted, a corresponding Java method is called to update the Lucene index automatically. There is a simple HTML file with some explanation of the code. The install.sql script is not fully tested and must be run locally on the Oracle database server, not remotely. Best regards, Marcelo.
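As an illustration of the index maintenance described above, here is a minimal Java sketch. It is not the code from the attached patch: the class `RowidIndexSketch` and its methods are hypothetical, and a plain in-memory map stands in for the Lucene index. Oracle's extensible indexing interface defines maintenance routines such as ODCIIndexInsert, ODCIIndexUpdate and ODCIIndexDelete; how the patch implements them is not shown in this thread, so the sketch only mirrors the idea that each row change triggers a matching index change keyed by ROWID.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy model: an in-memory inverted index keyed by ROWID.
// It only imitates the callbacks a domain index receives on row changes.
class RowidIndexSketch {
    private final Map<String, Set<String>> postings = new HashMap<>(); // term -> rowids
    private final Map<String, String> rows = new HashMap<>();          // rowid -> indexed text

    // Called when a row is inserted: index every term of the column value.
    void onInsert(String rowid, String text) {
        rows.put(rowid, text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(rowid);
        }
    }

    // Called when a row is deleted: remove the rowid from every posting list.
    void onDelete(String rowid) {
        String old = rows.remove(rowid);
        if (old == null) {
            return; // nothing was indexed for this rowid
        }
        for (String term : tokenize(old)) {
            Set<String> ids = postings.get(term);
            if (ids != null) {
                ids.remove(rowid);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    // Called when a row is updated: modelled here as delete followed by insert.
    void onUpdate(String rowid, String newText) {
        onDelete(rowid);
        onInsert(rowid, newText);
    }

    // Returns a copy of the rowids whose text contains the given term.
    Set<String> contains(String term) {
        Set<String> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : new TreeSet<>(ids);
    }

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    private static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        RowidIndexSketch index = new RowidIndexSketch();
        index.onInsert("AAA1", "Marcelo Ochoa");
        index.onInsert("AAA2", "Michael Goddard");
        index.onUpdate("AAA2", "Marcelo again");
        index.onDelete("AAA1");
        System.out.println(index.contains("Marcelo"));
    }
}
```

The rowids returned by `contains` play the role of the pointers that the database uses to fetch the matching rows.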


Migrated from LUCENE-724 by Marcelo F. Ochoa, 1 vote, resolved Jan 26 2011

Environment:

Oracle 10g R2 with the latest patchset. There is a txt file in the lib directory listing the libraries required to compile this extension, which for legal reasons I can't redistribute. All these libraries are included in the Oracle home directory.

Attachments: ojvm.tar.gz, ojvm-01-09-07.tar.gz, ojvm-09-27-07.tar.gz, ojvm-11-28-06.tar.gz, ojvm-12-20-06.tar.gz

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

see patch description

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

This new version of the OracleJVM extension for Lucene has these changes:

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

This new release of the OJVMDirectory Lucene Store includes a fully functional Oracle Domain Index with a queue for bulk update/insert operations and many performance improvements. See the db/readmeOJVM.html file for more detail.

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

Latest code includes:

TODO:

asfimport commented 17 years ago

Michael Goddard (migrated from JIRA)

Marcelo,

Are you still working on this? I have been experimenting with it recently – thank you for creating it. Do you think that the I/O might be faster if the Vector was replaced with BLOB I/O via InputStream, OutputStream directly? That is what I am working with right now, and I did observe my indexing time for a sample data set go from 22 seconds to 13 seconds. I do currently have the problem that the resulting index is not behaving correctly and am working on that.
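To make the two I/O strategies mentioned here concrete, the following is a rough Java sketch, not code from the patch or from Michael's experiment. The class `BlobWriteSketch`, its method names and the 1024-byte block size are all assumptions made for illustration, and a `ByteArrayOutputStream` stands in for the BLOB so the example runs without a database. In standard JDBC such a stream could be obtained with `java.sql.Blob.setBinaryStream(long)`; whether either implementation does that is not stated in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Vector;

// Hypothetical comparison of two ways to push bytes into a BLOB-like stream.
class BlobWriteSketch {
    static final int BLOCK_SIZE = 1024; // assumed block size, for illustration only

    // Variant 1: copy the data into a Vector of blocks first, then write the blocks out.
    static void writeViaVector(byte[] data, OutputStream blob) throws IOException {
        Vector<byte[]> blocks = new Vector<>();
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            int len = Math.min(BLOCK_SIZE, data.length - off);
            byte[] block = new byte[len];
            System.arraycopy(data, off, block, 0, len); // extra copy per block
            blocks.add(block);
        }
        for (byte[] block : blocks) {
            blob.write(block);
        }
    }

    // Variant 2: write each chunk straight to the stream, with no intermediate blocks.
    static void writeDirect(byte[] data, OutputStream blob) throws IOException {
        for (int off = 0; off < data.length; off += BLOCK_SIZE) {
            blob.write(data, off, Math.min(BLOCK_SIZE, data.length - off));
        }
    }

    // Runs one variant against an in-memory stand-in for the BLOB and returns what was stored.
    static byte[] capture(boolean direct, byte[] data) {
        ByteArrayOutputStream blob = new ByteArrayOutputStream();
        try {
            if (direct) {
                writeDirect(data, blob);
            } else {
                writeViaVector(data, blob);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blob.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[2500];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        System.out.println(capture(false, data).length + " bytes via Vector");
        System.out.println(capture(true, data).length + " bytes via direct stream");
    }
}
```

Both variants store identical bytes; the sketch only shows where the extra allocation and copying occur and says nothing about actual timings inside the Oracle JVM.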

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

Michael: I have not tested replacing the Vector-based storage with direct BLOB I/O. Right now I am too busy with a project; maybe I'll have some time in a few weeks. If you are replacing the Vector-based access with BLOB I/O, I would certainly like to test it. I still have some open issues, especially with the integration of the Data Cartridge API and the optimizer. Do you have access to an open CVS server to share the code? If not, we can use the DBPrism CVS repository at SourceForge. Also, in a few weeks Oracle 11g will be ready for download from the OTN website, so you can get a lot of performance improvement by using SECURE LOB (faster than NFS storage) and the JDK 1.5 JIT included in the latest Oracle JVM. Best regards, Marcelo.

– Marcelo F. Ochoa http://marcelo.ochoa.googlepages.com/home


Do you Know DBPrism? Look @ DB Prism's Web Site http://www.dbprism.com.ar/index.html More info? Chapter 17 of the book "Programming the Oracle Database using Java & Web Services" http://www.amazon.com/gp/product/1555583296/ Chapter 21 of the book "Professional XML Databases" - Wrox Press http://www.amazon.com/gp/product/1861003587/ Chapter 8 of the book "Oracle & Open Source" - O'Reilly http://www.oreilly.com/catalog/oracleopen/

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

Joaquin at lucene-java-dev wrote: I'm very happy to announce the partial rework and extension of LUCENE-724 (Oracle-Lucene Integration), primarily based on new requirements from LendingClub.com, who commissioned the work from Marcelo Ochoa, the contributor of the original patch (great job Marcelo!). As LendingClub.com's contribution to the Lucene community, we have posted the code on a public CVS (SourceForge) as explained below.

Here at Lending Club ( www.lendingclub.com) we have very specific needs regarding the indexing of both structured and unstructured data, most of it transactional in nature and sitting in our Oracle 10gR2 DB, with a highly complex schema. Our "ranking" of loans in the inventory includes components of exact, textual and hardcore mathematical calculations including time, amount and spatial constraints. This integration of Lucene into Oracle as a Domain Index will now allow us to query this inventory in real time. Going against the Lucene index, created on "synthetic documents" composed of fields populated from diverse tables (user data store), eliminates the need to create very complex joins to link data from different tables at query time. This, along with support for the full Lucene query language, makes this a great alternative to:

  1. Using Lucene outside the database, which requires "crawling" the data and storing the index outside the database, losing all the benefits of a fully transactional system and a secure environment.
  2. Using Oracle Text, which is very powerful but lacks the extensibility and flexibility that Lucene offers (for example, being able to query the index directly from the Java layer or implementing our own ranking algorithm), though to be completely fair some of this is addressed in the new Oracle DB 11g version.

If anyone is interested in learning more about how we are going to use this within Lending Club, please drop me a line. BTW, please make sure you check us out: "Lending Club ( http://www.lendingclub.com/), the rapidly growing people-to-people (P2P) lending service that launched as a Facebook application in May 2007, today announced the public availability of its services with the launch of LendingClub.com. Lending Club connects lenders and borrowers based upon shared affinities, enabling them to bypass banks to secure better interest rates on loans"... more about the announcement here http://www.sys-con.com/read/428678.htm. We have seen many entrepreneurs applying for loans and being helped by regular people to build their business with the money obtained at very low interest.

OK, without further marketing stuff (sorry for that), here is the original note sent to me by Marcelo that summarizes all the new cool functionalities:

OJVMDirectory, a Lucene integration running inside the Oracle JVM, is going one step further.

This new release includes:

Some sample usages:

create table t2 (
  f4 number primary key,
  f5 VARCHAR2(200));
create table t1 (
  f1 number,
  f2 CLOB,
  f3 number,
  CONSTRAINT t1_t2_fk FOREIGN KEY (f3) REFERENCES t2(f4) ON DELETE cascade);
create index it1 on t1(f3) indextype is lucene.LuceneIndex
  parameters('Analyzer:org.apache.lucene.analysis.SimpleAnalyzer;ExtraCols:f2');

alter index it1 parameters('ExtraCols:f2,t2.f5;ExtraTabs:t2;WhereCondition:t1.f3=t2.f4;DecimalFormat:000');

The Lucene domain index will then store columns f2 and f3 of table t1 plus column f5 of table t2.
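The parameter strings in these examples appear to be semicolon-separated `Key:Value` pairs. As a rough illustration of that format, here is a small Java sketch that splits such a string into a map. The class `IndexParametersSketch` is hypothetical and is not the parser used by the OJVMDirectory code; the assumption that only the first colon separates key from value is mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the OJVMDirectory code base.
class IndexParametersSketch {

    // Splits "Key:Value;Key:Value" into an insertion-ordered map.
    // Assumption: only the first ':' of each entry separates key from value.
    static Map<String, String> parse(String parameters) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : parameters.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate empty entries such as a trailing ';'
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in parameter: " + trimmed);
            }
            String key = trimmed.substring(0, colon).trim();
            String value = trimmed.substring(colon + 1).trim();
            result.put(key, value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> params = parse(
            "ExtraCols:f2,t2.f5;ExtraTabs:t2;WhereCondition:t1.f3=t2.f4;DecimalFormat:000");
        for (Map.Entry<String, String> e : params.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Read this way, the alter index statement above sets four options: the extra columns to index, the extra table to join, the join condition, and the number format applied to numeric columns.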

So you can then query it with:

select lscore(1),f2 from t1 where lcontains(f3, 'f2:test',1) > 0;

or

select lscore(1),f2 from t1 where lcontains(f3, 'f2:test and f3:[001 to 200]',1) > 0;

select /*+ DOMAIN_INDEX_SORT */ lscore(1),f2,t2.f5 from t1,t2 where lcontains(f3, 'f2:test1 and f3:[001 to 200] and t2.f5:test2',1) > 0 and t1.f3=t2.f4 order by lscore(1) asc;

In the last example, Oracle's optimizer will assume that the Lucene Domain Index first resolves the set of rowids matching "f2:test1 and f3:[001 to 200] and t2.f5:test2", then accesses table t1 by index rowid and performs the join with t2.
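The range bounds in these queries are written as 001 and 200, matching the DecimalFormat:000 parameter. A likely reason is that terms in the index are compared as strings, so numbers need a fixed width for a range to behave numerically. The Java sketch below illustrates that effect; the class `PaddedRangeSketch` and its methods are hypothetical and only use `java.text.DecimalFormat` from the standard library, not any code from the patch.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical illustration of why numeric values are zero-padded before indexing.
class PaddedRangeSketch {

    // Formats a number with a pattern such as "000", which turns 7 into "007".
    // Locale.ROOT keeps the digits ASCII regardless of the default locale.
    static String pad(long value, String pattern) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.ROOT)).format(value);
    }

    // Inclusive range check using plain string comparison.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Padded: "007" sorts between "001" and "200".
        System.out.println(inRange(pad(7, "000"), pad(1, "000"), pad(200, "000")));
        // Unpadded: "7" sorts after "200", so the same value falls outside the range.
        System.out.println(inRange("7", "1", "200"));
    }
}
```

With this scheme the pattern has to be at least as wide as the largest value stored in the column, otherwise string order and numeric order diverge again.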

More examples and information can be found at: http://dbprism.cvs.sourceforge.net/dbprism/ojvm/Readme.txt?revision=1.10&view=markup

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

This new release includes:


Thanks to LendingClub.com for supporting this contribution.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Is the intent of this to be committed as a contrib module (I notice you do grant ASF license)? This seems like really useful stuff, just not sure how it should be incorporated into Lucene such that we can maintain it. Presumably it needs an Oracle DB to run, right? I also notice CVS directories, etc.

asfimport commented 17 years ago

Marcelo F. Ochoa (migrated from JIRA)

Hi Grant: I would like to share this code with all Lucene users. Yes, it depends on Oracle libraries to compile (see the required-libs.txt file in the lib directory). The code is designed to be extracted into the contrib directory of the Lucene 2.2.0 layout and only requires a minor change to Lucene's main build.xml file:

<target name="jar-test" depends="compile-test">
  <jar destfile="${build.dir}/${final.name}-test.jar" basedir="${build.dir}/classes/test" excludes="**/*.java" />
</target>

This packages Lucene's test suites as a jar for uploading into the Oracle JVM. As part of the contract with LendingClub.com I suggested that the code remain under the Apache 2.0 license, and they agreed. I uploaded the code to SourceForge to provide daily changes to the LendingClub team, but we can move the code to Apache CVS if you want. Best regards, Marcelo. – Marcelo F. Ochoa http://marceloochoa.blogspot.com/

asfimport commented 13 years ago

Shai Erera (@shaie) (migrated from JIRA)

Closing due to long inactivity, and because I'm not sure we want to introduce dependencies on Oracle (or DB2, SQL Server, etc.). We have a DBDirectory over Berkeley DB which demonstrates how to create a Directory impl over some DB instance.