fergiemcdowall / search-index

A persistent, network resilient, full text search library for the browser and Node.js
MIT License
1.38k stars, 149 forks

Indexing with AWS-S3 Bucket is failing #560

Open ApsaraDhanasekar11 opened 2 years ago

ApsaraDhanasekar11 commented 2 years ago

Hi, I tried to create an index on a different backend (an AWS S3 bucket), using the s3leveldown module as the DB store option. The index is created, but queries made with the _SEARCH/QUERY methods return incorrect results. For example, I initialise the DB with the S3 bucket and use the PUT method to add two documents whose text is "Final is the file name" and "what is the version". These are the index entries that get created:

First entry: { key: 'description:file#0.60', value: [ '1635744247556-1-1' ] }
Second entry: { key: 'description:version#0.50', value: [ '1635744285856-1-1' ] }

I can see the entries above in my store when I do a createReadStream. When my search keyword is "version", I expect to get only the second indexed document, but I get both the first and the second. I tried both the _SEARCH and QUERY methods, and both return the same extra results. I used the following test as a reference: https://github.com/fergiemcdowall/search-index/blob/master/test/src/memdown-test.js . Can someone advise on the correct approach for using other backend stores such as Amazon S3?

fergiemcdowall commented 2 years ago

Thanks for the bug report @ApsaraDhanasekar11. I think I understand your problem, but there are too many variables to reproduce it accurately.

Could you include a standalone script/test that demonstrates the issue?

ApsaraDhanasekar11 commented 2 years ago

Hi @fergiemcdowall, thanks for replying. Please find the example code below; any help with the details would be appreciated. Thanks!

const levelup = require('levelup');
const si = require('search-index');
const s3leveldown = require('s3leveldown');

const s3Store = await levelup(s3leveldown(bucketName, S3Client));

const idx = await si({ db: s3Store, storeVectors: true });

let data = [
  { _id: 'a', description: 'Use template to list' },
  { _id: 'b', description: 'All versions and updates' },
  { _id: 'c', description: 'Final is the file name' }
];

const result = await idx.PUT(data, { storeVectors: true });

// result is:
// [
//   { _id: 'a', operation: 'PUT', status: 'CREATED' },
//   { _id: 'b', operation: 'PUT', status: 'CREATED' },
//   { _id: 'c', operation: 'PUT', status: 'CREATED' }
// ]

// ** The above code creates the index as below:
{ key: 'description:file#1.00', value: [ 'c' ] }
{ key: 'description:final#1.00', value: [ 'c' ] }
{ key: 'description:list#1.00', value: [ 'a' ] }
{ key: 'description:name#1.00', value: [ 'c' ] }
{ key: 'description:template#1.00', value: [ 'a' ] }
{ key: 'description:updates#1.00', value: [ 'b' ] }
{ key: 'description:use#1.00', value: [ 'a' ] }
{ key: 'description:versions#1.00', value: [ 'b' ] }
{ key: '○DOCUMENT_COUNT○', value: 3 }
{ key: '○DOC_RAW○a○', value: { _id: 'a', description: 'Use template to list' } }
{ key: '○DOC_RAW○b○', value: { _id: 'b', description: 'All versions and updates' } }
{ key: '○DOC_RAW○c○', value: { _id: 'c', description: '"Final is the file name' } }
{ key: '○DOC○a○', value: { _id: 'a', description: [ 'list#1.00', 'template#1.00', 'to#1.00', 'use#1.00' ] } }
{ key: '○DOC○b○', value: { _id: 'b', description: [ 'all#1.00', 'and#1.00', 'updates#1.00', 'versions#1.00' ] } }
{ key: '○DOC○c○', value: { _id: 'c', description: [ 'file#1.00', 'final#1.00', 'is#1.00', 'name#1.00', 'the#1.00' ] } }
{ key: '○FIELD○description○', value: 'description' }

// ** For the search/query:
const queryResult = await idx.QUERY({
  GET: {
    FIELD: ['description'],
    VALUE: { GTE: 'versions', LTE: 'versions' }
  }
});
// Also tried other options: QUERY with GET and SEARCH, _SEARCH, and _GET.

But the result was:

{
  RESULT: [
    { _id: 'c', _match: [ 'description:file#1.00', 'description:final#1.00', 'description:name#1.00' ] },
    { _id: 'a', _match: [ 'description:list#1.00', 'description:template#1.00', 'description:use#1.00' ] },
    { _id: 'b', _match: [ 'description:updates#1.00', 'description:versions#1.00' ] }
  ],
  RESULT_LENGTH: 3
}

This returns matches for every indexed word that sorts alphabetically at or before "versions" (a through v), not just the match for "versions" itself.
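
The result above looks like what a range read would return if its lower bound were ignored. The sketch below is not search-index's own code; it only replays plain string-range filtering over the token keys from the index dump. The exact bounds are an assumption: it supposes the lookup for "versions" reads from 'description:versions' up to 'description:versions' followed by the '○' character that appears as a delimiter in the dump. The helper name rangeRead is hypothetical.

```javascript
// Token keys copied from the index dump in this thread.
const tokenKeys = [
  'description:file#1.00',
  'description:final#1.00',
  'description:list#1.00',
  'description:name#1.00',
  'description:template#1.00',
  'description:updates#1.00',
  'description:use#1.00',
  'description:versions#1.00'
].sort();

// Hypothetical helper: keep keys k with gte <= k <= lte (plain string comparison).
// A bound that is left undefined is treated as "no bound on that side".
function rangeRead(sortedKeys, { gte, lte } = {}) {
  return sortedKeys.filter(
    (k) => (gte === undefined || k >= gte) && (lte === undefined || k <= lte)
  );
}

// ASSUMPTION: bounds for a GTE/LTE lookup of 'versions' on field 'description'.
const range = {
  gte: 'description:versions',
  lte: 'description:versions' + '○'
};

// What a store that honours both bounds should return.
const expected = rangeRead(tokenKeys, range);

// What comes back if the store silently drops the lower bound.
const lowerBoundIgnored = rangeRead(tokenKeys, { lte: range.lte });

console.log('both bounds honoured:', expected);
console.log('lower bound ignored :', lowerBoundIgnored);
```

With both bounds applied, only 'description:versions#1.00' survives. With the lower bound dropped, all eight keys come back, which is the same set of keys listed in the _match arrays above.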

While tracing the process flow, I noticed that the GET function internally calls db.createReadStream, which should filter the data according to the keys passed in GTE and LTE. It looks like this filtering is failing: instead of the exact range, the result set appears to be bounded only by the first character of the keyword (in alphabetical order).
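
One way to test that suspicion without search-index in the picture is to call createReadStream on the store directly and check whether any returned key falls outside the requested range. The following is a minimal sketch, not a confirmed diagnosis. collectKeys and keysOutsideRange are hypothetical helper names, makeLeakyStore is a stand-in store written only for this demonstration (it deliberately ignores gte), and the range bounds are again an assumption about what the library requests.

```javascript
const { Readable } = require('stream');

// Collect every key a levelup-style store emits for the given range options.
function collectKeys(db, rangeOptions) {
  return new Promise((resolve, reject) => {
    const keys = [];
    db.createReadStream(rangeOptions)
      .on('data', ({ key }) => keys.push(String(key)))
      .on('error', reject)
      .on('end', () => resolve(keys));
  });
}

// Return the keys that fall outside [gte, lte].
// An empty array means the store honoured both bounds.
function keysOutsideRange(keys, { gte, lte }) {
  return keys.filter((k) => k < gte || k > lte);
}

// HYPOTHETICAL stand-in store for demonstration only: it applies `lte`
// but ignores `gte`. Replace it with the real levelup(s3leveldown(...)) instance.
function makeLeakyStore(entries) {
  const sorted = [...entries].sort((a, b) =>
    a.key < b.key ? -1 : a.key > b.key ? 1 : 0
  );
  return {
    createReadStream({ lte } = {}) {
      return Readable.from(
        sorted.filter((e) => lte === undefined || e.key <= lte)
      );
    }
  };
}

async function main() {
  const store = makeLeakyStore([
    { key: 'description:file#1.00', value: ['c'] },
    { key: 'description:use#1.00', value: ['a'] },
    { key: 'description:versions#1.00', value: ['b'] }
  ]);
  // ASSUMPTION: the bounds a lookup for 'versions' would request.
  const range = { gte: 'description:versions', lte: 'description:versions○' };
  const keys = await collectKeys(store, range);
  const stray = keysOutsideRange(keys, range);
  console.log('keys outside the requested range:', stray);
  return stray;
}

const demo = main();
```

If the same check against the real S3-backed store reports stray keys, the problem would sit in the store's range handling rather than in the query itself. If it reports none, the cause is more likely elsewhere.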