fergiemcdowall / search-index

A persistent, network resilient, full text search library for the browser and Node.js

Building index by calling PUT once for each document does not return all matches in query afterwards #537

Closed elega9t closed 3 years ago

elega9t commented 3 years ago

Hi,

I am building my index by loading files asynchronously and calling PUT for each file once it's loaded.

The QUERY on the index behaves inconsistently. E.g., let's say the number of matching documents for a query is 3; sometimes the query returns 1 match, sometimes 2, but never 3.

I have tried changing the code to collect all documents in an array and doing just one PUT. In that case, the query returns all 3 documents.

I do not want to load all documents into memory at once, so I would like to go back to calling PUT for each document as it is loaded.

Any insights on this issue would be very much appreciated.

Thanks.

fergiemcdowall commented 3 years ago

It sounds like you may have some basic issues with the way you have set search-index up.

If you can post the code here, then we can try to help you.

elega9t commented 3 years ago

Hi, here are the relevant extracts of code:

const { PUT, QUERY } = await si({ db: memdown('journals') });

const files: Array<string> = [...]; // List of files

const indexingPromises = [];
_.forEach(files, (file: string) => {
  indexingPromises.push(this.indexFile(file));
});
Promise.all(indexingPromises)
  .then(() => {
    console.log(moment.utc().format() + ': Finished building search index');
  })
  .catch((err) => {
    console.log(moment.utc().format() + ': Failed building search index', err);
  });

and

indexFile(file: string) {
  return new Promise((resolve, reject) => {
    const promises = [];

    const readStream = fs.createReadStream(file).pipe(split(JSON.parse));

    readStream.on('data', (entry) => {
      promises.push(PUT(entry));
    });

    readStream.on('end', () => {
      Promise.all(promises)
        .then(() => {
          resolve();
        });
    });

    readStream.on('error', (err) => {
      reject(err)
    });
  });
}

Hope this explains the setup I am using. Thanks

fergiemcdowall commented 3 years ago

How about arranging your code like this:

(async function() {
  const fs = require('fs')
  const si = require('search-index')
  const memdown = require('memdown')

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  })

  const batch = fs.readdirSync('journals').map(
    j => JSON.parse(fs.readFileSync('journals/' + j))
  )

  await PUT(batch)

  console.log(JSON.stringify(await QUERY({
    OR: ['dag', 'emmey', 'dorie', 'cordell', 'gregoire']
  }), null, 2))

}())

I put together an example repo. As you can see, the query always returns a consistent result.

elega9t commented 3 years ago

Hi, thanks for taking a look. I wanted to use streams and promises because each file can contain multiple JSON entries and the files themselves can be huge (gigabytes), so reading the entire content into memory at once is not feasible.

I'll create a branch on the example repo and update the code and sample data to resemble what I am working with.

elega9t commented 3 years ago

Hi, I have updated the code in the fork https://github.com/elega9t/search-index-issue-537

If you run the code multiple times, you will notice inconsistent results being returned.

Based on the query, it should return 3 results; however, sometimes it returns 1 and sometimes 2.

(async function() {
  const fs = require('fs')
  const split = require('split')
  const si = require('search-index')
  const memdown = require('memdown')

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  });

  function readFile(file) {
    return new Promise((resolve, reject) => {
      let readStream = fs.createReadStream('journals/' + file).pipe(split(JSON.parse, null, { trailing: false }));

      const promises = [];
      readStream.on('data', (entry) => {
        promises.push(PUT([ entry ]));
      });

      readStream.on('end', () => {
        Promise.all(promises)
          .then(() => resolve());
      });

      readStream.on('error', (err) => {
        reject(err)
      });
    });
  }

  const promises = fs.readdirSync('journals').map(file => readFile(file));

  await Promise.all(promises);

  const results = await QUERY(
    { SEARCH: [ 'Chaudron' ] },
    { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
  );

  console.log(JSON.stringify(results));

}())

Thanks

fergiemcdowall commented 3 years ago

The problem here is that you can't PUT documents concurrently, so the line promises.push(PUT([ entry ])) is creating the problem. You should wait for the previous PUT to finish before starting a new PUT. This should probably be better explained in the documentation.

Generally it's best to PUT batches of documents rather than individual documents, so the best approach here would probably still be to PUT all the objects in the file at once, rather than PUTing them individually.

Iterating over collections of Promises within the context of streams remains a tricky thing to do, so personally I avoid streams for document processing, even though they seem like a reasonable approach at first glance.
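
In other words (a minimal sketch, not code from the example repo; it assumes PUT comes from an already-initialised search-index instance and docs is an array of parsed documents):

async function indexDocs (docs) {
  // Concurrent PUTs (e.g. docs.map(doc => PUT([ doc ]))) are what produce
  // the inconsistent index here. Start each PUT only after the previous
  // one has resolved:
  for (const doc of docs) {
    await PUT([ doc ])
  }

  // Better still, when memory allows, index the whole array in one call:
  // await PUT(docs)
}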

elega9t commented 3 years ago

Thanks a lot for looking into this issue. I will write a flatMap utility to queue PUTs for now.

elega9t commented 3 years ago

Hi, I have chained the Promises so that they run sequentially rather than concurrently. However, now I consistently get 2 results, rather than the expected 3.

function readFile(file) {
  return new Promise((resolve, reject) => {
    let readStream = fs.createReadStream('journals/' + file).pipe(split(JSON.parse, null, { trailing: false }));

    const promise = Promise.resolve();
    readStream.on('data', (entry) => {
      promise.then(() => {
        console.log('Indexing: ' + JSON.stringify(entry));
        return PUT([ entry ]);
      });
    });

    readStream.on('end', () => {
      promise.then(() => resolve());
    });

    readStream.on('error', (err) => {
      reject(err)
    });
  });
}

let promise = Promise.resolve();

fs.readdirSync('journals').forEach(file => {
  promise = promise.then(() => { return readFile(file); });
});

await promise;

const results = await QUERY(
  { SEARCH: [ 'Chaudron' ] },
  { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
);

console.log(JSON.stringify(results));

Please can you take a look at this and suggest a workaround for now?

fergiemcdowall commented 3 years ago

A sensible approach would be to do something like this:

(async function() {
  const fs = require('fs')
  const split = require('split')
  const si = require('search-index')
  const memdown = require('memdown')

  const { PUT, QUERY } = await si({
    db: memdown('journals')
  });

  const readFile = file => new Promise(resolve => {
    let batch = []
    let readStream = fs.createReadStream('journals/' + file).pipe(
      split(JSON.parse, null, { trailing: false })
    ).on(
      'data', d => batch.push(d)
    ).on(
      'end', () => resolve(batch)
    )
  })

  for (const f of fs.readdirSync('journals')) {
    await PUT(await readFile(f))
  }

  const results = await QUERY(
    { SEARCH: [ 'Chaudron' ] },
    { DOCUMENTS: true, PAGE: { NUMBER: 0, SIZE: 20 } }
  );

  console.log(JSON.stringify(results))

}())

elega9t commented 3 years ago

Hi, the bug in my sequenced code was that I wasn't reassigning the promise after each document was PUT. I really don't want to accumulate all the documents in memory, so collecting them into an array and then doing a single PUT is not suitable.

  let promise = Promise.resolve();
  readStream.on('data', (entry) => {
    promise = promise.then(() => {
      console.log('Indexing: ' + JSON.stringify(entry));
      return PUT([ entry ]);
    });
  });

Now I am getting all 3 documents as expected. Thanks for your help, cheers!

fergiemcdowall commented 3 years ago

Just to reiterate: PUTing batches is computationally less expensive than PUTing those documents individually, so it will always be quicker to use batches. Start off with the batch approach, and only switch to individual PUTs if/when you run out of memory.

Also, if memory use is a pain point, then don't use memdown; just stick with the default leveldown/leveljs.
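
One way to get the best of both (a rough sketch, not code from the example repo: it reuses the fs/split/search-index setup from the snippets above, runs inside the same async IIFE, the indexFileInBatches helper is just illustrative, and the batch size of 1000 is an arbitrary choice to tune to your memory budget):

const BATCH_SIZE = 1000

const indexFileInBatches = file => new Promise((resolve, reject) => {
  let batch = []
  const stream = fs.createReadStream('journals/' + file)
    .pipe(split(JSON.parse, null, { trailing: false }))

  // PUT the accumulated chunk, pausing the stream so the next chunk is
  // only read once the previous PUT has finished
  const flush = () => {
    const docs = batch
    batch = []
    stream.pause()
    return PUT(docs).then(() => stream.resume())
  }

  stream.on('data', doc => {
    batch.push(doc)
    if (batch.length >= BATCH_SIZE) flush().catch(reject)
  })
  stream.on('end', () => {
    (batch.length ? PUT(batch) : Promise.resolve())
      .then(resolve)
      .catch(reject)
  })
  stream.on('error', reject)
})

for (const file of fs.readdirSync('journals')) {
  await indexFileInBatches(file)
}

That keeps memory bounded to roughly one batch at a time while still giving PUT reasonably large batches.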

In the forthcoming search-index@3 it will be possible to PUT in parallel.

Anyway, glad to hear that you got it working! 👍🙂