FAC10 / week4-SoFLY

https://obscure-brook-45630.herokuapp.com

Technical Spike: Searching Through Large Databases #9

Open lucyrose93 opened 7 years ago

lucyrose93 commented 7 years ago

Should we use .json or .txt? What's the most efficient & effective way to 🔍 through the file?

Comment your findings below...

samatar26 commented 7 years ago

We can use the fs Node core module, which has a method called readFile. It takes the filename as its first parameter and a callback function as its second. In the callback we can check filecontent.indexOf('string'); if that's greater than -1, the string was found inside the text file.

require("fs").readFile("filename.ext", "utf8", function (err, filecontent) {
    if (err) throw err;
    // Passing "utf8" makes filecontent a string rather than a Buffer,
    // so indexOf does a plain substring search.
    console.log("String" + (filecontent.indexOf("search string") > -1 ? " " : " not ") + "found");
});
finnhodgkin commented 7 years ago

From an answer on Stack Overflow about searching through a huge (2 million+ record) dataset:

I have split the records into different text files (at most 200 records per file) and put the files in different directories (I used the content of one data field to determine the directory tree). I end up with about 50000 files in about 40000 directories. I have then run Lucene to index the files. Searching for a string with the Lucene demo program is pretty fast. Splitting and indexing took a few minutes: this is totally acceptable for me because it is a static data set that I want to query.

So for the large .txt file maybe we will have to split the data into smaller files (a-z possibly) and then narrow down from there.
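A minimal sketch of that a-z split idea, assuming each record is a plain string (the names bucketFor and splitIntoBuckets are made up for illustration):

```javascript
// Bucket records by first letter so a search only has to scan one
// small bucket (which could be written to a.txt, b.txt, ...) instead
// of the whole dataset.
function bucketFor(record) {
  var first = record.trim().charAt(0).toLowerCase();
  return /[a-z]/.test(first) ? first : 'other';
}

function splitIntoBuckets(records) {
  var buckets = {};
  records.forEach(function (record) {
    var key = bucketFor(record);
    (buckets[key] = buckets[key] || []).push(record);
  });
  return buckets;
}

var buckets = splitIntoBuckets(['apple', 'Avocado', 'banana', '42nd street']);
// buckets.a holds 'apple' and 'Avocado'; '42nd street' falls into 'other'.
```

A search for a term starting with "a" then only touches the "a" bucket, which is what makes the Lucene-style pre-splitting fast.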

finnhodgkin commented 7 years ago

http://lunrjs.com/ <-- a search framework we could inspect

lucyrose93 commented 7 years ago

Wow -- that's thinking big! Alternatively, .json is easy to manipulate with JSON.parse() and JSON.stringify()
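A minimal sketch of the .json route: parse, filter with plain array methods, and stringify the result back out (the sample data here is made up):

```javascript
// Parse a JSON string into an array, search it with Array.prototype.filter,
// then serialize the matches back to JSON.
var json = '["apple pie","banana split","apple tart"]';
var records = JSON.parse(json);

var matches = records.filter(function (record) {
  return record.indexOf('apple') > -1;
});

var out = JSON.stringify(matches); // '["apple pie","apple tart"]'
```

For a real file you'd get the json string from fs.readFile first; the trade-off is that JSON.parse loads the whole dataset into memory, which is what the stream approach below avoids.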

samatar26 commented 7 years ago

Apparently using a stream can handle larger files. It's in the fs module:

var fs = require('fs');

// found starts false; content is the string/regex being searched for,
// and then(err, found) is a completion callback supplied by the
// surrounding code (the original answer left these out).
var found = false;
var stream = fs.createReadStream(path);

stream.on('data', function (d) {
  if (!found) found = !!('' + d).match(content);
});
stream.on('error', function (err) {
  then(err, found);
});
stream.on('close', function () {
  // 'close' receives no error argument, so report success here.
  then(null, found);
});

Still trying to understand the function/callback

yvonne-liu commented 7 years ago

Aha! Was just about to post on stream

yvonne-liu commented 7 years ago

See more here: https://nodejs.org/docs/v0.4.8/api/fs.html#fs.createReadStream

lucyrose93 commented 7 years ago

Based on this spike, I've created a security issue to prevent code injection:

#10

lucyrose93 commented 7 years ago
filter(pattern, keep)
Filter the files in the stream. pattern can be:

String: A glob pattern that files must match.
Function: This function gets the actual path to the file and must return a boolean.
NOTE: relative patterns are resolved against the same base cwd as the one used to set up the stream.

The optional keep parameter indicates whether files matching the pattern should be kept in the stream and the others excluded (true), or the other way around (false) (default: true)

var fs = require('fs-stream');

fs('/files/*.*')
  .pipe(fs.filter('/files/*.md'));
lucyrose93 commented 7 years ago

https://www.npmjs.com/package/fs-stream