fawazahmed0 opened 1 year ago
// LZMA-JS exposes callback-based compress/decompress; wrap them in Promises.
const my_lzma = require("lzma");

const compressPromisified = (stringOrByteArray, mode, on_progress_func) => {
    return new Promise((resolve, reject) => {
        my_lzma.compress(stringOrByteArray, mode, (result, error) => {
            if (error) reject(error)
            else resolve(result)
        }, on_progress_func)
    })
}

const decompressPromisified = (byteArray, on_progress_func) => {
    return new Promise((resolve, reject) => {
        my_lzma.decompress(byteArray, (result, error) => {
            if (error) reject(error)
            else resolve(result)
        }, on_progress_func)
    })
}

async function begin() {
    const compressedValue = await compressPromisified("hello world", 9)
    console.log(compressedValue)
    const decompressedValue = await decompressPromisified(compressedValue)
    console.log(decompressedValue)
}

begin()
BM25 scoring in pure JS, 7zipped index, fetch the index from remote
https://github.com/sql-js/sql.js (SQLite, supports powerful search, has BM25 etc.) https://www.sqlitetutorial.net/sqlite-full-text-search/
https://www.sqlite.org/fts5.html (official, read this)
(FTS5 is not enabled in sql.js, so need to recompile with that option and refer to their issues; they have Actions scripts which do all the building etc.) (also keep FTS3/FTS4 enabled; they have options which are not there in FTS5)
create the npm package first (refer to other search-engine packages and follow a similar structure), with the ability for future PRs
https://github.com/LZMA-JS/LZMA-JS (use this) (promisify it) (link and link2)
FTS5: Chinese is not supported by unicode61; the experimental trigram tokenizer might help target all languages, and ICU can be enabled https://www.sqlite.org/compile.html
https://www.sqlite.org/amalgamation.html
https://mjtsai.com/blog/2015/07/31/sqlite-fts5/#:~:text=The%20principle%20difference%20between%20FTS3,divided%20between%20multiple%20database%20records.
SQLite already supports official WASM builds for the browser, refer to https://www.sqlite.org/releaselog/3_40_1.html https://sqlite.org/wasm/doc/trunk/index.md (so use it instead of sql.js) https://developer.chrome.com/blog/sqlite-wasm-in-the-browser-backed-by-the-origin-private-file-system/
can use USE_ICU and see this
(FTS5 doesn't support the ICU tokenizer; it's only for FTS4)
using fts5 trigram should help support all languages
pass CFLAGS="-DSQLITE_ENABLE_ICU" with make command
to enable ICU: sudo apt update && sudo apt install libicu-dev libsqlite3-dev -y
divide the compressed index by size to allow fetching from free cdns etc
see how TensorFlow.js does it, i.e. divided into 4 MB shards. https://github.com/fawazahmed0/quran-verse-detection/tree/master
https://stackoverflow.com/questions/1778538/how-many-gcc-optimization-levels-are-there https://github.com/fawazahmed0/tiger/blob/master/.github/workflows/sql.yml (actions)
enabling ICU increases the binary size, so only use FTS5 with the trigram tokenizer
ref: https://github.com/fawazahmed0/sqlite-wasm-demo
make a module which can be imported; it should be importable using https://www.jsdelivr.com/esm; can append the JS scripts if not importable
also return the database, so the user can do whatever they want with it, e.g. db.exec
remove diacritics when adding to the table and when searching
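A minimal sketch of that normalization in plain JS, using Unicode NFD decomposition (the function name is mine; the same step would run at both index time and query time):

```javascript
// Sketch: strip diacritics by decomposing to NFD and deleting combining marks.
// Apply identically when building the index and when processing a query.
function removeDiacritics(text) {
  return text.normalize("NFD").replace(/\p{Diacritic}/gu, "");
}
```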
the size of the table gets huge, so need to divide it into shards (using ArrayBuffer byteLength, slice etc.), named file.sqlite.lzma.001 and so on; refer to the tfjs code for how model.json is fetched
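A sketch of that sharding step, under a few assumptions: a 4 MB shard size, zero-padded `.001`-style suffixes, and hypothetical helper names (`splitIntoShards`, `joinShards`):

```javascript
// Sketch: split a compressed index buffer into fixed-size shards so each
// piece stays under a CDN-friendly limit (4 MB is an assumed value).
const SHARD_SIZE = 4 * 1024 * 1024;

// Returns [{ name, bytes }] with suffixes file.sqlite.lzma.001, .002, ...
function splitIntoShards(buffer, baseName, shardSize = SHARD_SIZE) {
  const bytes = new Uint8Array(buffer);
  const shards = [];
  for (let offset = 0, i = 1; offset < bytes.byteLength; offset += shardSize, i++) {
    shards.push({
      name: `${baseName}.${String(i).padStart(3, "0")}`,
      bytes: bytes.slice(offset, offset + shardSize),
    });
  }
  return shards;
}

// Reassemble on the client after fetching every shard in order.
function joinShards(shards) {
  const total = shards.reduce((n, s) => n + s.bytes.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const s of shards) {
    out.set(s.bytes, offset);
    offset += s.bytes.byteLength;
  }
  return out;
}
```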
Divide into tasks and start working on it
or it's better to make a search engine from scratch in pure JS, with BM25 scoring, ICU, substring match using String.prototype.includes(), regex match etc.
https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables https://github.com/FurkanToprak/OkapiBM25 (this one) https://github.com/winkjs/wink-bm25-text-search https://github.com/zjohn77/retrieval https://github.com/redaxo/redaxo/commit/154023ecd4c96e500f5865e809eddf0f58dc7527#diff-d6747527bdb2eac9e2dceb8470a2bdb3e7ca4849c7589c8348d49b5f2fe508c4R190
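A rough Okapi BM25 sketch in plain JS, following the standard formula described in the Elastic blog post above; whitespace tokenization and the common k1/b defaults are simplifying assumptions, and the helper names are mine:

```javascript
// Sketch of Okapi BM25 scoring in plain JS. Tokenization is plain whitespace
// (a simplification); k1 and b use the common defaults.
const K1 = 1.2;
const B = 0.75;

function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// docs: array of strings. Returns score(query) -> [{ index, score }] sorted
// by descending score.
function buildBM25(docs) {
  const tokened = docs.map(tokenize);
  const avgdl = tokened.reduce((n, t) => n + t.length, 0) / docs.length;
  const df = new Map(); // term -> number of docs containing it
  for (const terms of tokened) {
    for (const term of new Set(terms)) df.set(term, (df.get(term) || 0) + 1);
  }
  const idf = (term) => {
    const n = df.get(term) || 0;
    return Math.log((docs.length - n + 0.5) / (n + 0.5) + 1);
  };
  return function score(query) {
    return tokened
      .map((terms, index) => {
        let s = 0;
        for (const q of new Set(tokenize(query))) {
          const f = terms.filter((t) => t === q).length; // term frequency
          s += (idf(q) * f * (K1 + 1)) /
               (f + K1 * (1 - B + (B * terms.length) / avgdl));
        }
        return { index, score: s };
      })
      .sort((a, b) => b.score - a.score);
  };
}
```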
checking something is a RegExp https://tc39.es/ecma262/#sec-isregexp https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#special_handling_for_regexes (Underscore and Lodash have an isRegExp function, can copy from there) https://www.npmjs.com/package/lodash.isregexp https://lodash.com/docs/4.17.15#isRegExp
can use JSON.parse, string functions etc., but while saving, divide the JSON into pieces of less than 4 MB to be parsable (keep each LZMA blob under 4 MB, and parse & stringify each JSON in the array individually; after fetching the data, just wrap [ and ] around it to make it an array and parse each JSON individually in it); store the index in JSON by precalculating things
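A sketch of that save/load scheme, with assumed names and an assumed 4 MB limit: each array item is stringified individually, items are grouped into chunks below the limit, and a fetched chunk becomes a valid array again by wrapping it with [ and ]:

```javascript
// Sketch: stringify each array item on its own, group items into chunks
// under a byte limit (4 MB assumed), and parse a chunk back by wrapping it
// with "[" and "]".
const MAX_CHUNK_BYTES = 4 * 1024 * 1024;

function chunkJsonArray(items, maxBytes = MAX_CHUNK_BYTES) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const item of items) {
    const line = JSON.stringify(item);
    const lineBytes = Buffer.byteLength(line, "utf8") + 1; // +1 for separator
    if (size + lineBytes > maxBytes && current.length > 0) {
      chunks.push(current.join(","));
      current = [];
      size = 0;
    }
    current.push(line);
    size += lineBytes;
  }
  if (current.length) chunks.push(current.join(","));
  return chunks;
}

// Turn a fetched chunk back into an array.
function parseChunk(chunk) {
  return JSON.parse("[" + chunk + "]");
}
```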
can use localeCompare for string comparison (it returns 0 when equal): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/Collator https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/compare https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/compare#using_compare_for_array_search
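A small illustration of compare() returning 0 for equal strings; the sensitivity: "base" option (which also ignores case and accents) is my assumption, not something the notes specify:

```javascript
// Sketch: locale-aware equality via Intl.Collator. compare() returns 0 when
// the strings are considered equal under the chosen sensitivity;
// "base" ignores case and accent differences.
const collator = new Intl.Collator(undefined, { sensitivity: "base" });

function localeEquals(a, b) {
  return collator.compare(a, b) === 0;
}
```

Reusing one Intl.Collator instance is also much faster than calling String.prototype.localeCompare repeatedly in a search loop.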
use text to link fragments for highlighting multiple words https://web.dev/text-fragments/#multiple-text-fragments-in-one-url
https://web.dev/text-fragments/#programmatic-text-fragment-link-generation
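A sketch of programmatic generation of a multi-word text-fragment URL (the helper name is mine; the #:~:text= syntax is the one described in the web.dev articles above):

```javascript
// Sketch: append a text-fragment directive (#:~:text=...) for each matched
// word; supporting browsers scroll to and highlight every fragment.
function withTextFragments(url, words) {
  const directives = words.map((w) => "text=" + encodeURIComponent(w));
  return url + "#:~:" + directives.join("&");
}
```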
compare the scores of this and this to make sure JS version calculates things properly
I think no need for regex etc. (can do substring matching by partially matching the tokens)
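A minimal sketch of that idea, matching tokens by substring in both directions (the function name is an assumption):

```javascript
// Sketch: a query token matches an indexed token if either contains the
// other, so "search" matches "searching" and vice versa.
function tokensMatch(queryToken, indexToken) {
  return indexToken.includes(queryToken) || queryToken.includes(indexToken);
}
```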
Can use Rollup for library building; for the demo, maybe Vite.
7z, if size can be reduced more
Node.js has a limitation reading a large string/JSON file, so need to feed each JSON object individually
elasticlunr also does stemming, stop word filter etc
substring matching (both on query and word)
see how Elasticsearch works internally https://www.elastic.co/what-is/elasticsearch#:~:text=Elasticsearch%20uses%20a%20data%20structure%20called%20an%20inverted%20index%2C%20which%20is%20designed%20to%20allow%20very%20fast%20full%2Dtext%20searches.%20An%20inverted%20index%20lists%20every%20unique%20word%20that%20appears%20in%20any%20document%20and%20identifies%20all%20of%20the%20documents%20each%20word%20occurs%20in.
https://buildatscale.tech/how-elasticsearch-works-internally/
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
https://stackoverflow.com/questions/22044833/how-does-elastic-search-keep-the-index
get a way to extract the Elasticsearch inverted index (this way tokenization, stemming etc. will be handled by Elasticsearch; we only need the inverted index)
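A toy sketch of the inverted-index structure described above (names are mine; real engines layer tokenization, stemming and postings compression on top):

```javascript
// Sketch of an inverted index: map every unique token to the set of document
// ids it occurs in, the core structure behind Elasticsearch full-text search.
function buildInvertedIndex(docs) {
  const index = new Map(); // token -> Set of doc ids
  docs.forEach((text, id) => {
    for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  });
  return index;
}

// Look up the document ids containing a token.
function lookup(index, token) {
  return [...(index.get(token.toLowerCase()) || [])];
}
```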
No need to overkill (stemming etc.)
see ngram tokenizer https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
edge_ngram tokenizer (ngram i.e. 3+, for both query and document; need to modify score calculation for partial match; think how people perform search and only implement accordingly, no need to overkill; refer to other searches like Google, Bing, mdBook, Elasticsearch, lunr.js, elasticlunr etc.) https://stackoverflow.com/questions/60523240/elasticsearch-why-exact-match-has-lower-score-than-partial-match https://stackoverflow.com/questions/64530450/how-to-make-shorter-closer-token-match-more-relevant-edge-ngram https://github.com/FurkanToprak/OkapiBM25/blob/master/BM25.ts
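A sketch of an edge_ngram-style tokenizer in plain JS, mirroring the prefix behavior of Elasticsearch's edge_ngram tokenizer; the min/max gram defaults are assumptions:

```javascript
// Sketch: emit prefixes of a token from minGram to maxGram characters,
// like Elasticsearch's edge_ngram tokenizer (e.g. "search" -> "sea", "sear",
// "searc", "search" with minGram 3).
function edgeNgrams(token, minGram = 3, maxGram = 10) {
  const grams = [];
  for (let n = minGram; n <= Math.min(maxGram, token.length); n++) {
    grams.push(token.slice(0, n));
  }
  return grams;
}
```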
plain whitespace segmentation if Intl.Segmenter is not available
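A sketch of that fallback (the function name is mine): Intl.Segmenter with word granularity when available, plain whitespace splitting otherwise:

```javascript
// Sketch: prefer Intl.Segmenter for word segmentation (it also handles
// languages written without spaces), falling back to whitespace splitting.
function segmentWords(text, locale) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return [...segmenter.segment(text)]
      .filter((s) => s.isWordLike)
      .map((s) => s.segment);
  }
  return text.split(/\s+/).filter(Boolean);
}
```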
async and parallel index building support etc
demo with a single edition of each language (for English, Mustafa Khattab; for other languages, maybe machine translation via Google Translate)