fawazahmed0 / islamic-project

List of Islamic Projects

Client-side search engine for q & h #7

Open fawazahmed0 opened 1 year ago

fawazahmed0 commented 1 year ago

- BM25 scoring
- pure JS
- 7-zipped index
- fetch the index from a remote source

https://github.com/sql-js/sql.js (SQLite; supports powerful search, has BM25 etc.) https://www.sqlitetutorial.net/sqlite-full-text-search/

https://www.sqlite.org/fts5.html (official, read this)

(FTS5 is not enabled in sql.js, so it needs to be recompiled with that option; refer to their issues. They have Actions scripts which do all the building etc.) (Also keep FTS3/FTS4 enabled; they have options which are not in FTS5.)

Create the npm package first (refer to other search-engine packages and follow a similar structure), with the ability to accept future PRs.

https://github.com/LZMA-JS/LZMA-JS (use this) (promisify it) (link and link2 )

FTS5: Chinese is not supported by the unicode61 tokenizer; the experimental trigram tokenizer might help target all languages, and ICU can be enabled https://www.sqlite.org/compile.html

https://www.sqlite.org/amalgamation.html

https://mjtsai.com/blog/2015/07/31/sqlite-fts5/#:~:text=The%20principle%20difference%20between%20FTS3,divided%20between%20multiple%20database%20records.

sqlite3 already ships official WASM builds for the browser; refer to https://www.sqlite.org/releaselog/3_40_1.html https://sqlite.org/wasm/doc/trunk/index.md (so use it instead of sql.js) https://developer.chrome.com/blog/sqlite-wasm-in-the-browser-backed-by-the-origin-private-file-system/

can use USE_ICU and see this

(FTS5 doesn't support the ICU tokenizer; it's only for FTS4.)

Using the FTS5 trigram tokenizer should help support all languages.
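As a rough illustration of why a trigram tokenizer covers arbitrary scripts, a token can be broken into overlapping three-character windows (a sketch in plain JS, not FTS5's actual implementation):

```javascript
// Sliding-window trigrams (sketch): "quran" -> ["qur", "ura", "ran"].
// FTS5's trigram tokenizer works on the same principle, which is why it
// can match substrings in any script, including Chinese.
function trigrams(token) {
  const grams = [];
  for (let i = 0; i + 3 <= token.length; i++) {
    grams.push(token.slice(i, i + 3));
  }
  return grams;
}
```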

Pass CFLAGS="-DSQLITE_ENABLE_ICU" with the make command:

make CFLAGS="-DSQLITE_ENABLE_ICU `pkg-config --libs --cflags icu-uc icu-io`"

To enable ICU, first install the dev packages: sudo apt update && sudo apt install libicu-dev libsqlite3-dev -y

Divide the compressed index by size to allow fetching from free CDNs etc.

See how TensorFlow.js does it, i.e. models are divided into 4 MB shards. https://github.com/fawazahmed0/quran-verse-detection/tree/master

https://stackoverflow.com/questions/1778538/how-many-gcc-optimization-levels-are-there https://github.com/fawazahmed0/tiger/blob/master/.github/workflows/sql.yml (actions)

Enabling ICU increases the binary size, so only use FTS5 with the trigram tokenizer.

ref: https://github.com/fawazahmed0/sqlite-wasm-demo

Make a module which can be imported; it should be importable using https://www.jsdelivr.com/esm . Can append the JS scripts if it's not importable.

Also return the database object, so the user can do whatever they want with db.exec.

Remove diacritics both when adding to the table and when searching.
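Diacritic removal can be done with NFD normalization plus Unicode property escapes (a sketch; `stripDiacritics` is a hypothetical helper name):

```javascript
// Strip diacritics/combining marks (sketch): decompose to NFD so base letters
// and marks become separate code points, then remove the marks. Works for
// Arabic harakat as well as Latin accents.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/\p{Diacritic}|\p{Mark}/gu, "");
}
```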

The table size gets huge, so the database needs to be divided into shards (using ArrayBuffer byteLength, slice etc.), named file.sqlite.lzma.001 and so on; refer to the tfjs code for how model.json is fetched.
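The sharding step could look like this (a sketch; the 4 MB cap and the file.sqlite.lzma.001 naming follow the notes above, everything else is illustrative):

```javascript
// Split a compressed index buffer into fixed-size shards so each piece
// stays under free-CDN size limits.
const SHARD_SIZE = 4 * 1024 * 1024;

function shardBuffer(bytes, shardSize = SHARD_SIZE) {
  const shards = [];
  for (let offset = 0; offset < bytes.byteLength; offset += shardSize) {
    const index = String(shards.length + 1).padStart(3, "0"); // 001, 002, ...
    shards.push({
      name: `file.sqlite.lzma.${index}`,
      data: bytes.slice(offset, offset + shardSize),
    });
  }
  return shards;
}

// Reassemble fetched shards (in order) back into one buffer.
function joinShards(shards) {
  const total = shards.reduce((n, s) => n + s.data.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const s of shards) {
    out.set(s.data, offset);
    offset += s.data.byteLength;
  }
  return out;
}
```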

Divide into tasks and start working on it

Or it may be better to build the search engine from scratch in pure JS, with BM25 scoring, ICU, substring match using String.prototype.includes(), regex match etc.

https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
https://github.com/FurkanToprak/OkapiBM25 (this one)
https://github.com/winkjs/wink-bm25-text-search
https://github.com/zjohn77/retrieval
https://github.com/redaxo/redaxo/commit/154023ecd4c96e500f5865e809eddf0f58dc7527#diff-d6747527bdb2eac9e2dceb8470a2bdb3e7ca4849c7589c8348d49b5f2fe508c4R190
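A minimal pure-JS sketch of the BM25 formula described in the Elastic article above (whitespace tokenization and the Lucene-style IDF are simplifying assumptions, not the final design):

```javascript
// BM25 with the usual default parameters.
const k1 = 1.2, b = 0.75;

function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Precompute per-document tokens, average document length, and document
// frequency (how many docs contain each term).
function buildIndex(docs) {
  const docTokens = docs.map(tokenize);
  const avgdl = docTokens.reduce((n, t) => n + t.length, 0) / docs.length;
  const df = new Map();
  for (const tokens of docTokens) {
    for (const term of new Set(tokens)) df.set(term, (df.get(term) || 0) + 1);
  }
  return { docTokens, avgdl, df, N: docs.length };
}

function bm25Score(index, query, docId) {
  const tokens = index.docTokens[docId];
  let score = 0;
  for (const term of tokenize(query)) {
    const n = index.df.get(term) || 0;
    if (n === 0) continue;
    // IDF with the +1 inside the log, as Lucene/Elasticsearch use it.
    const idf = Math.log(1 + (index.N - n + 0.5) / (n + 0.5));
    const tf = tokens.filter((t) => t === term).length;
    score += idf * (tf * (k1 + 1)) /
      (tf + k1 * (1 - b + (b * tokens.length) / index.avgdl));
  }
  return score;
}
```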

Checking whether something is a RegExp: https://tc39.es/ecma262/#sec-isregexp https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#special_handling_for_regexes (Underscore and Lodash have an isRegExp function; can copy from there.) https://www.npmjs.com/package/lodash.isregexp https://lodash.com/docs/4.17.15#isRegExp

Can use JSON.parse, string functions etc., but when saving, divide the JSON into chunks of less than 4 MB so each is parsable. (Keep each LZMA blob under 4 MB, and parse & stringify each JSON object in the array individually. After fetching the data, just wrap [ ] around it to make it an array, and parse each JSON object in it individually.)
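The chunking idea above could be sketched as follows (Node-flavored; the 4 MB cap comes from these notes, and `packJson`/`unpackJson` are hypothetical names):

```javascript
// Stringify each object individually and pack comma-joined strings into
// chunks under a byte cap, so each chunk stays small enough to handle.
function packJson(objects, maxBytes = 4 * 1024 * 1024) {
  const chunks = [];
  let current = [];
  let size = 0;
  for (const obj of objects) {
    const s = JSON.stringify(obj);
    const bytes = Buffer.byteLength(s) + 1; // +1 for the joining comma
    if (size + bytes > maxBytes && current.length) {
      chunks.push(current.join(","));
      current = [];
      size = 0;
    }
    current.push(s);
    size += bytes;
  }
  if (current.length) chunks.push(current.join(","));
  return chunks;
}

// After fetching, concatenate the chunks and wrap [ ] to get the array back.
function unpackJson(chunks) {
  return JSON.parse("[" + chunks.join(",") + "]");
}
```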

Store the index as JSON, precalculating as much as possible.

Can use localeCompare for string comparison (returns 0 when equal):
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/Collator
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/compare
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Collator/compare#using_compare_for_array_search

let str = "ٱلَّذِينَ يُؤۡمِنُونَ بِٱلۡغَيۡبِ وَيُقِيمُونَ ٱلصَّلَوٰةَ وَمِمَّا رَزَقۡنَٰهُمۡ يُنفِقُونَ"
// strip diacritics and other combining marks via NFD decomposition
let newstr = str.normalize("NFD").replace(/\p{Diacritic}|\p{Mark}|\p{Extender}|\p{Bidi_Control}/gu, "")
// compare() returns 0 when the collator considers the strings equal
console.log(new Intl.Collator(undefined,{usage:"search",sensitivity:"base",ignorePunctuation:true}).compare(str,newstr))

Refer to other search engines, their features, and how their internals work (like index storage format etc.); can get ideas from them.

Use text-fragment links for highlighting multiple words https://web.dev/text-fragments/#multiple-text-fragments-in-one-url

https://web.dev/text-fragments/#programmatic-text-fragment-link-generation
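Generating such a link programmatically is straightforward (a sketch; the #:~:text=...&text=... syntax follows the spec pages linked above, and `withTextFragments` is a hypothetical helper name):

```javascript
// Build a link that highlights several terms via URL text fragments.
// Each term becomes its own text= directive, joined with &.
function withTextFragments(url, terms) {
  const directive = terms
    .map((t) => "text=" + encodeURIComponent(t))
    .join("&");
  return url + "#:~:" + directive;
}
```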

Compare the scores of this and this to make sure the JS version calculates things properly.

I think there's no need for regex etc.

(Can do substring matching by partially matching the tokens.)

Can use Rollup for library building; for the demo, maybe Vite.

Use 7z if the size can be reduced further.

Node.js has a limit on reading large strings/JSON files, so each JSON object needs to be fed individually.

elasticlunr also does stemming, stop-word filtering etc.

Substring matching (on both the query and the word).
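A sketch of two-way partial token matching with String.prototype.includes (the scoring hookup is left out; the helper names are illustrative):

```javascript
// A query token matches a document token if either contains the other,
// so "believ" matches "believers" and vice versa.
function partialMatch(queryToken, docToken) {
  const q = queryToken.toLowerCase();
  const d = docToken.toLowerCase();
  return d.includes(q) || q.includes(d);
}

// Collect the document tokens hit by any token of the query.
function matchingTokens(query, docTokens) {
  const queryTokens = query.toLowerCase().split(/\s+/).filter(Boolean);
  return docTokens.filter((t) => queryTokens.some((q) => partialMatch(q, t)));
}
```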

See how Elasticsearch works internally https://www.elastic.co/what-is/elasticsearch#:~:text=Elasticsearch%20uses%20a%20data%20structure%20called%20an%20inverted%20index%2C%20which%20is%20designed%20to%20allow%20very%20fast%20full%2Dtext%20searches.%20An%20inverted%20index%20lists%20every%20unique%20word%20that%20appears%20in%20any%20document%20and%20identifies%20all%20of%20the%20documents%20each%20word%20occurs%20in.

https://buildatscale.tech/how-elasticsearch-works-internally/

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html

https://stackoverflow.com/questions/22044833/how-does-elastic-search-keep-the-index

Find a way to extract Elasticsearch's inverted index (this way tokenization, stemming etc. will be handled by Elasticsearch; we only need the inverted index).
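A toy version of the inverted index the Elasticsearch docs above describe (whitespace tokenization and AND-semantics for queries are simplifying assumptions):

```javascript
// Minimal inverted index: every unique token maps to the sorted list of
// document ids it occurs in.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    const tokens = new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
    for (const token of tokens) {
      if (!index.has(token)) index.set(token, []);
      index.get(token).push(docId);
    }
  });
  return index;
}

// AND-query: return ids of documents containing every query token.
function search(index, query) {
  const lists = query.toLowerCase().split(/\s+/).filter(Boolean)
    .map((t) => index.get(t) || []);
  if (!lists.length) return [];
  return lists.reduce((acc, list) => acc.filter((id) => list.includes(id)));
}
```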

No need to overkill (stemming etc.).

see ngram tokenizer https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html

edge_ngram tokenizer (n-grams of 3+, for both query and document; the score calculation needs modifying for partial matches. Think about how people actually perform searches and only implement accordingly, no need to overkill; refer to other searches like Google, Bing, mdBook, Elasticsearch, lunr.js, elasticlunr etc.) https://stackoverflow.com/questions/60523240/elasticsearch-why-exact-match-has-lower-score-than-partial-match https://stackoverflow.com/questions/64530450/how-to-make-shorter-closer-token-match-more-relevant-edge-ngram https://github.com/FurkanToprak/OkapiBM25/blob/master/BM25.ts
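An edge n-gram of minimum length 3 can be sketched in plain JS (this mimics the idea behind Elasticsearch's edge_ngram tokenizer, not its implementation):

```javascript
// Edge n-grams of length 3+ for a token: every prefix of the token that is
// at least minGram characters long. Useful for search-as-you-type.
function edgeNgrams(token, minGram = 3) {
  const grams = [];
  for (let len = minGram; len <= token.length; len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}
```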

Fall back to plain whitespace segmentation if Intl.Segmenter is not available.
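The fallback could be as simple as this (a sketch; Intl.Segmenter is available in modern browsers and recent Node versions):

```javascript
// Word segmentation with Intl.Segmenter when available (handles scripts
// without spaces, like Chinese), falling back to whitespace splitting.
function segmentWords(text, locale) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
    return Array.from(segmenter.segment(text))
      .filter((s) => s.isWordLike)   // drop spaces and punctuation segments
      .map((s) => s.segment);
  }
  return text.split(/\s+/).filter(Boolean);
}
```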

Support async and parallel index building etc.

Demo with a single edition per language (for English, Mustafa Khattab; for other languages, maybe Google machine translation).

fawazahmed0 commented 1 year ago
// Promise wrappers around LZMA-JS's callback API.
var my_lzma = require("lzma");

// mode is the compression level (1-9); on_progress_func is optional.
const compressPromisified = (stringOrByteArray, mode, on_progress_func) => {
    return new Promise((resolve, reject) => {
        my_lzma.compress(stringOrByteArray, mode, (result, error) => {
            if (error) reject(error)
            else resolve(result)
        }, on_progress_func)
    })
}

const decompressPromisified = (byteArray, on_progress_func) => {
    return new Promise((resolve, reject) => {
        my_lzma.decompress(byteArray, (result, error) => {
            if (error) reject(error)
            else resolve(result)
        }, on_progress_func)
    })
}

// Round-trip sanity check: compress at level 9, then decompress.
async function begin() {
    let compressedValue = await compressPromisified("hello world", 9)
    console.log(compressedValue)
    let decompressedValue = await decompressPromisified(compressedValue)
    console.log(decompressedValue)
}
begin()