PeernetOfficial / core

Core library. Use this to create a new Peernet application.

Base search with Indexes #67

Closed: Akilan1999 closed this 2 years ago

Akilan1999 commented 3 years ago

Index database

For the index database we are using an SQLite database. Database structure:

CREATE TABLE `search_indices` (`id` text,`hash` text,`key_hash` text,PRIMARY KEY (`id`));
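
For reference, a GORM model matching this table could look roughly like the following (a sketch; the struct and field names are assumptions derived from the schema above):

// SearchIndex mirrors the search_indices table above (hypothetical model).
type SearchIndex struct {
    ID      string `gorm:"primaryKey;column:id"`
    Hash    string `gorm:"column:hash"`
    KeyHash string `gorm:"column:key_hash"`
}

// TableName maps the struct to the existing table name.
func (SearchIndex) TableName() string {
    return "search_indices"
}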

Features Implemented:

Basic normalization and sanitation

This implementation does the basics such as:

  1. Ensures the string has no double spaces
  2. Replaces _ and - with a space
  3. Removes diacritics

Ex:

NormalizeWords("français")
NormalizeWords("testé-lol_What to do-idk")

// result
francais
teste lol What to do idk
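
A minimal sketch of what such a normalization routine might look like, assuming the golang.org/x/text packages are used for diacritic removal (the actual implementation in search/ may differ):

package search

import (
    "strings"
    "unicode"

    "golang.org/x/text/runes"
    "golang.org/x/text/transform"
    "golang.org/x/text/unicode/norm"
)

// NormalizeWords strips diacritics, replaces _ and - with spaces and
// collapses repeated whitespace (sketch only).
func NormalizeWords(s string) string {
    // Decompose, drop combining marks (diacritics), then recompose.
    t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
    stripped, _, err := transform.String(t, s)
    if err != nil {
        stripped = s // fall back to the original input on error
    }

    // Replace the separators _ and - with spaces.
    stripped = strings.NewReplacer("_", " ", "-", " ").Replace(stripped)

    // Collapse any run of whitespace into single spaces.
    return strings.Join(strings.Fields(stripped), " ")
}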

To test it out:

cd search/

go test . 
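
The tests can check the expected outputs shown above; a table-driven example (hypothetical search/normalize_test.go, the actual test file may differ):

package search

import "testing"

func TestNormalizeWords(t *testing.T) {
    cases := map[string]string{
        "français":                 "francais",
        "testé-lol_What to do-idk": "teste lol What to do idk",
    }
    for input, want := range cases {
        if got := NormalizeWords(input); got != want {
            t.Errorf("NormalizeWords(%q) = %q, want %q", input, got, want)
        }
    }
}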

Function calls from the warehouse

Refers to the index generation and index deletion functions (search.GenerateIndexes and search.RemoveIndexesHash) being called from the appropriate functions in the warehouse package.

func (wh *Warehouse) CreateFile(data io.Reader, fileSize uint64) (hash []byte, status int, err error) {
    // create a temporary file to hold the body content
    tmpFile, err := wh.tempFile()
    if err != nil {
        return nil, StatusErrorCreateTempFile, err
    }

    tmpFileName := tmpFile.Name()
    // generate search index for the file
    _, err = search.GenerateIndexes(tmpFileName)
    if err != nil {
        return nil, 0, err
    } ....
// DeleteFile deletes a file from the warehouse
func (wh *Warehouse) DeleteFile(hash []byte) (status int, err error) {
    path, _, status, err := wh.FileExists(hash)
    if status != StatusOK {
        return status, err
    }

    if err := os.Remove(path); err != nil {
        return StatusErrorDeleteFile, err
    }

    // Remove file generated indexes
    err = search.RemoveIndexesHash(hash)
    if err != nil {
        return 0, err
    }

    return StatusOK, nil
}

New method and struct added to warehouse

New struct introduced:

// SearchResult Search Response
type SearchResult struct {
    Path     string
    FileInfo os.FileInfo
}

Added a search function that checks whether a file is available locally in the warehouse:

// SearchFile Searches for file from the warehouse
func (wh *Warehouse) SearchFile(hash []byte) (searchResults []SearchResult, status int, err error) {
    // searches for file in index
    response, err := search.Search(hash)
    if err != nil {
        return nil, StatusSearchError, err
    }

    // Return empty response if the response length size is 0
    if len(response) == 0 {
        return nil, StatusSearchEmpty, nil
    }

    for i := range response {
        path, fileInfo, status, err := wh.FileExists(protocol.HashData([]byte(response[i])))
        if err != nil {
            return nil, status, err
        }
        if status == StatusOK {
            searchResults = append(searchResults, SearchResult{Path: path, FileInfo: fileInfo})
        }
    }

    return searchResults, StatusOK, nil
}
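
Hypothetical caller-side usage (the assumption here is that the hash passed in is the hash of a normalized search term):

// Look up a search term locally in the warehouse (sketch).
hash := protocol.HashData([]byte("example"))
results, status, err := wh.SearchFile(hash)
if err != nil {
    fmt.Printf("search failed (status %d): %v\n", status, err)
} else if status == StatusSearchEmpty {
    fmt.Println("no local results")
} else {
    for _, result := range results {
        fmt.Printf("found %s (%d bytes)\n", result.Path, result.FileInfo.Size())
    }
}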

Response codes introduced

StatusSearchEmpty = 17 // Search returned an empty response
StatusSearchError = 18 // Search error
Kleissner commented 2 years ago

Create set of hashes based on simple characteristics such as:

  - Basic normalization and sanitation
  - Upper case
  - Lower case
  - Individual words

Is this still the case? Normalization and sanitization including lowercasing must happen before hashing.

The only general reason for multiple hashes is multiple words (or if we find that a single word may lead to two new strong words after normalization; however, that is not the case with just lowercasing it, etc.).
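
A sketch of what that would mean in code, assuming normalization (including lowercasing) runs first and each resulting word is then hashed individually (the helper name is hypothetical):

// hashesForName normalizes a name and returns one hash per word (sketch).
func hashesForName(name string) (hashes [][]byte) {
    normalized := NormalizeWords(strings.ToLower(name)) // lowercase before hashing
    for _, word := range strings.Fields(normalized) {
        hashes = append(hashes, protocol.HashData([]byte(word)))
    }
    return hashes
}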

Akilan1999 commented 2 years ago

I will include lowercasing in the normalization step.

Kleissner commented 2 years ago

Can you adjust your editor workspace settings to autodetect line endings and tabs? The comparison https://github.com/PeernetOfficial/core/pull/67/files is a mess and makes it impossible to check which code was actually changed.

Can you create a new PR with fixed settings so that we have a comparison of changed code?

Kleissner commented 2 years ago

Another thing we need is "unindexing" - when files are removed from the blockchain. That's why we need to keep a reference counter or something similar to know when we can remove a hash from the index.
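
A sketch of that idea, with a hypothetical in-memory counter and a hypothetical deleteIndexRow helper (a real solution would persist the counter next to the index entry):

// RemoveReference decrements the reference counter for a hash and deletes
// the index entry once no blockchain record points at it anymore (sketch).
func (s *IndexStore) RemoveReference(hash []byte) error {
    key := string(hash)
    s.refCount[key]--
    if s.refCount[key] <= 0 {
        delete(s.refCount, key)
        return s.deleteIndexRow(hash) // hypothetical helper removing the DB row
    }
    return nil
}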

Akilan1999 commented 2 years ago

I will recreate the PR

Kleissner commented 2 years ago

There's another problem: GORM relies on https://github.com/mattn/go-sqlite3. Unfortunately it requires gcc, which makes it non-native Go code. We don't want to suddenly introduce gcc and extra build flags (go-sqlite3 requires specifying custom build tags depending on the OS).

We'll have to find another solution, either a native Go SQLite implementation (if one exists) or some other kind of database storage.
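
One candidate for a native implementation is modernc.org/sqlite (pure Go, registers a database/sql driver named "sqlite"); a minimal sketch, assuming it would be an acceptable fit:

import (
    "database/sql"

    _ "modernc.org/sqlite" // pure-Go SQLite driver, no gcc needed
)

// openIndexDB opens the index database without CGo (sketch).
func openIndexDB(path string) (*sql.DB, error) {
    return sql.Open("sqlite", path)
}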