StractOrg / stract

web search done right
https://stract.com
GNU Affero General Public License v3.0
1.94k stars 43 forks

How to index from zim files? #210

Closed jayay closed 1 week ago

jayay commented 1 week ago

I downloaded some zim files and was wondering whether it is possible to create a local index from them. Is there a (more or less) easy way to do this? Is it worth the effort to get a small index to work with? My goal is to get a self-built index and a search server running so that I can experiment with different parameters.

mikkeldenker commented 1 week ago

We have a zim parser that we use to build the Wikipedia index for the sidebar, so it would theoretically be possible, but it would require quite extensive modifications to the indexer and the webgraph indexer.

By far the easiest way to get up and running locally is to download some WARC files from Common Crawl and create the index from those: https://commoncrawl.org/latest-crawl
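For example, a few WARC files can be fetched roughly like this. The crawl ID below is an assumption for illustration — substitute a current one from the page above:

```shell
# Sketch: download a few WARC files from Common Crawl.
# CRAWL is an example/assumed crawl ID -- pick a current one from
# https://commoncrawl.org/latest-crawl
CRAWL="CC-MAIN-2024-10"
mkdir -p data
# warc.paths.gz lists the relative path of every WARC file in the crawl;
# grab the first few and store them under simple numeric names.
i=0
curl -s "https://data.commoncrawl.org/crawl-data/${CRAWL}/warc.paths.gz" \
  | gunzip | head -n 5 \
  | while read -r path; do
      curl -s -o "data/${i}.warc.gz" "https://data.commoncrawl.org/${path}"
      i=$((i + 1))
    done
```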

After you have downloaded the WARC files and compiled stract, you can create an index with the following steps:

1. Create the host and page webgraphs using `./stract webgraph create configs/webgraph.toml` where `webgraph.toml` looks like
```toml
host_graph_base_path = "data/webgraph"
page_graph_base_path = "data/webgraph_page"

[warc_source]
type = "Local"
folder = "./data"
names = ["0.warc.gz", "1.warc.gz", "2.warc.gz", "3.warc.gz", "4.warc.gz"]
```


2. Calculate the harmonic centrality values for both graphs
`./stract centrality host data/webgraph data/centrality && ./stract centrality page data/webgraph_page/ data/centrality_page`

3. Create the search index
`./stract indexer search configs/indexer.toml`
where `indexer.toml` looks like
```toml
host_centrality_store_path = "./data/centrality"
output_path = "./data/index"

[warc_source]
type = "Local"
folder = "./data"
names = ["0.warc.gz", "1.warc.gz", "2.warc.gz", "3.warc.gz"]
```

4. Start a search server and an API server using `./stract search-server configs/search_server.toml & ./stract api configs/api.toml` where `search_server.toml` looks like
```toml
cluster_id = "dev_search"
gossip_addr = "0.0.0.0:3006"
gossip_seed_nodes = ["0.0.0.0:3005"]
host = "0.0.0.0:3002"
index_path = "data/index"
shard = 0
```
and `api.toml` looks like
```toml
cluster_id = "dev_api"
gossip_addr = "0.0.0.0:3005"
gossip_seed_nodes = ["0.0.0.0:3006"]
host = "0.0.0.0:3000"
prometheus_host = "0.0.0.0:3001"

[widgets]
thesaurus_paths = []
```


This will start a search server on port 3002 and an API server on port 3000, with the API server in the foreground. You can stop the servers with ctrl+c, followed by `fg` to bring the search-server process to the foreground and another ctrl+c. The job of the API server is to aggregate results from multiple search servers, whereas each search server exposes a single shard of the search index.
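Once both servers are up, you can sanity-check the setup by querying the API server directly. The endpoint path below is an assumption based on the hosted instance's API and may differ in your checkout — check the API server's routes if it 404s:

```shell
# Hypothetical sanity check -- the /beta/api/search path is an assumption
# taken from the hosted stract API and may differ in your build.
curl -s -X POST "http://localhost:3000/beta/api/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "example"}'
```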

You should now be able to start the frontend, e.g. by running `npm run dev` in the `frontend` folder, which should serve it at http://localhost:8000.

I'll close this issue, but feel free to comment here in case you run into any problems or anything.