StractOrg / stract

web search done right
https://stract.com
GNU Affero General Public License v3.0
1.94k stars 43 forks

How to index from zim files? #210

Closed jayay closed 1 week ago

jayay commented 1 week ago

I downloaded some zim files and was wondering whether it is possible to create a local index from them. Is there a (more or less) easy way to do this? Is it worth the effort to get a small index to work with? My goal is to get a self-built index and a search server running so that I can experiment with different parameters.

mikkeldenker commented 1 week ago

We have a zim parser that we use to build the Wikipedia index for the sidebar, so it would theoretically be possible, but it would require quite extensive modifications to the indexer and the webgraph indexer.

By far the easiest way to get up and running locally is to download some WARC files from Common Crawl and create the index from those: https://commoncrawl.org/latest-crawl
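For example, a few WARC files can be fetched roughly like this. The crawl ID below is an assumption for illustration — substitute a current one from the page above:

```shell
# Sketch: download a few WARC files from Common Crawl.
# CRAWL is an example/assumed crawl ID -- pick a current one from
# https://commoncrawl.org/latest-crawl
CRAWL="CC-MAIN-2024-10"
mkdir -p data
# warc.paths.gz lists the relative path of every WARC file in the crawl;
# grab the first few and store them under simple numeric names.
i=0
curl -s "https://data.commoncrawl.org/crawl-data/${CRAWL}/warc.paths.gz" \
  | gunzip | head -n 5 \
  | while read -r path; do
      curl -s -o "data/${i}.warc.gz" "https://data.commoncrawl.org/${path}"
      i=$((i + 1))
    done
```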

After you have downloaded the WARC files and compiled stract, you can create an index with the following steps:

1. Create the host and page webgraphs using `./stract webgraph create configs/webgraph.toml` where `webgraph.toml` looks like
```toml
host_graph_base_path = "data/webgraph"
page_graph_base_path = "data/webgraph_page"

[warc_source]
type = "Local"
folder = "./data"
names = ["0.warc.gz", "1.warc.gz", "2.warc.gz", "3.warc.gz", "4.warc.gz"]
```


2. Calculate the harmonic centrality values for both graphs
`./stract centrality host data/webgraph data/centrality && ./stract centrality page data/webgraph_page/ data/centrality_page`

3. Create the search index
`./stract indexer search configs/indexer.toml`
where `indexer.toml` looks like
```toml
host_centrality_store_path = "./data/centrality"
output_path = "./data/index"

[warc_source]
type = "Local"
folder = "./data"
names = ["0.warc.gz", "1.warc.gz", "2.warc.gz", "3.warc.gz"]
```

4. Start a search server and an API server using `./stract search-server configs/search_server.toml & ./stract api configs/api.toml` where `search_server.toml` looks like
```toml
cluster_id = "dev_search"
gossip_addr = "0.0.0.0:3006"
gossip_seed_nodes = ["0.0.0.0:3005"]
host = "0.0.0.0:3002"
index_path = "data/index"
shard = 0
```
and `api.toml` looks like
```toml
cluster_id = "dev_api"
gossip_addr = "0.0.0.0:3005"
gossip_seed_nodes = ["0.0.0.0:3006"]
host = "0.0.0.0:3000"
prometheus_host = "0.0.0.0:3001"

[widgets]
thesaurus_paths = []
```


This will start a search server on port 3002 and an API server on port 3000, with the API server in the foreground. You can stop the servers with ctrl+c, followed by `fg` to bring the search-server process to the foreground and another ctrl+c. The job of the API server is to aggregate results from multiple search servers, whereas each search server exposes a single shard of the search index.
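Once both servers are up, you can sanity-check the setup by querying the API server directly. The endpoint path below is an assumption based on the hosted instance's API and may differ in your checkout — check the API server's routes if it 404s:

```shell
# Hypothetical sanity check -- the /beta/api/search path is an assumption
# taken from the hosted stract API and may differ in your build.
curl -s -X POST "http://localhost:3000/beta/api/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "example"}'
```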

You should now be able to start the frontend, e.g. by running `npm run dev` in the `frontend` folder, which should serve it at http://localhost:8000.

I'll close this issue, but feel free to comment here in case you run into any problems or anything.