Closed ghthor closed 11 months ago
Replied inline
- does the scheduler utilize GPU when performing the index?
No. Only CPU is utilized.
- should I be building the dataset & index out of band from tabby serve and only loading them at tabby serve startup?
tabby serve supports automatic loading of index snapshots, so tabby scheduler can be run as a separate process that rebuilds the index at an interval.
- My company's monorepo produced a dataset size of 122M. Scheduler is consuming 9G of RAM and is still indexing after 3 hours of runtime, does this seem correct?
The RAM usage appears to be correct. The indexing strategy is not heavily optimized yet, so taking around 3 hours seems reasonable given the current implementation.
There are certain optimizations that can be done, but larger-scale indexing will likely require some degree of distribution.
@wsxiaoys Thanks for the quick response
3. My company's monorepo produced a dataset size of 122M. Scheduler is consuming 9G of RAM and is still indexing after 3 hours of runtime, does this seem correct?
It is actually still creating the index. It has been running for 5 hours now and the memory usage is up to 12G[1] according to my orchestration system (Nomad), BUT I'm not sure that memory is actually being consumed by the Rust binary. Maybe the usage is coming from memory-mapped files? I'm not exactly sure how to check that.
root@2485851e65d4:/# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 184844 0.1 0.0 4248 2688 pts/0 Ss 18:33 0:00 bash
root 184933 0.0 0.0 5900 2304 pts/0 R+ 18:33 0:00 \_ ps auxf
root 1 17.9 0.4 4668200 315632 ? Ssl 13:45 51:44 /opt/tabby/bin/tabby scheduler --now
root@2485851e65d4:/# echo $((315632 / 1024))
308
root@2485851e65d4:/# du -hs /tmp/
0 /tmp/
root@2485851e65d4:/# du -hs /data/index/
24M /data/index/
root@2485851e65d4:/# du -hs /data/dataset/
122M /data/dataset/
root@2485851e65d4:/# ps -eo pmem,comm,pid,maj_flt,min_flt,rss,vsz --sort -rss | numfmt --header --to=iec --field 4-5 | numfmt --header --from-unit=1024 --to=iec --field 6-7
%MEM COMMAND PID MAJFL MINFL RSS VSZ
0.4 tabby 1 56 2.0M 307M 4.5G
0.0 bash 184844 0 1.5K 2.7M 4.2M
0.0 ps 186286 0 191 2.3M 5.7M
0.0 numfmt 186287 0 98 1.2M 2.5M
0.0 numfmt 186288 0 103 1.2M 2.5M
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 4668196 320648 22108 S 13.7 0.5 52:38.41 tabby
[1]
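One way to check where the memory is going, offered as a rough sketch rather than anything Tabby-specific: the 12G reported by the orchestrator is most likely the container's cgroup total, which includes page cache for files the process reads and writes, while ps and top only show the process's resident set. The helper names below (rss_breakdown, cgroup_breakdown) are made up for illustration. The RssAnon/RssFile fields assume Linux 4.5 or newer, and the location of memory.stat depends on whether the host uses cgroup v1 or v2.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Tabby or Nomad) for splitting memory
# usage into anonymous memory (heap/stack) and file-backed memory.

# Per-process view: parse a /proc/<pid>/status-style file.
# RssAnon  = anonymous resident memory (what the binary itself allocated)
# RssFile  = resident file-backed pages (mmap'd files, shared libraries)
# Values in the file are in kB; output is in MiB (truncated).
rss_breakdown() {
  awk '
    /^RssAnon:/  { anon  = $2 }
    /^RssFile:/  { file  = $2 }
    /^RssShmem:/ { shmem = $2 }
    END {
      printf "anon_mib=%d file_mib=%d shmem_mib=%d\n", anon / 1024, file / 1024, shmem / 1024
    }
  ' "$1"
}

# Per-container view: parse a cgroup memory.stat file.
# cgroup v2 names the fields "anon" and "file"; cgroup v1 uses "rss" and "cache".
# Values in the file are in bytes; output is in MiB (truncated).
cgroup_breakdown() {
  awk '
    $1 == "anon" || $1 == "rss"   { anon = $2 }
    $1 == "file" || $1 == "cache" { file = $2 }
    END {
      printf "anon_mib=%d file_mib=%d\n", anon / 1048576, file / 1048576
    }
  ' "$1"
}

# Usage inside the container, where the scheduler is PID 1 (paths are assumptions):
#   rss_breakdown /proc/1/status
#   cgroup_breakdown /sys/fs/cgroup/memory.stat          # cgroup v2
#   cgroup_breakdown /sys/fs/cgroup/memory/memory.stat   # cgroup v1
```

If file_mib in the cgroup view is large while anon_mib stays close to the RSS shown by ps, the orchestrator's number is mostly page cache, which the kernel can reclaim, rather than memory held by the scheduler itself.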
should I be building the dataset & index out of band from tabby serve and only loading them at tabby serve startup
tabby serve supports automatic loading of index snapshots, so tabby scheduler can be run as a separate process that rebuilds the index at an interval.
Excellent, that does simplify the deploy greatly knowing the scheduler can be run against the same filestore that serve is configured to read from.
With a careful look, it seems we're hitting https://github.com/quickwit-oss/tantivy/issues/2156 (new tantivy indexer requires much higher memory for throughput).
Will send out a PR to fix.
@wsxiaoys
I built the main branch docker container locally and :rofl: You nailed it
❯ for d in /opt/tabby/data/dataset/ /opt/tabby/data/index/; do sudo find $d; sudo du -hs $d; done
/opt/tabby/data/dataset/
/opt/tabby/data/dataset/data.jsonl.12
/opt/tabby/data/dataset/data.jsonl.11
/opt/tabby/data/dataset/data.jsonl.10
/opt/tabby/data/dataset/data.jsonl.9
/opt/tabby/data/dataset/data.jsonl.8
/opt/tabby/data/dataset/data.jsonl.7
/opt/tabby/data/dataset/data.jsonl.6
/opt/tabby/data/dataset/data.jsonl.5
/opt/tabby/data/dataset/data.jsonl.4
/opt/tabby/data/dataset/data.jsonl.3
/opt/tabby/data/dataset/data.jsonl.2
/opt/tabby/data/dataset/data.jsonl.1
/opt/tabby/data/dataset/data.jsonl
122M /opt/tabby/data/dataset/
/opt/tabby/data/index/
/opt/tabby/data/index/.tantivy-writer.lock
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.store
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.fast
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.fieldnorm
/opt/tabby/data/index/.tantivy-meta.lock
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.term
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.idx
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.pos
/opt/tabby/data/index/meta.json
/opt/tabby/data/index/.managed.json
33M /opt/tabby/data/index/