ghthor commented 11 months ago

1. Does the scheduler utilize the GPU when building the index?
2. Should I be building the dataset & index out of band from tabby serve and only loading them at tabby serve startup?
3. My company's monorepo produced a dataset size of 122M. The scheduler is consuming 9G of RAM and is still indexing after 3 hours of runtime. Does this seem correct?

wsxiaoys commented 11 months ago

Replied inline

1. Does the scheduler utilize the GPU when building the index?

No. Only CPU is utilized.

2. Should I be building the dataset & index out of band from tabby serve and only loading them at tabby serve startup?

tabby serve supports automatic loading of index snapshots, so tabby scheduler can be run as a separate process that rebuilds the index at an interval.

3. My company's monorepo produced a dataset size of 122M. The scheduler is consuming 9G of RAM and is still indexing after 3 hours of runtime. Does this seem correct?

The RAM usage appears to be correct. However, the indexing strategy is not currently heavily optimized. Nonetheless, achieving the task in 3 hours seems like a reasonable outcome given the current implementation.

There are certain optimizations that can be done, but larger-scale indexing will likely require some degree of distribution.

ghthor commented 11 months ago

@wsxiaoys Thanks for the quick response

3. My company's monorepo produced a dataset size of 122M. The scheduler is consuming 9G of RAM and is still indexing after 3 hours of runtime. Does this seem correct?

It is actually still creating the index. It has been running for 5 hours now, and memory usage is up to 12G[1] according to my orchestration system (nomad), BUT I'm not sure that memory is actually being consumed by the Rust binary. Maybe the usage is coming from memory-mapped files? I'm not exactly sure how to check that.

root@2485851e65d4:/# ps auxf
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      184844  0.1  0.0   4248  2688 pts/0    Ss   18:33   0:00 bash
root      184933  0.0  0.0   5900  2304 pts/0    R+   18:33   0:00  \_ ps auxf
root           1 17.9  0.4 4668200 315632 ?      Ssl  13:45  51:44 /opt/tabby/bin/tabby scheduler --now
root@2485851e65d4:/# echo $((315632 / 1024))
308

root@2485851e65d4:/# du -hs /tmp/
0   /tmp/
root@2485851e65d4:/# du -hs /data/index/
24M /data/index/
root@2485851e65d4:/# du -hs /data/dataset/
122M    /data/dataset/

root@2485851e65d4:/# ps -eo pmem,comm,pid,maj_flt,min_flt,rss,vsz --sort -rss | numfmt --header --to=iec --field 4-5 | numfmt --header --from-unit=1024 --to=iec --field 6-7                   
%MEM COMMAND             PID  MAJFL  MINFL   RSS    VSZ
 0.4 tabby                 1     56    2.0M   307M    4.5G
 0.0 bash             184844      0   1.5K  2.7M   4.2M
 0.0 ps               186286      0    191  2.3M   5.7M
 0.0 numfmt           186287      0     98  1.2M   2.5M
 0.0 numfmt           186288      0    103  1.2M   2.5M

     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                           
      1 root      20   0 4668196 320648  22108 S  13.7   0.5  52:38.41 tabby
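
One way to approach the "how do I check that" question is to split the process's resident memory into anonymous and file-backed parts. The sketch below assumes a Linux kernel recent enough to expose the `RssAnon`, `RssFile` and `RssShmem` fields in `/proc/<pid>/status`; the helper name `rss_breakdown` is made up for illustration. Pages of memory-mapped files that are resident in the process show up under `RssFile`.

```shell
# Print the resident-memory breakdown from a /proc/<pid>/status file.
#   RssAnon  = anonymous memory (heap, stack)
#   RssFile  = file-backed pages, including memory-mapped files
#   RssShmem = shared memory
# The kernel reports these values in kB; they are converted to MiB here.
rss_breakdown() {
    awk '/^(VmRSS|RssAnon|RssFile|RssShmem):/ {
        printf "%s %d MiB\n", $1, $2 / 1024
    }' "$1"
}

# PID 1 is the scheduler in the container shown above.
if [ -r /proc/1/status ]; then
    rss_breakdown /proc/1/status
fi
```

If `VmRSS` stays in the hundreds of MiB while the orchestrator reports many GiB, the difference is not resident memory of the process itself.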

[1]

Screenshot_2023-11-10_13-36-39
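
Container-level memory figures are usually taken from the cgroup, and cgroup accounting includes page cache as well as process memory. That is one possible explanation for a gap between the 12G in the screenshot and the ~308M RSS above, not a confirmed diagnosis. A minimal sketch for inspecting it from inside the container, assuming the cgroup filesystem is mounted at the usual location; `cgroup_mem_summary` is a hypothetical helper name:

```shell
# Summarise anonymous vs. file-backed (page cache) memory from a cgroup
# memory.stat file. cgroup v2 names the counters "anon" and "file";
# cgroup v1 uses "rss" and "cache". Values are in bytes.
cgroup_mem_summary() {
    awk '$1 == "anon" || $1 == "file" || $1 == "rss" || $1 == "cache" {
        printf "%s %d MiB\n", $1, $2 / (1024 * 1024)
    }' "$1"
}

# Try the usual cgroup v2 and v1 locations; either may be absent.
for f in /sys/fs/cgroup/memory.stat /sys/fs/cgroup/memory/memory.stat; do
    if [ -r "$f" ]; then
        cgroup_mem_summary "$f"
    fi
done
```

A large `file` (or `cache`) value next to a small `anon` (or `rss`) value would point at file-backed pages rather than the scheduler's own allocations.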

Inband scheduler

2. Should I be building the dataset & index out of band from tabby serve and only loading them at tabby serve startup?

tabby serve supports automatic loading of index snapshots, so tabby scheduler can be run as a separate process that rebuilds the index at an interval.

Excellent, that simplifies the deployment greatly, knowing that the scheduler can be run against the same filestore that serve is configured to read from.
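
A minimal sketch of the out-of-band setup discussed here, assuming the binary path and the `scheduler --now` invocation that appear in the ps output earlier in this thread. `TABBY_BIN` and `run_index_once` are hypothetical names introduced for illustration, not part of tabby.

```shell
# Path taken from the ps output in this thread; override for other layouts.
TABBY_BIN="${TABBY_BIN:-/opt/tabby/bin/tabby}"

# Run one indexing pass as its own process and report exit status and duration.
run_index_once() {
    start=$(date +%s)
    status=0
    "$TABBY_BIN" scheduler --now || status=$?
    end=$(date +%s)
    echo "scheduler exited with status $status after $((end - start))s"
    return "$status"
}

# Usage (for example from a periodic batch job in the orchestrator):
#   run_index_once
```

tabby serve would run as a separate long-lived process pointed at the same data directory, picking up the new index as described above.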

wsxiaoys commented 11 months ago

On closer inspection, it seems we're hitting https://github.com/quickwit-oss/tantivy/issues/2156 (the new tantivy indexer requires much more memory for throughput).

Will send out a PR to fix it.

ghthor commented 11 months ago

@wsxiaoys

I built the main branch docker container locally and :rofl: You nailed it

Runtime 2mins :rofl:

Screenshot_2023-11-10_15-23-04

❯ for d in /opt/tabby/data/dataset/ /opt/tabby/data/index/; do sudo find $d; sudo du -hs $d; done
/opt/tabby/data/dataset/
/opt/tabby/data/dataset/data.jsonl.12
/opt/tabby/data/dataset/data.jsonl.11
/opt/tabby/data/dataset/data.jsonl.10
/opt/tabby/data/dataset/data.jsonl.9
/opt/tabby/data/dataset/data.jsonl.8
/opt/tabby/data/dataset/data.jsonl.7
/opt/tabby/data/dataset/data.jsonl.6
/opt/tabby/data/dataset/data.jsonl.5
/opt/tabby/data/dataset/data.jsonl.4
/opt/tabby/data/dataset/data.jsonl.3
/opt/tabby/data/dataset/data.jsonl.2
/opt/tabby/data/dataset/data.jsonl.1
/opt/tabby/data/dataset/data.jsonl
122M    /opt/tabby/data/dataset/
/opt/tabby/data/index/
/opt/tabby/data/index/.tantivy-writer.lock
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.store
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.fast
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.fieldnorm
/opt/tabby/data/index/.tantivy-meta.lock
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.term
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.idx
/opt/tabby/data/index/81609d57a6f84b789855c51f34c00370.pos
/opt/tabby/data/index/meta.json
/opt/tabby/data/index/.managed.json
33M /opt/tabby/data/index/

TabbyML / tabby

[deployment architecture] general questions about how to run the `scheduler` #752