PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

not scalable - is it a limitation? #379

Open bp020108 opened 1 year ago

bp020108 commented 1 year ago

ingest.py is not working when I add more documents (it only worked for 2-3 documents). ingest.py is not able to finish embedding for more documents; I have waited 10-15 hours and it never completes. Is this a limitation, or is there another way to solve it?

I have a high-end HP Gen server:

*-memory
     description: System Memory
     physical id: 1000
     slot: System board or motherboard
     size: 384GiB
     capacity: 3TiB
     capabilities: ecc
     configuration: errordetection=multi-bit-ecc

aabalke33 commented 1 year ago

What size are the files you are ingesting? The number of docs should not be a problem.

bp020108 commented 1 year ago

The size is not more than a few MB, and I have tried with just a few files as well. Only 1-2 files work.

bp020108 commented 1 year ago

A 934 KB txt file does not even complete. Is there any issue with txt files?

malakhovks commented 1 year ago

Dear @bp020108 @aabalke33, the weak point here is not RAM size. First of all, you need a lot of CPU cores to process (ingest) more docs. The other weak point is memory bandwidth. That's why all of this works great on an M1 or M2 chip. You can read more about that here: How is LLaMa.cpp possible?

bp020108 commented 1 year ago

22 cores with 2 threads per core, and a Matrox G200eW3 VGA graphics card. Is it not more powerful than an M1/M2 chip? Do I need an external Nvidia GPU for this ingest to work?

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:   0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2412.458
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              30976K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

XX~$ lspci | grep VGA
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)

*-display
     description: VGA compatible controller
     product: Integrated Matrox G200eW3 Graphics Controller
     vendor: Matrox Electronics Systems Ltd.
     physical id: 0
     bus info: pci@0000:03:00.0
     version: 04
     width: 32 bits
     clock: 66MHz
     capabilities: pm vga_controller bus_master cap_list rom
     configuration: driver=mgag200 latency=64 maxlatency=32 mingnt=16
     resources: irq:16 memory:91000000-91ffffff memory:92808000-9280bfff memory:92000000-927fffff memory:c0000-dffff

bp020108 commented 1 year ago

Is a small chunk size the problem?

ingest.py produced 144824 chunks of text:

2023-08-16 19:58:33,579 - INFO - ingest.py:130 - Loaded 196 documents from /home/fabtool/localgpt_llama2/SOURCE_DOCUMENTS
2023-08-16 19:58:33,579 - INFO - ingest.py:131 - Split into 144824 chunks of text

malakhovks commented 1 year ago

22 cores with 2 threads per core, and a Matrox G200eW3 VGA graphics card. Is it not more powerful than an M1/M2 chip? Do I need an external Nvidia GPU for this ingest to work?


Dear @bp020108 @Philip922 @PromtEngineer

44 threads is very limited for this type of task. You need a GPU. For rack servers I recommend the NVIDIA Quadro RTX 4000, a single-slot GPU with a reasonable price and performance (2304 NVIDIA CUDA® cores).

M2 GPU: The M2 integrates an Apple-designed ten-core (eight in some base models) graphics processing unit (GPU). Each GPU core is split into 16 execution units, which each contain eight arithmetic logic units (ALUs). In total, the M2 GPU contains up to 160 execution units or 1280 ALUs, which have a maximum floating-point (FP32) performance of 3.6 TFLOPS. quadro-rtx-4000-datasheet.pdf

Is it not more powerful than an M1/M2 chip?

44 threads vs. 1280 ALUs: yes, for this type of task the M1/M2 is more suitable and more powerful, but only for testing purposes, not for production.

The more clients that connect simultaneously and request inference, the more powerful a GPU you need (an A100, or a cluster of them).

P.S.: I also have a couple of HP servers; using 28 cores, 1000 PDFs took about 6 hours to process.

bp020108 commented 1 year ago

Thanks for the explanation.

But as you said, you were able to process 1000 PDFs on 28 cores, while I am not able to ingest more than 3 files with 22 cores. So are any changes required in ingest.py?

I am not even at the point of checking the number of users yet, because ingest.py gets stuck. I thought ingesting files should work with 22 cores. Please help.

I will plan for a GPU later, but I want to make sure this works on CPU first.


malakhovks commented 1 year ago

@bp020108 My advice: first, try waiting for the ingest process to finish. Just run it and leave it for the whole night. The time depends not only on the number of PDFs but also on their size.

Here is the plan:

  1. ssh to your machine
  2. cd into the project's folder
  3. activate the conda environment
  4. run python ingest.py --device_type cpu
  5. press CTRL+Z to pause the process
  6. then run the bg command to resume it in the background
  7. then run jobs -l to list the job number required for the disown command
  8. Then you can run disown %1 (replace 1 with the job number output by jobs -l)
  9. exit ssh and go to sleep
  10. In the morning, try to run python run_localGPT.py

bp020108 commented 1 year ago

Thanks, I will try this. But is there a way to check whether the ingest process has completed or not?

I have also tried another project to see whether this is a CPU issue. In that project, the ingest script does complete; it takes time for more files, but it finishes, and it shows a percentage so you can track progress. Is there a plan to enhance the ingest script in this project to make it faster or to show a progress percentage?

creuzerm commented 1 year ago

Are you trying a CPU ingest or a GPU ingest?

I am using a much smaller, older dual-CPU Intel(R) Xeon(R) E5-2670 with 80 GB of RAM - not that the RAM matters, as it only ever uses less than 3.5 GB. I run it as a container in Proxmox, so I get pretty usage graphs. It also only ever graphs at 40% CPU utilization for me, so I assume it parallelizes at core count rather than thread count, which is probably optimal given my understanding of hyperthreading.

a CPU ingest "python ingest.py --device_type cpu" on 34 PDFs for a bit over 500MB takes 5 hours with everything else stock config. Fresh git pull, drop docs into SOURCE_DOCUMENTS, run the ingest command.

You should be able to complete the ingestion with the --device_type cpu flag. If you still have issues, try the files individually to see whether one particular file is giving you fits for some unknown reason.

When the process finishes, we get dropped back to the command prompt. I look at my system utilization graphs the next day to see how long it took.

bp020108 commented 1 year ago

Thanks for the comparison. I am trying 2 MB PDF files and they are not finishing within a few minutes, which is why I think something is wrong.
Can you please provide these details so I can try them and see if there is any improvement:

model basename
model ID
chunk size in ingest.py
CTX_SIZE in run_localGPT.py

What Python version?

creuzerm commented 1 year ago

Yeah, if we look at https://github.com/PromtEngineer/localGPT/blob/main/constants.py, we see:

MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"

The chunksize looks to be formulaic in https://github.com/PromtEngineer/localGPT/blob/main/ingest.py line 57, so that could be a difference for us: chunksize = round(len(paths) / n_workers)
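To make that formula concrete, here is a rough worked example using numbers quoted in this thread (196 documents from the earlier ingest log, and 88 workers matching the os.cpu_count() reported further down). Note that this chunksize appears to be the per-worker document batch size used while loading files, which is a different knob from the text splitter's chunk_size discussed later:

```python
# Sketch of the batching math from ingest.py line 57, using numbers quoted in this thread.
# Assumption: n_workers comes from the CPU count used for the ingest worker pool.
n_workers = 88        # os.cpu_count() reported for the Xeon Gold 6152 box below
n_documents = 196     # "Loaded 196 documents" in the earlier ingest log
chunksize = round(n_documents / n_workers)
print(chunksize)      # 2 -> roughly two documents handed to each loader worker
```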

CTX size, from https://github.com/PromtEngineer/localGPT/blob/main/run_localGPT.py line 52: max_ctx_size = 2048

Python 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] on linux

Even the sample file, constitution.pdf, took more than a few minutes for me to ingest. Once things are rolling on this machine, I am getting about an hour per 100 MB of documents pretty consistently as I grow my document set.

Also, I now see in https://github.com/PromtEngineer/localGPT/blob/main/constants.py that the ingest thread count is set to the CPU count. If you are getting a lower-than-expected count for whatever reason, this could artificially bottleneck your ingest process. This value also scales our chunksize, so maybe you are seeing some poor results from that math? Maybe see what os.cpu_count() gives you for a number? My system gives me:

(localGPT) dev@localGPT:/home/localGPT# python
Python 3.11.4 (main, Jul 5 2023, 13:45:01) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
32

Also, I tried an ingest of a couple of PDF documents on my laptop, a one-generation-old i7 with an MX450, and killed the process after 18 hours because I needed to do work.

creuzerm commented 1 year ago

Also, I think the model details have no bearing on the ingest process for us, other than that we need to use a compatible embedding - HuggingFaceInstructEmbeddings by default in our case.

Our ingest process takes the long strings, breaks them into manageable chunks of tokens (roughly words), runs each chunk through the embedding model to get a vector of numbers, and stores those vectors in the database - I think. I'm 80% confident this is what's going on; I likely have a detail or two wrong.
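In langchain terms it looks roughly like the sketch below. This is an illustration of the idea, not the exact localGPT code: it assumes the pieces visible in this thread's logs (RecursiveCharacterTextSplitter, the hkunlp/instructor-large embeddings, a Chroma store persisted to a DB folder), and the example.txt path is just a placeholder.

```python
# Minimal sketch of the ingest pipeline, assuming the langchain APIs of mid-2023.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

# 1. Load a document (localGPT walks everything under SOURCE_DOCUMENTS).
documents = TextLoader("SOURCE_DOCUMENTS/example.txt").load()

# 2. Split into overlapping character chunks (1000/200 are the values discussed below).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Embed each chunk with the instructor model seen in the logs, on CPU here.
embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    model_kwargs={"device": "cpu"},
)

# 4. Store the resulting vectors in a persistent Chroma database.
db = Chroma.from_documents(chunks, embeddings, persist_directory="DB")
db.persist()
```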

creuzerm commented 1 year ago

Oh, I see in your issue #374 that you have a quarter-million chunks for a few text files, and earlier in this thread we saw 150k chunks for what should be only tens of thousands of words (tokens). You may simply be breaking things down into pieces that are too small (not your fault, just an oddity of the source files and your machine).

https://www.pinecone.io/learn/chunking-strategies/ I see some thoughts in this article that could steer towards an intentional chunk size rather than an accidental chunk size.
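As a crude back-of-envelope check (my rule of thumb, not from the article): the embedding model in the logs, hkunlp/instructor-large, reports max_seq_length 512, and English prose averages roughly 4 characters per token, so a character-based chunk_size around 1000 stays well under that limit, while much larger chunks would get truncated by the embedder:

```python
# Back-of-envelope chunk sizing. Assumption: ~4 characters per token for English prose.
MAX_SEQ_LENGTH = 512   # reported by hkunlp/instructor-large in the ingest logs
CHARS_PER_TOKEN = 4    # rough average, not exact

def approx_tokens(chunk_size_chars: int) -> float:
    """Estimate how many tokens a character-based chunk occupies."""
    return chunk_size_chars / CHARS_PER_TOKEN

for chunk_size in (500, 1000, 2000, 4000):
    tokens = approx_tokens(chunk_size)
    verdict = "fits" if tokens <= MAX_SEQ_LENGTH else "would be truncated by the embedder"
    print(f"chunk_size={chunk_size:5d} chars -> ~{tokens:4.0f} tokens ({verdict})")
```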

https://github.com/PromtEngineer/localGPT/blob/main/ingest.py line 57 may need to be very different for you. Can you report back what that math works out to be? There are other related issues, such as #382, that could be bundled into this thread.

bp020108 commented 1 year ago

Thanks for the details. Let me read the chunking strategies article to figure out what my chunk_size should be. Currently I am using chunk_size=1000 and chunk_overlap=200 (lines 124 and 126 in ingest.py).

Python 3.11.2 (main, Mar 27 2023, 23:42:44) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
88

bp020108 commented 1 year ago

I have tried chunk_size=500 (chunk_overlap=50) and ctx_size=2048.

364 docs, 237 MB in total, produced 538K chunks and took 11 hours.
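For what it's worth, that chunk count is roughly what those splitter settings predict. The crude estimate below treats the 237 MB of files as if it were all extractable text (an assumption that overstates things for image-heavy PDFs), with the splitter advancing about chunk_size minus chunk_overlap characters per chunk:

```python
# Crude sanity check on the 538,579-chunk count (assumes file size ~ extracted text size).
total_chars = 237 * 1024 * 1024          # ~237 MB of source documents
chunk_size, chunk_overlap = 500, 50
stride = chunk_size - chunk_overlap      # characters the splitter advances per chunk

print(f"~{total_chars / stride:,.0f} chunks")   # ~552,250 vs. 538,579 in the log
```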

But now run_localGPT_API.py is not completing. (run_localGPT.py is fine.) After ingesting is done, why is the API script taking so long?

2023-08-18 22:10:19,510 - INFO - ingest.py:34 - Loading document batch
2023-08-18 22:10:19,532 - INFO - ingest.py:34 - Loading document batch
2023-08-18 22:11:39,047 - INFO - ingest.py:130 - Loaded 364 documents from /home/fabtool/localgpt_llama2/SOURCE_DOCUMENTS
2023-08-18 22:11:39,047 - INFO - ingest.py:131 - Split into 538579 chunks of text
2023-08-18 22:11:39,705 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
2023-08-18 22:11:39,749 - INFO - instantiator.py:21 - Created a temporary directory at /tmp/tmponmulqkk
2023-08-18 22:11:39,749 - INFO - instantiator.py:76 - Writing /tmp/tmponmulqkk/_remote_module_non_scriptable.py
max_seq_length 512
2023-08-18 22:11:43,822 - INFO - ctypes.py:22 - Successfully imported ClickHouse Connect C data optimizations
2023-08-18 22:11:43,829 - INFO - json_impl.py:45 - Using orjson library for writing JSON byte strings
2023-08-18 22:11:44,160 - INFO - duckdb.py:506 - loaded in 7262 embeddings
2023-08-18 22:11:44,161 - INFO - duckdb.py:518 - loaded in 1 collections
2023-08-18 22:11:44,163 - INFO - duckdb.py:107 - collection with name langchain already exists, returning existing collection
2023-08-19 08:59:09,714 - INFO - duckdb.py:460 - Persisting DB to disk, putting it in the save folder: /home/XXX/localgpt_llama2/DB
2023-08-19 08:59:30,722 - INFO - duckdb.py:460 - Persisting DB to disk, putting it in the save folder: /home/XXX/localgpt_llama2/DB
(localgpt_llama2) XX@yy1C002:~/localgpt_llama2$ python run_localGPT_API.py --device_type cpu
load INSTRUCTOR_Transformer
max_seq_length 512