abgulati / LARS

An application for running LLMs locally on your device, with your documents, facilitating detailed citations in generated responses.
https://www.youtube.com/watch?v=Mam1i86n8sU&ab_channel=AbheekGulati
GNU Affero General Public License v3.0

Typo in requirements.txt #22

Closed synchronic1 closed 3 months ago

synchronic1 commented 3 months ago

I think there is a typo in requirements.txt: the entry `install>=1.3.5` results in the error `ERROR: Could not find a version that satisfies the requirement install>=1.3.5 (from versions: none)` followed by `ERROR: No matching distribution found for install>=1.3.5`.
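For reference, this is the entry I mean (an excerpt only; the surrounding lines of the file are omitted here):

```text
# requirements.txt (excerpt, illustrative)
install>=1.3.5   # <-- pip cannot resolve this entry
```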

pip will also uninstall already-installed dependencies of the same versions. I don't know if this is due to how pip is being called.

 Attempting uninstall: typing_extensions
    Found existing installation: typing_extensions 4.7.1
    Uninstalling typing_extensions-4.7.1:
      Successfully uninstalled typing_extensions-4.7.1
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.66.1
    Uninstalling tqdm-4.66.1:
      Successfully uninstalled tqdm-4.66.1
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstalling requests-2.31.0:
      Successfully uninstalled requests-2.31.0
  Attempting uninstall: torch
    Found existing installation: torch 2.0.1
    Uninstalling torch-2.0.1:
      Successfully uninstalled torch-2.0.1
  Attempting uninstall: torchvision
    Found existing installation: torchvision 0.15.2
    Uninstalling torchvision-0.15.2:
      Successfully uninstalled torchvision-0.15.2
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.3
    Uninstalling transformers-4.42.3:
      Successfully uninstalled transformers-4.42.3
**ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.0.2+cu118 requires torch==2.0.1, but you have torch 2.4.0 which is incompatible.**

I also wanted to add that as of the latest v2 update, the terminal returns responses very quickly, but the browser is very slow, and CPU utilization shows that it is doing work.

[screenshot]

There is also an additional error in the browser that prevents the inference results from being served:

[screenshots]

abgulati commented 3 months ago

Thanks for reporting! Do share a screenshot of your settings screen showing the LLM server in use. GGUFs do inference faster, so a slowdown is to be expected now that we're running LLMs in their native Transformers format. Even though on-the-fly quantization is used by default, it will be slower than a GGUF, though with the benefit of running LLMs natively from their source, without the need to download or convert GGUFs. Also, there's no need to wait for llama.cpp updates when new models drop: simply add the model to the list and you're good to go, with at most a pip update to transformers, which is far easier than a llama.cpp recompile and update!
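For context, on-the-fly quantization through the transformers library looks roughly like the sketch below. This is illustrative only: the model name, the 4-bit settings, and the assumption that bitsandbytes is installed alongside a CUDA GPU are all mine for the example, and this is not necessarily what LARS/HF-Waitress does internally.

```python
# Illustrative sketch of on-the-fly quantization via transformers + bitsandbytes.
# Model name and quantization settings are assumptions for this example only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # any HF causal LM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights on the fly at load time
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for a speed/accuracy balance
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the available GPU(s) automatically
)

inputs = tokenizer("Hello, LARS!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```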

Now as for the referencing issue, please share the lars_server_log and hf_server_log files so I can investigate further.

abgulati commented 3 months ago

Also, do make sure you've installed PyTorch correctly via the CUDA branch, as your CPU should not be used for inferencing if things are set up correctly for GPUs.

Since you've had a similar issue with llama.cpp, I'd recommend the following (a quick verification sketch follows below):

  1. Verify Nvidia driver and CUDA toolkit setup
  2. Compile a sample CUDA hello-world program
  3. Verify correct PyTorch setup
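Steps 1 and 2 are typically checked with nvidia-smi and nvcc --version on the command line. For step 3, a quick sanity check from Python (assuming PyTorch is already installed) looks something like this:

```python
# Quick PyTorch/CUDA sanity check (illustrative; run after installing the CUDA build of PyTorch)
import torch

print("torch version:", torch.__version__)           # e.g. 2.0.1+cu118 for a CUDA 11.8 build
print("CUDA available:", torch.cuda.is_available())  # must be True for GPU inferencing
print("CUDA build:", torch.version.cuda)             # None on a CPU-only build

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    # A tiny matmul on the GPU confirms the device works end-to-end
    x = torch.randn(1024, 1024, device="cuda")
    print("matmul OK:", (x @ x).shape)
```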

In the coming days, I'll update the containers, which should ease deployments further.

For GPU use in containers, ensure the Nvidia Container Toolkit is correctly set up in WSL on Windows as described in the README: https://github.com/abgulati/LARS?tab=readme-ov-file#building--running-the-nvidia-cuda-gpu-enabled-container

abgulati commented 3 months ago

Also, if you have a previous installation of PyTorch, it may be a CPU-only setup or built for an older version of CUDA, so I would recommend running pip uninstall torch and reinstalling the correct CUDA build from the PyTorch page!

synchronic1 commented 3 months ago

Uninstalled/reinstalled via pip:

`pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` (where cu118 corresponds to the installed CUDA toolkit version)

Verified that PyTorch is installed for the correct CUDA version:

[screenshot]

synchronic1 commented 3 months ago

[screenshot]

abgulati commented 3 months ago

Hi @synchronic1, as per your last update to me offline, you got LARS working correctly with both HF-Waitress and llama.cpp after a re-install of PyTorch, and everything, including GPU utilization, is functioning as expected.

Do confirm so I can proceed to close this issue!

synchronic1 commented 3 months ago

The issue was solved by editing the requirements.txt file to remove the reference to install>=1.3.5.

Also, installing the PyTorch build matching the CUDA version indicated on the PyTorch page seems to have solved some issues.

hf_server_log.log lars_server_log.log llama.log

abgulati commented 3 months ago

I've looked into that specific library and deemed it safe to remove from the requirements list, and have done so: all requirements text files have been updated.

Glad to hear the suggestion to verify and reinstall PyTorch was helpful in resolving the issue.

Thanks for reporting and sharing confirmation that the issue is resolved! Closing 🍻

abgulati commented 3 months ago

Also, having been through the logs, here are my findings:

  1. No issue in hf_server_log: The error "AttributeError: 'Phi3Config' object has no attribute 'head_dim'" is an expected one for Phi3, as that attribute indeed doesn't exist. The /health endpoint used to check whether the server is online queries a number of model details, and different LLMs may or may not contain values for some architectural parameters. When a value is missing, it's noted and the application moves on, returning whichever details it can, which is what we're seeing here (a sketch of this pattern follows after this list). I will update the server output to clarify this further in a future update.

  2. lars_server_log: a few different things to unpack here:
     A) It indicates that llama.cpp was already running on your system; since it was launched external to LARS, LARS cannot terminate a process it doesn't own, so that's good and expected behavior. LARS does attempt to keep only one LLM server running to conserve memory! It will warn the user when both servers are online and recommend shutting down the one not in use by LARS, unless you're certain it's safe and necessary to keep both up.
     B) While HF-Waitress initially took a while to come online, it did eventually come up and was likely just taking a while to download the required LLM. This is indicated as the likely cause in the error too. I will be refining the server to make its internal reporting and health checking more powerful and capable in future updates, so that such detailed status can be verified.
     C) Document chunking error: this one concerns me, as it most likely indicates that an empty document was being embedded into the vectorDB. Most likely, either an OCR API endpoint was used without the correct key, or a scanned document was loaded using plain-text extraction; in either case a blank text file would be generated, which might then lead to such an error (a guard against this is sketched at the end of this comment). I'll make sure to add a warning for such scenarios in the future so the user is made aware of the problem. Do share more details if the problem persists; if you resolved it or know the cause based on the document that was being loaded, do share details here for clarity.

  3. llama.log: This is simply generated by llama.cpp and indicates no issues.
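To illustrate point 1, the pattern is roughly the following. This is a simplified sketch, not HF-Waitress's actual code, and the attribute list and return shape are assumptions for the example:

```python
# Simplified sketch of how a /health-style check can report optional model-config
# attributes without failing when one (e.g. head_dim on Phi3Config) doesn't exist.
# Illustrative only; not HF-Waitress's actual implementation.
from transformers import AutoConfig

def collect_model_details(model_id: str) -> dict:
    config = AutoConfig.from_pretrained(model_id)
    details = {}
    # Not every architecture defines every parameter; missing ones are noted and skipped.
    for attr in ("hidden_size", "num_attention_heads", "num_hidden_layers", "head_dim"):
        try:
            details[attr] = getattr(config, attr)
        except AttributeError:
            details[attr] = None  # the server notes this and moves on
    return details

# With transformers ~4.42, head_dim comes back as None for Phi3:
print(collect_model_details("microsoft/Phi-3-mini-4k-instruct"))
```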

Thanks for sharing these logs, do update further on the document loading issue!
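On the chunking error specifically, the guard I have in mind would look something like the sketch below. The function names and the naive chunker are hypothetical, just to keep the example self-contained; this is not LARS's actual pipeline code:

```python
# Illustrative guard against embedding an empty/blank document into the vectorDB.
# The chunker below is a stand-in; the real chunking/embedding pipeline differs.
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Naive fixed-size chunker used only to make the sketch runnable."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunks_for_embedding(extracted_text: str, source_name: str) -> list[str]:
    # A blank extraction usually means a scanned document was read with plain-text
    # extraction, or an OCR endpoint was called without a valid API key.
    if not extracted_text or not extracted_text.strip():
        raise ValueError(
            f"No text extracted from '{source_name}': "
            "check the extraction method or OCR API key before embedding."
        )
    return chunk_text(extracted_text)

print(len(chunks_for_embedding("Some extracted document text...", "report.pdf")))
```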