2ool opened this issue 1 year ago (Open)
These are some good questions and definitely worthy of a write-up when I find the time.
The GPU will automatically be used if torch can find it. How that works is a little unclear, even to me, especially with regard to installing Semantra via pipx and knowing which torch version gets used. After pip install'ing pytorch, you can try running Python in the shell and executing import torch; print(torch.cuda.is_available()) to see whether GPU support is enabled (it will print True if it is). You may have to install a build of pytorch that has CUDA support, or install the right drivers. An approach like the one in this link worked for me on Ubuntu https://stackoverflow.com/a/71159448 but you may have to dig around to find out how to do it on Windows. When print(torch.cuda.is_available()) shows True, you'll know it should work.
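The check above can be wrapped into a small standalone script. This is just a sketch (not part of Semantra) that reports the GPU situation and doesn't crash on machines where torch isn't installed:

```python
import importlib.util

def cuda_status() -> str:
    """Describe whether PyTorch can see a CUDA-capable GPU."""
    # Guard the import so the script also runs where torch isn't installed.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return f"CPU only (torch {torch.__version__}, CUDA build: {torch.version.cuda})"

if __name__ == "__main__":
    print(cuda_status())
```

If this prints "CPU only" with a CUDA build of None, you have a CPU-only wheel installed and need to reinstall torch from a CUDA index.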
The breakdown of what disk space is being used looks like:
To sum it up, you can safely delete things, but that just means you may lose cached models/embeddings.
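To see where the space is actually going, something like the following sketch works. The Hugging Face cache path is the library's documented default; the Semantra-specific path shown is an assumption and may differ on your platform:

```python
from pathlib import Path

def dir_size_mb(path: Path) -> float:
    """Total size of all files under path, in megabytes (0 if it doesn't exist)."""
    if not path.exists():
        return 0.0
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e6

# ~/.cache/huggingface is where downloaded transformer models land by default;
# the Semantra data directory below is a guess -- adjust for your system.
for cache in [Path.home() / ".cache" / "huggingface",
              Path.home() / ".semantra"]:
    print(f"{cache}: {dir_size_mb(cache):.1f} MB")
```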
Is it possible to add the GPU version of pytorch to the requirements installed by pipx? I have a CUDA-enabled NVIDIA GPU, but it wasn't being used when I ran semantra on Windows. Would I install the GPU packages with pipx so they don't contaminate the main environment?
I tried to follow the official instructions from https://pytorch.org/get-started/locally/ but adapted to pipx:
pipx runpip semantra install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
But it doesn't work. The command
semantra --port 1234 --transformer-model sentence-transformers/paraphrase-multilingual-mpnet-base-v2 "example.pdf"
still uses the CPU only; there's no GPU/CUDA usage at all. The right libraries do appear to be installed in the environment, though:
λ pipx runpip semantra list
Package Version
------------------ ------------
aiohttp 3.8.4
aiosignal 1.3.1
annoy-fixed 1.16.3
async-timeout 4.0.2
attrs 23.1.0
blinker 1.6.2
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.3
colorama 0.4.6
filelock 3.12.2
Flask 2.3.2
frozenlist 1.3.3
fsspec 2023.6.0
huggingface-hub 0.16.2
idna 3.4
itsdangerous 2.1.2
Jinja2 3.1.2
MarkupSafe 2.1.3
mpmath 1.3.0
multidict 6.0.4
networkx 3.1
numpy 1.25.0
openai 0.27.8
packaging 23.1
Pillow 10.0.0
pip 23.1.2
pypdfium2 4.18.0
python-dotenv 1.0.0
PyYAML 6.0
regex 2023.6.3
requests 2.31.0
safetensors 0.3.1
semantra 0.1.6
setuptools 68.0.0
sympy 1.12
tiktoken 0.4.0
tokenizers 0.13.3
torch 2.0.1
torchaudio 2.0.2+cu117
torchvision 0.15.2+cu117
tqdm 4.65.0
transformers 4.30.2
typing_extensions 4.7.1
urllib3 2.0.3
Werkzeug 2.3.6
wheel 0.40.0
yarl 1.9.2
I don't know how you managed to create a semantra.exe file from a Python environment, though. Does it use the Python libraries from the environment, or are they somehow baked permanently into the executable?
C:\Users\endolith\.local\pipx\venvs\semantra\Scripts\python.exe
Python 3.10.11 | packaged by Anaconda, Inc. | (main, Apr 20 2023, 18:56:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.cuda.is_available())
False
Is this accurate? In the past I've tried downloading the CUDA toolkit from the website, and it was an ordeal and didn't work, so I'm hesitant to try that again.
I understand your confusion. The pipx runpip semantra install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117 command installs the PyTorch library with CUDA support, but it doesn't install the CUDA toolkit itself. The CUDA toolkit is separate software provided by NVIDIA that includes the CUDA runtime and the nvcc compiler, among other things. PyTorch uses the CUDA toolkit to run computations on the GPU.
Here's a brief explanation of the different components involved:
So, to use PyTorch with CUDA, you need to do two things:
Once you've done both of these things, PyTorch should be able to use CUDA to run computations on the GPU.
I don't think this is necessary when installing pytorch with conda? conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
You're correct. When you install PyTorch with Conda using the command you provided, Conda will automatically install the correct version of the CUDA toolkit as a dependency. This is one of the advantages of using Conda: it manages dependencies for you, including non-Python dependencies like the CUDA toolkit.
However, pip and pipx don't manage non-Python dependencies. When you install PyTorch with pip or pipx, you need to install the CUDA toolkit separately. This is why you're seeing the 'nvcc' is not recognized
error: the CUDA toolkit is not installed on your system.
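As a quick inventory of which of these separate pieces a machine actually has, you can check what's on PATH. This is a generic sketch, not Semantra-specific; nvidia-smi ships with the driver, nvcc with the toolkit:

```python
import shutil

# shutil.which returns the full path of an executable on PATH, or None.
components = {
    "driver (nvidia-smi)": shutil.which("nvidia-smi"),
    "toolkit (nvcc)": shutil.which("nvcc"),
}
for name, path in components.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Worth noting: the +cu117 pip wheels bundle the CUDA runtime libraries, so in practice a working driver is usually enough to run PyTorch on the GPU; nvcc mainly matters if you need to compile CUDA extensions.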
If you prefer to use Conda to manage your Python environments and dependencies, you could create a new Conda environment, install PyTorch in that environment using the command you provided, and then install semantra in the same environment. This way, semantra and PyTorch would both have access to the same CUDA toolkit installation.
Here's how you can do it:
conda create --name semantra_env
conda activate semantra_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install semantra
Now you should be able to run semantra with GPU support inside the semantra_env Conda environment.
You could try cloning Semantra, editing the requirements to include the GPU source you need, and then running pip install -e . to install Semantra locally from the repo. Then run the torch.cuda.is_available() trick to see if it works.
I think on the whole, though, I need a more plug-and-play approach. Will target for v0.2.
I tried to use pip from within the pipx virtual environment, but it wasn't working. Then I saw "permission denied" errors and realized it was resolving to my system pip and not the one in the pipx virtual env.
It only worked with the commands below:
C:\Users\%USERNAME%\.local\pipx\venvs\semantra\Scripts\activate
C:\Users\%USERNAME%\.local\pipx\venvs\semantra\Scripts\python.exe -m pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
After running that, semantra now uses the GPU instead of the CPU.
I think switching to python -m venv instead of pipx would prevent this in the future: inside an activated venv, pip reliably resolves to the environment's own copy.
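For reference, the python -m venv route looks roughly like this (a sketch; the environment name is arbitrary, the activation path shown is for Linux/macOS with the Windows equivalent in a comment):

```shell
# Create an isolated environment whose pip and python always resolve internally.
python3 -m venv semantra-env
. semantra-env/bin/activate          # Windows: semantra-env\Scripts\activate

# Inside the venv, "python -m pip" is unambiguous -- no system pip involved.
python -m pip --version
# python -m pip install semantra torch --index-url https://download.pytorch.org/whl/cu117
```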
Hi, thanks for the great work and wonderful tool, and my apologies for asking two different questions in one thread. I was following the guides on how to use Semantra and got it running, but I couldn't find a guide on how to utilize my GPU (RTX 3070) instead of my old CPU. I have followed the link to get PyTorch running, but Semantra is still only utilizing my CPU.
I have also noticed that it takes up some disk space (~10 GB) after running quite a few PDF and text tests, and I was trying to figure out whether this space is used for cached embeddings and/or for trying different models. So far I couldn't find all the locations related to Semantra.
I really appreciate your help. Thanks