Error when running generate: "ImportError: libGL.so.1: cannot open shared object file: No such file or directory"

neilknowscomputers commented 6 months ago

I ran example.sh and got this error on the generate step

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ubuntu/tile2net/src/tile2net/__main__.py", line 4, in <module>
    from tile2net.tileseg.inference.inference import inference
  File "/home/ubuntu/tile2net/src/tile2net/tileseg/inference/inference.py", line 51, in <module>
    import tile2net.tileseg.network.ocrnet
  File "/home/ubuntu/tile2net/src/tile2net/tileseg/network/ocrnet.py", line 39, in <module>
    from tile2net.tileseg.utils.misc import fmt_scale
  File "/home/ubuntu/tile2net/src/tile2net/tileseg/utils/misc.py", line 41, in <module>
    import cv2
  File "/home/ubuntu/miniconda3/envs/testenv/lib/python3.11/site-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/home/ubuntu/miniconda3/envs/testenv/lib/python3.11/site-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/testenv/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

OpenCV has been installed via conda install opencv but I'm certain that is the heart of this error.

Please help! Thanks!

Mary-h86 commented 6 months ago

Did you follow the instructions, create a new environment, and install the requirements using pip? Was only cv2 installed using conda?

neilknowscomputers commented 6 months ago

I did, yes. I ran

conda create --name testenv python=3.11
conda activate testenv
python -m pip install -e .

per the readme. (I didn't notice that opencv-python was in the requirements.txt the first time)

then I ran

python -m tile2net generate -l '42.35555189953313, -71.07168915322092, 42.35364837213307, -71.06437423368418' -o /home/ubuntu/tile2net/poc -n example

I tried these steps again just now to be sure and got the above error.

python --version says Python 3.11.8

neilknowscomputers commented 6 months ago

One thing I noticed while troubleshooting is that PyTorch did not seem to be working with CUDA properly.

Specifically...

Python 3.11.8 (main, Feb 26 2024, 21:39:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>> torch.cuda.device_count()
/home/ubuntu/miniconda3/envs/testenv/lib/python3.11/site-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
0

Perhaps because

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

While this is certainly an issue, it's not clear to me if it's the cause of this particular issue. What do you think?

This was on an EC2 p2. I just tried on a g3 with a pre-configured deep learning OS image and it ran until it hit a segmentation fault! 😄

So the issue was clearly something to do with my setup, but I'm not sure whether it was hardware or config.

Mary-h86 commented 6 months ago

I am unable to recreate the issue. I cloned the repo, did a clean environment installation, and ran both example.sh and the command you sent. Both run without any problems. However, I am not entirely clear on your process. You mentioned you installed cv2 using conda. Why did you need to install it that way? Did you encounter any problems and have to reinstall? It should get installed through the package installation with python -m pip install -e .

The issue does not seem to be caused by Tile2Net. Did you try resolving the specific error? It seems like this can solve the cv2 issue:

sudo apt-get update
sudo apt-get install libgl1-mesa-glx
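
To check whether the system library is actually missing (and to confirm the fix afterwards), one option is to ask the dynamic loader directly instead of importing cv2. This is a minimal sketch; the helper name can_load is hypothetical and not part of Tile2Net or OpenCV.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the loader cannot find or open the file.
        return False
    return True


if __name__ == "__main__":
    # libGL.so.1 is the file named in the ImportError above.
    name = "libGL.so.1"
    status = "found" if can_load(name) else "missing"
    print(f"{name}: {status}")
```

If this reports the library as missing before the apt-get install and found afterwards, the cv2 import should no longer fail for this reason.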

Unfortunately I don't have any experience working with Amazon instances, but from the logs you sent, it is most probably a mismatch or misconfiguration between PyTorch, the CUDA version installed, and the NVIDIA driver on the EC2 instance.
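
A short report of what PyTorch itself sees can help narrow this down. The sketch below is an assumption about what is useful to collect, not an official diagnostic; the function name cuda_report is hypothetical. Note that pip builds of PyTorch generally ship their own CUDA runtime, so the nvcc release on the machine matters less than whether the NVIDIA driver is installed and working (nvidia-smi reports the driver).

```python
import importlib.util


def cuda_report() -> dict:
    """Collect basic facts about the local PyTorch/CUDA setup."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["built_with_cuda"] = torch.version.cuda
    # False when no usable driver/GPU is visible to PyTorch.
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A non-None built_with_cuda together with cuda_available reporting False usually points at the driver or the instance rather than at the Python environment.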

neilknowscomputers commented 6 months ago

I did install with pip, but I also installed opencv separately when I was troubleshooting. I mistakenly thought it wasn't included in the requirements file but now I see that it is. That extra step I took was just redundant.

Thank you for your help. I'll open a new issue for the segmentation fault I'm currently experiencing if that's ok.

Mary-h86 commented 6 months ago

Mixing conda and pip to install the same package is problematic in most cases. My suggestion is to create a new environment, install only the requirements, and test again. Feel free to open a new issue for the segmentation fault you are facing; however, I cannot help much if it is related to the AWS instances.
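
One way to spot a doubled-up OpenCV install is to list the OpenCV distributions registered in the active environment. This is a rough sketch, and the function name opencv_distributions is hypothetical. It only sees packages that register Python package metadata, so a conda-installed build may not show up here; running conda list opencv is the complementary check.

```python
from importlib import metadata


def opencv_distributions() -> list[tuple[str, str]]:
    """Return (name, version) for installed distributions named opencv*."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith("opencv"):
            found.append((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    for name, version in opencv_distributions():
        print(f"{name} {version}")
```

More than one entry (for example opencv-python alongside another opencv package) suggests the environment is worth recreating from scratch.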

VIDA-NYU / tile2net issue #58