cvlab-columbia / viper

Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"

Compilation issues after running setup.py inside GLIP directory #19

Open paolaos opened 1 year ago

paolaos commented 1 year ago

Hello,

I followed the instructions in the README to install and run viper. After I cd into GLIP and run python setup.py clean --all build develop --user, I get plenty of DeprecatedTypeProperties& at::Tensor::type() warnings, which generally show this message:

 warning: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). 

It does finish compiling with warnings, and when I try importing main_simple_lib in main_simple.ipynb I get (as expected) an error:

Loading BLIP...
2023-04-24 12:42:26.690282: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.6).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.6).  Contact the program author for an update.  If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library.  (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
Aborted

I believe the problem could be related to an incompatibility between libraries, but I made sure to install everything from requirements.txt and have an active conda environment, so at this point I do not really know what else it could be. Does anyone know why I could be getting this issue? Thanks in advance 🥲
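When a mismatch between libraries is suspected, it can help to print exactly what is installed in the active environment before changing anything. This is a generic stdlib-only sketch (not part of viper); the package names are just the ones discussed in this thread:

```python
from importlib import metadata

def report_versions(packages):
    """Return a {name: version-or-None} map for quick mismatch triage."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the active environment.
            versions[name] = None
    return versions

print(report_versions(["tensorflow", "protobuf", "wandb"]))
```

Running this inside and outside the conda environment makes it easy to spot a stray system-wide installation shadowing the environment's packages.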

surisdi commented 1 year ago

Hi Paola,

The installation of GLIP worked properly. The message you get is just a warning; you can safely ignore it.

Regarding the error with protobuf: is it possible that you had a previous installation of tensorflow in the environment, or did you create the environment from scratch? What tensorflow version do you see if you run pip show tensorflow?

Thanks

paolaos commented 1 year ago

Hi, thanks for your response!

Outside the environment, I have no installation of tensorflow. When I activate the conda environment and run pip show tensorflow, it appears that I have tensorflow 2.11.1. Is that how it should be?

surisdi commented 1 year ago

Yes, that is correct. Can you try this: uninstall tensorflow and then install it again?

paolaos commented 1 year ago

I did that, but I am still getting the same error when I try to do from main_simple_lib import * 😢

surisdi commented 1 year ago

Can you try installing the version of protobuf that the error says? pip install protobuf==3.9.2

paolaos commented 1 year ago

It got installed with some dependency errors:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
wandb 0.13.9 requires protobuf!=4.21.0,<5,>=3.19.0; python_version > "3.9" and sys_platform == "linux", but you have protobuf 3.9.2 which is incompatible.

Successfully installed protobuf-3.9.2

And now when I try running from main_simple_lib import * I get an AttributeError from wandb (which makes sense). Example:

    from wandb.proto.v3.wandb_base_pb2 import *
  File "/opt/conda/envs/vipergpt/lib/python3.10/site-packages/wandb/proto/v3/wandb_base_pb2.py", line 21, in <module>
    __RECORDINFO = DESCRIPTOR.message_types_by_name['_RecordInfo']
AttributeError: 'NoneType' object has no attribute 'message_types_by_name'

I will try changing the version of wandb to one that is compatible with the rest of the libraries.

paolaos commented 1 year ago

Hey, just an update :) I reinstalled wandb, and after running the library import in Python I got a different error (below), so I added the flag export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python based on the suggestion:

RuntimeError: Failed to import transformers.models.blip.modeling_blip because of the following error (look up to see its traceback):
Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
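For reference, this workaround only takes effect if the variable is set before the first protobuf import, so instead of exporting it in the shell it can also be set at the very top of the notebook. A minimal sketch (note the message's caveat that the pure-Python parser is much slower):

```python
import os

# Must run before anything imports google.protobuf, directly or via
# tensorflow/wandb/transformers; once the C++ parser is loaded, this
# variable has no effect for the rest of the process.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
```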

I then recompiled and re-ran the library import in Python. It now loads BLIP but shows different warnings:

Loading BLIP...

2023-04-28 14:42:58.012168: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-28 14:42:59.252120: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-04-28 14:42:59.252267: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2023-04-28 14:42:59.252287: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

BLIP loaded 

Thank you for your patience so far! It's never pleasant to debug library incompatibility issues.

surisdi commented 1 year ago

I will need some time to try to reproduce this, but for now it looks like you only have warnings, and not errors. Does it work for you now?

paolaos commented 1 year ago

So far, it does not. When I run the functions from main_simple_lib I get no errors but I also believe it is not doing anything. get_code, for example, is not returning anything. I will keep digging while you find the time to reproduce the error :)

surisdi commented 1 year ago

If you are using get_code with Codex (code-davinci-002), it will not work because Codex is deprecated by OpenAI. See this comment and this thread for alternatives.

iisxuwei commented 9 months ago

@paolaos Hi, did you ever solve this problem and manage to use viper afterwards? I recently ran into the same problem while configuring the environment and cannot solve it. Could you let me know if you found a good solution? Thanks in advance.

surisdi commented 9 months ago

Hi @iisxuwei, from the previous error/warning messages, which ones are you getting?

paolaos commented 9 months ago

Hi @iisxuwei @surisdi, apologies I forgot to respond. I moved on with another project because it was taking too much time and I only had a few days to test viper. Would be keen to know if you're able to make it work, though.

iisxuwei commented 9 months ago

Sorry for the late reply. I encountered several of the same problems as paolaos, but I solved them by adjusting the versions of bitsandbytes/protobuf and installing Rust. Anyway, those are solved now. But I'm not sure GLIP installed correctly. When I run python setup.py clean --all build develop --user, it doesn't seem to output the correct result, just a warning: cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++.

/root/miniconda3/lib/python3.8/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
/root/miniconda3/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running egg_info
writing maskrcnn_benchmark.egg-info/PKG-INFO
writing dependency_links to maskrcnn_benchmark.egg-info/dependency_links.txt
writing top-level names to maskrcnn_benchmark.egg-info/top_level.txt
reading manifest file 'maskrcnn_benchmark.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'maskrcnn_benchmark.egg-info/SOURCES.txt'
running build_ext
copying build/lib.linux-x86_64-cpython-38/maskrcnn_benchmark/_C.cpython-38-x86_64-linux-gnu.so -> maskrcnn_benchmark
Creating /root/.local/lib/python3.8/site-packages/maskrcnn-benchmark.egg-link (link to .)
Adding maskrcnn-benchmark 0.0.0 to easy-install.pth file

Installed /root/autodl-tmp/viper-main/GLIP
Processing dependencies for maskrcnn-benchmark==0.0.0
Finished processing dependencies for maskrcnn-benchmark==0.0.0

I then recompiled and ran the library import in Python, and it now shows the following error:

Loading checkpoint shards: 100%
2/2 [00:20<00:00, 9.69s/it]
Using cache found in /root/.cache/torch/hub/intel-isl_MiDaS_master
Using cache found in /root/.cache/torch/hub/intel-isl_MiDaS_master
VISION BACKBONE USE GRADIENT CHECKPOINTING:  False

(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-uncased/resolve/main/tokenizer_config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f61e05a6380>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: b06bc47d-e39d-4628-bd98-a841077af191)')' thrown while requesting HEAD https://huggingface.co/bert-base-uncased/resolve/main/tokenizer_config.json
(MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /bert-base-uncased/resolve/main/config.json (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f61e05a5960>, 'Connection to huggingface.co timed out. (connect timeout=10)'))"), '(Request ID: 15daa7c4-e7d8-404c-80c4-3be902613bb8)')' thrown while requesting HEAD https://huggingface.co/bert-base-uncased/resolve/main/config.json

From the results, the config file can't be found in the cache, and I can't connect to huggingface.co at this step for no reason. So I tried to use offline mode, but I can't find the path to bert-base-uncased. I'm wondering whether pytorch 1.13 is unsuitable for maskrcnn.
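Transformers does have an offline mode driven by environment variables. Assuming the bert-base-uncased files are already in the local cache (by default under ~/.cache/huggingface), setting these before any transformers import may avoid the network round-trips entirely; this is a sketch, not tested against this specific setup:

```python
import os

# Tell transformers / huggingface_hub to use only locally cached files
# instead of checking huggingface.co. Must be set before transformers
# is imported.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"
```

If the model is not yet cached, offline mode will fail instead; the files would first need to be downloaded once from a machine with access to huggingface.co.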

surisdi commented 6 months ago

Is the error related to maskrcnn? It looks like it may be related to the MiDaS model. May it be related to this issue?: https://github.com/huggingface/transformers/issues/17611#issuecomment-1323272726

adas598 commented 1 month ago

Hello,

When I run python setup.py clean --all build develop --user I get the following error. I am not sure whether this is an issue with the Python project or with my NVIDIA drivers. Any insight would be deeply appreciated.

/usr/bin/nvcc -DWITH_CUDA -I/home/das038/Documents/Mine/VIPERGPT/GLIP/maskrcnn_benchmark/csrc -I/home/das038/anaconda3/envs/vipergpt/lib/python3.10/site-packages/torch/include -I/home/das038/anaconda3/envs/vipergpt/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/das038/anaconda3/envs/vipergpt/lib/python3.10/site-packages/torch/include/TH -I/home/das038/anaconda3/envs/vipergpt/lib/python3.10/site-packages/torch/include/THC -I/home/das038/anaconda3/envs/vipergpt/include/python3.10 -c /home/das038/Documents/Mine/VIPERGPT/GLIP/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.cu -o build/temp.linux-x86_64-cpython-310/home/das038/Documents/Mine/VIPERGPT/GLIP/maskrcnn_benchmark/csrc/cuda/ROIAlign_cuda.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
error: command '/usr/bin/nvcc' failed with exit code 1

Thanks
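The parameter-pack errors in /usr/include/c++/11/bits/std_function.h are a known symptom of an older CUDA toolkit's nvcc compiling against GCC 11 headers, rather than a driver problem. A first step is to check which toolchain versions are actually on the PATH; this stdlib-only sketch prints them (a commonly reported workaround, assuming an older compiler is installed, is to point nvcc at it, e.g. with -ccbin g++-10):

```python
import shutil
import subprocess

def first_version_line(tool):
    """Return the first line of `tool --version`, or None if tool is not on PATH."""
    path = shutil.which(tool)
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else None

# Report the CUDA compiler and the host C/C++ compilers it will invoke.
for tool in ("nvcc", "gcc", "g++"):
    print(f"{tool}: {first_version_line(tool)}")
```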