SeldonIO / alibi-detect

Algorithms for outlier, adversarial and drift detection
https://docs.seldon.io/projects/alibi-detect/en/stable/

How can I set batch_size to reference input data in MMDDrift? I got cuda out of memory. #597

Open KevinRyu opened 2 years ago

KevinRyu commented 2 years ago

Hello,

I need to get predictions from MMDDrift on a large dataset (the sampling is iterated from 1000 to 1000000; the sizes are 1000, 1000, 10000, 500000 and 1000000). I am using an RTX 3080 (10 GB) and there is not enough GPU memory to allocate, giving the following error:

RuntimeError: CUDA out of memory. Tried to allocate 2.98 GiB (GPU 0; 10.00 GiB total capacity; 6.71 GiB already allocated; 1.39 GiB free; 6.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried the memory growth setting in TensorFlow and empty_cache() in PyTorch, but neither helped.
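
For reference, this is roughly what I tried (a sketch, not my exact script):

# Approximate reproduction of the two workarounds attempted: enable
# TensorFlow memory growth and clear the PyTorch CUDA cache between runs.
import tensorflow as tf
import torch

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

torch.cuda.empty_cache()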

Are there any methods or tricks? I would like to set a batch_size in the MMDDrift function, but I don't know whether this is possible or how to do it.

arnaudvl commented 2 years ago

Hi @KevinRyu , we just added a new backend to the MMDDrift detector (KeOps) which hopefully addresses your issue. We haven't released a new Alibi Detect version with it yet, so you would have to install from master: pip install git+https://github.com/SeldonIO/alibi-detect.git. Please check the docs and example for more detail and let me know if this helps. Alternatively, you could try the learned kernel MMD detector, which enables batched kernel matrix computation via the batch_size kwarg. Additionally, we can look into adding a batched version of the PyTorch and TensorFlow MMD detectors.
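
For example, a minimal sketch of switching to the KeOps backend (x_ref and x below are placeholder NumPy arrays; see the KeOps MMD example in the docs for all available options):

import numpy as np
from alibi_detect.cd import MMDDrift

# Placeholder reference and test sets: 100k instances each, 1 feature.
x_ref = np.random.randn(100_000, 1).astype(np.float32)
x = np.random.randn(100_000, 1).astype(np.float32)

# backend='keops' evaluates the kernel lazily in tiles, so the full
# (n_ref + n_test)^2 kernel matrix is never materialised on the GPU at once.
cd = MMDDrift(x_ref, backend='keops', p_val=.05)
preds = cd.predict(x)
print(preds['data']['is_drift'], preds['data']['p_val'])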

KevinRyu commented 2 years ago

Hi, arnaudvl

Thank you very much. By the way, when I try to install from git+https://github.com/SeldonIO/alibi-detect.git, I get UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 7112: illegal multibyte sequence. I want to try this method. Are there any tips?

ascillitoe commented 2 years ago

@KevinRyu are you using Windows by any chance? In any case, please could you share the full error traceback?

For example:

(base) E:\peakdet>python setup.py install
Traceback (most recent call last):
  File "setup.py", line 42, in <module>
    main()
  File "setup.py", line 19, in main
    ldict['LONG_DESCRIPTION'] = src.read()
UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 4006: illegal multibyte sequence

I suspect this issue is coming from a character in a README somewhere that Windows doesn't like (similar to this https://github.com/physiopy/peakdet/issues/21). The question is whether the issue is with our README.md or one of our dependencies' files.

KevinRyu commented 2 years ago

Hi,

Yes, I am using Windows 10 in my office. I'll also try the same task in an Ubuntu environment at home. Since I have just left the office, I can share the full error message tomorrow morning (in my time zone).

I will share my test result. Thank you.

KevinRyu commented 2 years ago

On my Ubuntu OS there is no UnicodeDecodeError. I could install the alibi-detect development version and alibi-detect[keops] without any errors, but I get another error when I run my Python code.

    <stdin>:1:10: fatal error: cuda.h: No such file or directory compilation terminated.
~/miniconda3/envs/alibidet/lib/python3.8/site-packages/alibi_detect/__init__.py in <module>
----> 1 from . import ad, cd, models, od, utils, saving
      2 from .version import __version__  # noqa F401
      3 
      4 __all__ = ["ad", "cd", "models", "od", "utils", "saving"]

~/miniconda3/envs/alibidet/lib/python3.8/site-packages/alibi_detect/cd/__init__.py in <module>
----> 1 from .chisquare import ChiSquareDrift
      2 from .classifier import ClassifierDrift
      3 from .ks import KSDrift
      4 from .learned_kernel import LearnedKernelDrift

---> 28     raise ValueError(message)
     29 
     30 

ValueError: [KeOps] Error : Error compiling formula. (error at line 41 in file /home/retros/miniconda3/envs/alibidet/lib/python3.8/site-packages/keopscore/utils/misc_utils.py)

My RTX 3080 Ti GPU itself seems to have no problem:

nvidia-smi

Tue Aug 23 21:20:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I suspect a CUDA version mismatch. Which version do I need to run MMD with KeOps? Otherwise, what can I do about this problem?

KevinRyu commented 2 years ago

Hi,

The following is the UnicodeDecodeError on my Windows 11 Pro machine:

pip install git+https://github.com/SeldonIO/alibi-detect.git
Collecting git+https://github.com/SeldonIO/alibi-detect.git
  Cloning https://github.com/SeldonIO/alibi-detect.git to c:\users\xxxxx\appdata\local\temp\pip-req-build-tr10rpjx
  Running command git clone -q https://github.com/SeldonIO/alibi-detect.git 'C:\Users\xxxxx\AppData\Local\Temp\pip-req-build-tr10rpjx'
  Resolved https://github.com/SeldonIO/alibi-detect.git to commit 0bbe586ff4ccce76795d01f9cde7940b205ba18e
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\xxxxx\anaconda3\envs\alibidet\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\xxxxx\\AppData\\Local\\Temp\\pip-req-build-tr10rpjx\\setup.py'"'"'; __file__='"'"'C:\\Users\\xxxxx\\AppData\\Local\\Temp\\pip-req-build-tr10rpjx\\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\xxxxx\AppData\Local\Temp\pip-pip-egg-info-s7sjfm_b'
         cwd: C:\Users\xxxxx\AppData\Local\Temp\pip-req-build-tr10rpjx\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\xxxxx\AppData\Local\Temp\pip-req-build-tr10rpjx\setup.py", line 47, in <module>
        long_description=readme(),
      File "C:\Users\xxxxx\AppData\Local\Temp\pip-req-build-tr10rpjx\setup.py", line 6, in readme
        return f.read()
    UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 7112: illegal multibyte sequence
    ----------------------------------------
WARNING: Discarding git+https://github.com/SeldonIO/alibi-detect.git. Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

KevinRyu commented 2 years ago

I could run MMDDrift with the KeOps backend, but CUDA_ERROR_OUT_OF_MEMORY still occurs. Are there any options (something like a memory refresh, etc.) or KeOps backend settings related to OOM? For now I will try the learned kernel MMD detector instead of the KeOps backend; a rough sketch of what I plan to run is below, after the error output.

Error:

 0%|          | 0/3 [00:00<?, ?it/s]
[KeOps] error: cuLaunchKernel(kernel, gridSize_x, gridSize_y, gridSize_z, blockSize_x, blockSize_y, blockSize_z, blockSize_x * dimY * sizeof(TYPE), NULL, kernel_params, 0) failed with error CUDA_ERROR_OUT_OF_MEMORY

  0%|          | 0/3 [00:00<?, ?it/s]
(10000, 1) (10000, 1)   ---> The shape of dataset

...
---> 42         self.launch_keops(
     43             self.params.tagHostDevice,
     44             self.params.dimy,

RuntimeError: [KeOps] Cuda error.
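
For reference, the learned kernel alternative I plan to try looks roughly like this (a sketch with a placeholder projection network for my (n, 1) data; the exact kwargs are in the LearnedKernelDrift docs):

import numpy as np
import torch.nn as nn
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.pytorch.kernels import DeepKernel

# Placeholder reference data with the same (n, 1) shape as my dataset.
x_ref = np.random.randn(10_000, 1).astype(np.float32)

# Small projection network for 1-D inputs; DeepKernel wraps it into a trainable kernel.
proj = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 8))
kernel = DeepKernel(proj, eps=0.01)

# batch_size is the kwarg mentioned earlier in the thread.
cd = LearnedKernelDrift(x_ref, kernel, backend='pytorch', p_val=.05,
                        epochs=2, batch_size=1000)
preds = cd.predict(np.random.randn(10_000, 1).astype(np.float32))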

ascillitoe commented 2 years ago

Hi @KevinRyu, thank you for all this info, I will investigate in detail and get back to you.

p.s. I added code blocks into your comments just so I could read the error messages a little easier, hope you don't mind.

ascillitoe commented 2 years ago

Hi again @KevinRyu, to kick things off: for the UnicodeDecodeError on Windows, I suspect this is caused by the following snippet in our setup.py, which reads README.md:

def readme():
    with open("README.md") as f:
        return f.read()

which is used to fill in the library's description on our PyPI page. Your traceback makes me think the open command is attempting to decode the README.md file with the cp949 codec (which I believe is for the Korean alphabet?). However, strangely I haven't been able to replicate your error by running:

PYTHONIOENCODING=cp949 pip install git+https://github.com/SeldonIO/alibi-detect.git

Nevertheless, please would you be able to try running the following? (notice I also added the keops bit on the end as I assumed you will also want that)

PYTHONIOENCODING=utf8 pip install git+https://github.com/SeldonIO/alibi-detect.git#egg=alibi-detect[keops]

I've provisionally opened an issue (https://github.com/SeldonIO/alibi-detect/issues/600) to enforce utf8 to be used here. We will do this if the above fixes the issue for you.
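
The fix would presumably be something along these lines (a sketch, pending that issue):

def readme():
    # Read with an explicit encoding so the result does not depend on the
    # platform's default codec (e.g. cp949 on a Korean-locale Windows machine).
    with open("README.md", encoding="utf-8") as f:
        return f.read()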

ascillitoe commented 2 years ago

For the next issue (your cuda.h one), did you fix this one? If not (and for future reference), I agree that it is probably a cuda version mismatch.

Our keops backend is built on pykeops, which requires a working CUDA toolkit installation. Two common issues are:

  1. A version mismatch between the cuda dev stack (shown by nvcc -V) and the cuda driver version (shown by nvidia-smi). From the debugging you've done it looks like you're already aware of this one.

  2. Everything can be installed properly, but the cuda header files such as cuda.h could be located in a non-standard location. Here it might just be a case of locating them and updating your CUDA_PATH env variable.

This issue https://github.com/getkeops/keops/issues/257 has some good tips for investigating CUDA problems. The problem (and solution) can depend on your setup, e.g. a managed cluster vs. a personal machine, but to set up a clean, isolated CUDA environment I tend to have success using conda to install cudatoolkit-dev.
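
As a quick sanity check once the toolkit is sorted out, pykeops ships its own binding tests (a sketch, assuming a recent pykeops version):

import pykeops

# Compile and run a tiny formula through the NumPy and PyTorch bindings;
# both should report success if the CUDA toolkit is correctly set up.
pykeops.test_numpy_bindings()
pykeops.test_torch_bindings()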

ascillitoe commented 2 years ago

Re your final CUDA_OUT_OF_MEMORY issue, would you be able to let us know the dimensions of your reference and test data, please? (the data you give to MMDDrift() and the data you give to predict()). Plus, can you let us know the args and kwargs you are giving to MMDDrift please? i.e. MMDDrift(x_ref, p_val=..., backend=...) etc.

KevinRyu commented 2 years ago

Hi, Thank you so much.

  1. UnicodeError: I solved this issue by changing a Windows 11 setting: I turned on the option for Unicode UTF-8 support for worldwide languages in the locale section of the Control Panel. My Ubuntu machine had no problem. As a result, I was able to install without the Unicode error. Anyway, I will test "PYTHONIOENCODING=cp949 pip install git+https://github.com/SeldonIO/alibi-detect.git" as you suggested and share the result tomorrow. Thank you!

  2. Errors related to KeOps: several critical and different errors occurred on Ubuntu and Windows 11. I have now resolved them and was able to run my drift detection code. I installed a matching version of the CUDA toolkit. Also, an fcntl.py file was not found, so I wrote this file and saved it into the proper folder. Other errors occurred too, but I don't remember them all; in any case, they were all related to pykeops.

    By the way, the OOM issue is still not solved. I know my data size is very large, so I wanted to use a batch mode. I don't remember the exact size, but I saw the required allocation in the error message; it might have been over 80 GiB, and the size was different for each framework (TensorFlow and PyTorch).

    I am using sampled datasets from sklearn.datasets. The sample sizes are [1000, 10000, 100000, 500000, 1000000]. Using a for loop, each size of data is computed on the GPU as tensors. On my RTX 3080 (10 GB) and RTX 3080 Ti (12 GB), 10000 is the maximum size allowed; the error occurs if the size is over 10000.
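
Roughly, my loop looks like this (a sketch; make_regression here is just a stand-in for the actual sklearn generator I use):

import numpy as np
from sklearn.datasets import make_regression
from alibi_detect.cd import MMDDrift

# Run the detector at each sample size; OOM appears once n exceeds 10000.
for n in [1000, 10000, 100000, 500000, 1000000]:
    x_ref, _ = make_regression(n_samples=n, n_features=1)
    x, _ = make_regression(n_samples=n, n_features=1)
    cd = MMDDrift(x_ref.astype(np.float32), backend='pytorch', p_val=.05)
    preds = cd.predict(x.astype(np.float32))
    print(n, preds['data']['is_drift'])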

I will read your comments carefully again when I have time and share any missing information. Have a nice day!

arnaudvl commented 2 years ago

Hi @KevinRyu , just a quick note on the OOM issues. The basic PyTorch implementation of the MMD detector on your RTX 3080 Ti will likely hit OOM from around 20k instances (so 20k in the reference and 20k in the test set, for a total of 40k). KeOps, on the other hand, might work fine with even 500k instances each for the reference and test sets (so 1 million in total) on the same GPU.

KevinRyu commented 2 years ago

Hi. I used the memory growth option with the TensorFlow backend and cuda.empty_cache() + gc with the torch backend, but I get the OOM error if the size is over 10000; 10k reference and 10k test instances was the limit. It's strange. Anyway, thank you for the information about KeOps's performance, but what can I do? I will share the error message, including the reserved and required memory sizes, tomorrow. In my case, KeOps shows similar memory errors to the TensorFlow and PyTorch backends: almost 10 GB of memory is reserved by torch and 2 GB is left, which is not enough for my datasets.

Can you advise what I should check?