Xtra-Computing / thundersvm

ThunderSVM: A Fast SVM Library on GPUs and CPUs
Apache License 2.0
1.56k stars · 216 forks

[WinError -529697949] Windows Error 0xe06d7363 #159

Open giowesome opened 5 years ago

giowesome commented 5 years ago

Hi, I believe I am encountering a memory error; what puzzles me most is its erratic behavior. I was able to run the following code several times (finding the right setting for max_mem_size by trial and error), but now it is not working anymore.

The total memory size of X_train and y_train is 3.05 GB. I have 64 GB of RAM and am working on an i9-9900X (10 physical cores) with 2x RTX 2080 Ti.

These are the two lines I am trying to run:

%%time
# this doesn't crash with max_mem_size=50000 and is very fast; crashes with max_mem_size=60000. Doesn't seem to use the GPU
model = SVC(random_state=2,n_jobs=-1,gpu_id=0,max_mem_size=50000)
model.fit(X_train,y_train)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<timed exec> in <module>

~\Anaconda3\lib\site-packages\thundersvm\thundersvmScikit.py in fit(self, X, y)
    111         if self.max_mem_size != -1:
    112             thundersvm.set_memory_size(c_void_p(self.model), self.max_mem_size)
--> 113         fit(X, y, solver_type, kernel)
    114         if self._train_succeed[0] == -1:
    115             print ("Training failed!")

~\Anaconda3\lib\site-packages\thundersvm\thundersvmScikit.py in _dense_fit(self, X, y, solver_type, kernel)
    188             self.verbose, self.max_iter, self.n_jobs, self.max_mem_size,
    189             self.gpu_id,
--> 190             n_features, n_classes, self._train_succeed, c_void_p(self.model))
    191         self.n_features = n_features[0]
    192         self.n_classes = n_classes[0]

OSError: [WinError -529697949] Windows Error 0xe06d7363

Another detail I noticed: before, I was getting this error quite quickly, whereas now it crashes after about 30 minutes. Related: #115

zeyiwen commented 5 years ago

Thanks for the feedback! We will look into it and get back to you.

QinbinLi commented 5 years ago

Hi, @giowesome

How did you compile the library? Did you use the GPU version? The unit for max_mem_size is MB, so if you use the GPU, the limit you set (about 50 GB) is much bigger than your GPU memory.
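As a rough sizing sketch (an assumption for illustration, not an official recommendation): an RTX 2080 Ti has about 11 GB of device memory, and since max_mem_size is in MB, the value should stay well below that, e.g.:

```python
# Hypothetical sizing sketch: an RTX 2080 Ti has ~11 GB of device memory.
# max_mem_size is in MB, so keep it below the card's capacity,
# leaving headroom for the CUDA context and kernels.
gpu_memory_mb = 11 * 1024                  # ~11264 MB on a 2080 Ti
max_mem_size = int(gpu_memory_mb * 0.8)    # 80% as a conservative cap
print(max_mem_size)                        # -> 9011
# model = SVC(gpu_id=0, max_mem_size=max_mem_size)
```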

giowesome commented 5 years ago

Well thought, I actually built the GPU version just yesterday (the whole cmake procedure and VS files), so I believe I initially tested on the CPU (when I found that 50 GB was the memory limit) and then moved to the GPU without updating the memory size limit. I'm away next week; I'll change the setting and test the solution once back.


giowesome commented 5 years ago

Hi @GODqinbin, @zeyiwen. I tried reducing max_mem_size (to 10,000) as suggested, but unfortunately I still get the same error. I also tried reducing the number of cores used, but that didn't help either.

EDIT: I just realized I never answered your first question about how I compiled the library.

I initially just installed the python wheel [following this link](https://www.comp.nus.edu.sg/~wenzy/pip-pack/svm/thundersvm-cu10-0.2.0-py3-none-win_amd64.whl).

This was working but it seemed to me that the GPU was not used at all so I followed this. Particularly, these are the steps I took:

git clone https://github.com/zeyiwen/thundersvm

cd thundersvm
mkdir build
cd build
cmake ..  -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=TRUE -DBUILD_SHARED_LIBS=TRUE -G "Visual Studio 16 2019"

In the build directory I opened ALL_BUILD.vcxproj.

Then I right-clicked on Solution 'thundersvm' (5 of 5 projects) and chose Build Solution.

giowesome commented 5 years ago

Hi @GODqinbin, @zeyiwen. So I removed the pip package (pip uninstall thundersvm), cleaned the build from Visual Studio, then reinstalled only the python wheel and ran some tests.
What I have found so far is that thundersvm is able to use the GPU but still shows erratic behavior.

Specifically, running:

%%time
model = SVC(gpu_id=0,max_mem_size=10000)
model.fit(X_train[:1000,:10],y_train[:1000])

Led to the following different cases:

This also makes me believe that when trying to run the full dataset the code stops before even getting to use the GPU.

I tested this both in Jupyter and Spyder.

[Three screenshots showing the different outcomes were attached here.]

QinbinLi commented 5 years ago

Hi, @giowesome

Sorry for our late reply. It is strange that the code runs into different cases with the same instructions.

  1. Can you share the data set so that we can reproduce the case? We'll test it as soon as possible.

  2. Can you remove the pip package and build from Visual Studio? Then you can try the same instructions under the "thundersvm/python" directory. The python wheel may not be updated to the newest version in a timely manner.

Thanks!

giowesome commented 5 years ago

Hi @GODqinbin. Unfortunately I can't share the dataset, but I'll try to replicate it, and if the error persists I will share the synthetic replica.

I will try solution 2 now, but could you please expand a little (or provide a link with an explanation) on how to run the instructions if the package is not installed through pip? I have never done it, so I am not sure how to proceed.

Thanks, G.

QinbinLi commented 5 years ago

Hi @giowesome

For example, you can create a python file under "thundersvm/python" directory and run it in Windows command line. More specifically, you can refer to the doc. There is an example under the "Scikit-learn wrapper interface" section. Thanks.
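The suggested setup could look something like the sketch below (my sketch, not the doc's exact example). Since thundersvm's SVC mirrors the scikit-learn interface, sklearn.svm.SVC is used here as a stand-in so the snippet runs anywhere; to exercise thundersvm itself, replace the import with `from thundersvm import SVC` and run the file from the thundersvm/python directory:

```python
# Minimal sketch of the scikit-learn wrapper usage described in the doc.
# sklearn.svm.SVC is a stand-in with the same interface; swap the import
# for `from thundersvm import SVC` to test the locally built library.
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # small built-in dataset for a smoke test
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
print(clf.score(X, y) > 0.9)          # -> True (fits the training set well)
```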

giowesome commented 5 years ago

Hi @GODqinbin, I removed the pip installation and now I am only running from the Visual Studio build.

I replicated the dataset as follows and the problem still persists.

import sys
import time
import pandas as pd
import numpy as np

Xsynth1=pd.DataFrame(np.random.randint(0,2,size=(6136780,171)).astype('uint8'))
Xsynth2=pd.DataFrame(np.random.randint(0,2,size=(6136780,14)).astype('bool'))
Xsynth3=pd.DataFrame(np.random.uniform(0,2,size=(6136780,36)).astype('float32'))
Xsynth4=pd.DataFrame(np.random.randint(0,50,size=(6136780,19)).astype('int64'))
Xsynth=pd.concat([Xsynth1,Xsynth2,Xsynth3,Xsynth4],axis=1)
#Xsynth.index=X_train.index
ysynth=pd.Series(np.random.randint(0,2,size=(6136780)).astype('int64')) #,index=X_train.index)

import gc
gc.enable()
del Xsynth1,Xsynth2,Xsynth3,Xsynth4
gc.collect()

""" the next two lines return, respectively: # 3.0 and 0.1 to me.
They will be 2.95 and 0.05 in your case as your index is different.
I believe this won't make any difference in replicating the issue.
"""
round(sys.getsizeof(Xsynth) / 1e9, 2) 
round(sys.getsizeof(ysynth) / 1e9, 2) 

from thundersvm import SVC

start=time.time()

model = SVC(random_state=50,n_jobs=2,gpu_id=1,max_mem_size=11000) 
try:
    model.fit(Xsynth,ysynth) 
    end=time.time()
    print("Completed in %f min" % ((end-start)/60))
except Exception as e:
    end=time.time()
    print(e)
    print("Failed in %f min" % ((end-start)/60))

I also tried using only one core and reducing max_mem_size all the way down to 4000 (as well as setting it all the way up to 50000). Nothing has worked so far. Depending on the number of cores and the memory size selected, the process takes a different amount of time to crash, but it always crashes without engaging the GPU. The only thing that goes up is the RAM consumption.
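One hypothesis consistent with the RAM growth before the crash (my own rough estimate, not confirmed behavior of the library): if the wrapper densifies the input to float64 before the native call, the host-side copy alone would be very large for this dataset:

```python
# Rough, hypothetical estimate of a dense float64 copy of the synthetic
# dataset above (column counts taken from the replica code).
rows = 6_136_780
cols = 171 + 14 + 36 + 19              # 240 columns total
bytes_needed = rows * cols * 8         # 8 bytes per float64 element
print(round(bytes_needed / 1e9, 1))    # -> 11.8 (GB)
```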

zeyiwen commented 5 years ago

Thanks for that. We will investigate this later. We are currently working on paper deadlines.