SJ001 / AI-Feynman

MIT License

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first. #50

Open sirisian opened 3 years ago

sirisian commented 3 years ago

I just set up a fresh Ubuntu 20.04.3 LTS install, installed the drivers, and checked that PyTorch was using CUDA; everything seemed fine.

The error is in S_run_aifeynman.py, line 85: idx_min = np.argmin(np.array([symmetry_plus_result.......

This error occurs after all the brute-force lines. I'm not familiar with NumPy or PyTorch, so hopefully this is an obvious error on my part. This is the command I used to install PyTorch:

pip3 install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio===0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Then I followed the notebook on the site linked from the README. I'm using my own data, which is similar to their example. Did something change recently that would cause this error with the code? Do I need to use an older PyTorch? (For reference, I'm using a 3090, so I believe I need CUDA 11.1 or higher.)

sirisian commented 3 years ago

Had more time to test. The README is wrong to reference the Medium article that changes the head. I used the package and it worked fine.

One minor change: PyTorch 1.9.1 throws a CUDA error: no kernel image available for execution on the device. I switched to 1.8.2 and it worked fine.

GPU usage is really low though, around 7% during training. Is that normal?

sirisian commented 3 years ago

I spoke too soon. It was training fine for hours then:

Complexity #  MDL Loss #  Expression
0.0 27.1 0.000000000000+x2

Training a NN on the data... 

NN loss:  (tensor(0.0002, device='cuda:0', grad_fn=<DivBackward0>), SimpleNet(
  (linear1): Linear(in_features=9, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=128, bias=True)
  (linear3): Linear(in_features=128, out_features=64, bias=True)
  (linear4): Linear(in_features=64, out_features=64, bias=True)
  (linear5): Linear(in_features=64, out_features=1, bias=True)
)) 

Checking for symmetries...

Checking for separabilities...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 274, in run_aifeynman
    PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 96, in run_AI_all
    idx_min = np.argmin(np.array([symmetry_plus_result[0], symmetry_minus_result[0], symmetry_multiply_result[0], symmetry_divide_result[0], separability_plus_result[0], separability_multiply_result[0]]))
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/tensor.py", line 621, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

sirisian commented 3 years ago

I fixed this, and I'm running a very long test to see if there are any other issues.

My fix was to go into S_symmetry.py and S_separability.py and, before the lines with return min_error..., add:

if is_cuda and isinstance(min_error, torch.Tensor):
    min_error = min_error.cpu()

When I'm back in Windows I'll open a pull request with the changes. I should say that I don't program in Python or PyTorch, so does this fix look right? From what I can tell, min_error is on the GPU when CUDA is used and needs to be moved back to host memory before NumPy can work with it. @SJ001, do you not get this issue when using CUDA? It seems strange that I'm the only one who sees this bug, unless everyone else is running on CPU.
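
For anyone who would rather not patch every return statement, here is a rough sketch of the same idea applied at the call site instead. It is an alternative to the fix above, not the package's own code. It assumes each *_result[0] is either a plain number or a single-element tensor, and to_host_floats is a hypothetical helper name. It uses duck typing rather than importing torch, so it also accepts plain floats on CPU-only runs.

```python
def to_host_floats(values):
    """Convert a list of scalar errors to plain Python floats.

    Hypothetical helper (not part of aifeynman). Each entry may be a
    plain number or a single-element torch tensor on any device.
    """
    host_values = []
    for value in values:
        # torch tensors expose detach() and cpu(); plain numbers do not.
        if hasattr(value, "detach") and hasattr(value, "cpu"):
            # detach() drops the autograd graph, cpu() copies the data to
            # host memory (a no-op if it is already there), and item()
            # extracts the single element as a Python number.
            value = value.detach().cpu().item()
        host_values.append(float(value))
    return host_values
```

The returned list holds only host-side floats, so wrapping the list in the failing line with to_host_floats(...) before it reaches np.array and np.argmin should work whether or not CUDA is in use.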

sirisian commented 3 years ago

I let it run on my dataset, and after hours it hangs indefinitely at this point:

aifeynman.run_aifeynman("/home/.../Desktop/test/", "data.txt", 480, "7ops.txt", polyfit_deg=3, NN_epochs=10000)

....

Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_tan/data.txt_train-translated_plus-gen_sym-translated_plus
Rejection threshold.....    10.000
Bit margin..............     0.000
Number of variables.....       6
Functions used..........               +*/>~R0
 Arity            0 : 0abcdef
 Arity            1 : >~R
 Arity            2 : +*/
Loading mystery data....
      506880  rows read from file mystery.dat
Number of examples......  506880
Removing problematically
   1587689.6939376462        1587.6897693488602     
      506713  out of       506880  data points discarded for being too close to zero
Shuffling mystery data..
 Searching for best fit...
     29.788690328314      0.002277975996                      a               2             1.0000          4975.7113            25.8922           226.0018             3.1685            84.0000
     29.779800338555      0.014233059519                     cR              25             4.6439          4977.8705            25.8918           225.9984             3.1907           147.0800
     29.775891131135      0.015494992578                     fR              28             4.8074          4977.3812            25.8928           226.0109             3.2050           149.2143
     29.773613671717      0.001402931063                    fa+              42             5.3923          4977.5858            25.8922           226.0033             3.2098           151.1905
     29.772385420114      0.001761270442                    fc+              56             5.8074          4977.7957            25.8915           225.9956             3.2115           155.1429
     29.764185792013      0.000078935191                    aa*              86             6.4263          4977.0453            25.8912           225.9903             3.2276           143.8372
     29.730417191761      0.032593591692                    ac/             149             7.2192          4972.1988            25.8942           226.0207             3.3104           136.9195
     29.710088371075     -0.026905270931                   fd~+             322             8.3309          4969.9157            25.8987           226.1149             3.3127           143.6118
     29.697570938432      0.016818717993                  df~>+            4933            12.2682          4971.7626            25.8961           226.0702             3.3687           133.6975
     29.690145638461     -0.018553622986                 eca~+/           23022            14.4907          4972.7450            25.8906           225.9850             3.4014           133.8753
     29.689609162648      0.001576072744                 eac/*>           36826            15.1684          4973.3332            25.8941           226.0200             3.4604           133.9103
     29.689072004598      0.326721197302                 fd>~/>           58506            15.8363          4973.9113            25.8965           226.0790             3.3872           132.4078
     29.688480787030      0.309902479310                 f~>d/>          141939            17.1149          4975.0912            25.8965           226.0795             3.3886           131.7140
     29.685064936527     -0.016661307256                fdfR+~+          247142            17.9150          4975.3208            25.8995           226.1077             3.4284           128.6157
     29.681217801006     -0.021084561074                eca~+/>          277416            18.0817          4974.8451            25.8892           225.9686             3.4085           128.4989
     29.647439507779      0.508116382065                df~>+a/          666496            19.3462          4970.4686            25.8992           226.1146             3.4624           126.0836
     29.595071261945     -0.034777319032               dbaf~/*+         1532305            20.5473          4962.9242            25.9012           226.2784             3.4518           126.8539
     29.573903513664     -0.029834165542              dbfa>~//+        16440576            23.9708          4962.8126            25.9018           226.2677             3.5392           123.2495
     29.573903513664     -0.029834165542              dba>f~/*+        26299866            24.6486          4963.4904            25.9018           226.2677             3.5392           122.2774
Checking polyfit 

Pareto frontier in the current branch:

Complexity #  MDL Loss #  Expression
0.0 27.09 0.000000000000+x5

Found pretrained NN 

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 273, in run_aifeynman
    PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 179, in run_AI_all
    PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 241, in run_AI_all
    PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 179, in run_AI_all
    PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 71, in run_AI_all
    model_feynman = NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_NN_train.py", line 130, in NN_train
    loss = rmse_loss(model_feynman(fct),prd)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_NN_train.py", line 96, in forward
    x = F.tanh(self.linear1(x))
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)

If you need more data just ask.

ParticleTruthSeeker commented 2 years ago

Thanks for documenting this. The package could be useful if we can work through these issues.

LostArkRaider commented 1 week ago

Hi Sirisian, were you able to work around this issue? I am getting the same error message:

  File "/home/ubuntu/anaconda3/envs/feyn/lib/python3.9/site-packages/torch/_tensor.py", line 1149, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

I'm new to PyTorch, so any help you can offer will be appreciated.

Thank you