Open sirisian opened 3 years ago
Had more time to test. The readme is wrong to reference the medium article that changes the head. I used the package and it worked fine.
One minor change: PyTorch 1.9.1 throws "CUDA error: no kernel image is available for execution on the device". I switched to 1.8.2 and it worked fine.
GPU usage is really low though, around 7% during training. Is that normal?
I spoke too soon. It was training fine for hours then:
Complexity # MDL Loss # Expression
0.0 27.1 0.000000000000+x2
Training a NN on the data...
NN loss: (tensor(0.0002, device='cuda:0', grad_fn=<DivBackward0>), SimpleNet(
(linear1): Linear(in_features=9, out_features=128, bias=True)
(linear2): Linear(in_features=128, out_features=128, bias=True)
(linear3): Linear(in_features=128, out_features=64, bias=True)
(linear4): Linear(in_features=64, out_features=64, bias=True)
(linear5): Linear(in_features=64, out_features=1, bias=True)
))
Checking for symmetries...
Checking for separabilities...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 274, in run_aifeynman
PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 96, in run_AI_all
idx_min = np.argmin(np.array([symmetry_plus_result[0], symmetry_minus_result[0], symmetry_multiply_result[0], symmetry_divide_result[0], separability_plus_result[0], separability_multiply_result[0]]))
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/tensor.py", line 621, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
I fixed this, and I'm running a very long test to see if there's any other issues.
My fix was to go into S_symmetry.py and S_separability.py and, before the lines with return min_error..., add:
if is_cuda and isinstance(min_error, torch.Tensor):
    min_error = min_error.cpu()
When I'm back in Windows I'll do a pull request with the changes. I will say though I don't program python or torch, so does this fix look right? From what I can tell the min_error is on the GPU with CUDA and needs to be moved back before numpy can work with it. @SJ001 do you not get this issue when using CUDA? Seems strange that I'm the only one that sees this bug unless everyone else is using their CPUs?
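For anyone hitting the same TypeError, here is a minimal sketch of the idea behind that fix, applied at the np.argmin call instead of inside each S_*.py file. The helper names to_host_float and index_of_smallest_error are made up for illustration and are not part of aifeynman. It assumes each *_result is a sequence whose first element is either a plain number or a single-element tensor.

```python
import numpy as np


def to_host_float(value):
    """Return value as a plain Python float.

    Handles Python numbers, NumPy scalars, and tensor-like objects that
    expose detach()/cpu()/item(), as torch.Tensor does.
    """
    if hasattr(value, "detach"):
        # detach() drops the autograd graph, cpu() copies the data to host
        # memory (no copy if it is already there), and item() extracts the
        # single element as a Python number.
        return float(value.detach().cpu().item())
    return float(value)


def index_of_smallest_error(results):
    """Return the index of the result whose first element (the error) is smallest."""
    errors = [to_host_float(result[0]) for result in results]
    return int(np.argmin(np.array(errors)))
```

Converting at the point where NumPy takes over means the individual symmetry and separability functions can keep returning whatever they computed, on whichever device. Whether this or the per-file .cpu() call is the better fit is up to the maintainers.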
I let it run with my dataset and after hours it hangs forever at this point:
aifeynman.run_aifeynman("/home/.../Desktop/test/", "data.txt", 480, "7ops.txt", polyfit_deg=3, NN_epochs=10000)
....
Trying to solve mysteries with brute force...
Trying to solve results/mystery_world_tan/data.txt_train-translated_plus-gen_sym-translated_plus
Rejection threshold..... 10.000
Bit margin.............. 0.000
Number of variables..... 6
Functions used.......... +*/>~R0
Arity 0 : 0abcdef
Arity 1 : >~R
Arity 2 : +*/
Loading mystery data....
506880 rows read from file mystery.dat
Number of examples...... 506880
Removing problematically
1587689.6939376462 1587.6897693488602
506713 out of 506880 data points discarded for being too close to zero
Shuffling mystery data..
Searching for best fit...
29.788690328314 0.002277975996 a 2 1.0000 4975.7113 25.8922 226.0018 3.1685 84.0000
29.779800338555 0.014233059519 cR 25 4.6439 4977.8705 25.8918 225.9984 3.1907 147.0800
29.775891131135 0.015494992578 fR 28 4.8074 4977.3812 25.8928 226.0109 3.2050 149.2143
29.773613671717 0.001402931063 fa+ 42 5.3923 4977.5858 25.8922 226.0033 3.2098 151.1905
29.772385420114 0.001761270442 fc+ 56 5.8074 4977.7957 25.8915 225.9956 3.2115 155.1429
29.764185792013 0.000078935191 aa* 86 6.4263 4977.0453 25.8912 225.9903 3.2276 143.8372
29.730417191761 0.032593591692 ac/ 149 7.2192 4972.1988 25.8942 226.0207 3.3104 136.9195
29.710088371075 -0.026905270931 fd~+ 322 8.3309 4969.9157 25.8987 226.1149 3.3127 143.6118
29.697570938432 0.016818717993 df~>+ 4933 12.2682 4971.7626 25.8961 226.0702 3.3687 133.6975
29.690145638461 -0.018553622986 eca~+/ 23022 14.4907 4972.7450 25.8906 225.9850 3.4014 133.8753
29.689609162648 0.001576072744 eac/*> 36826 15.1684 4973.3332 25.8941 226.0200 3.4604 133.9103
29.689072004598 0.326721197302 fd>~/> 58506 15.8363 4973.9113 25.8965 226.0790 3.3872 132.4078
29.688480787030 0.309902479310 f~>d/> 141939 17.1149 4975.0912 25.8965 226.0795 3.3886 131.7140
29.685064936527 -0.016661307256 fdfR+~+ 247142 17.9150 4975.3208 25.8995 226.1077 3.4284 128.6157
29.681217801006 -0.021084561074 eca~+/> 277416 18.0817 4974.8451 25.8892 225.9686 3.4085 128.4989
29.647439507779 0.508116382065 df~>+a/ 666496 19.3462 4970.4686 25.8992 226.1146 3.4624 126.0836
29.595071261945 -0.034777319032 dbaf~/*+ 1532305 20.5473 4962.9242 25.9012 226.2784 3.4518 126.8539
29.573903513664 -0.029834165542 dbfa>~//+ 16440576 23.9708 4962.8126 25.9018 226.2677 3.5392 123.2495
29.573903513664 -0.029834165542 dba>f~/*+ 26299866 24.6486 4963.4904 25.9018 226.2677 3.5392 122.2774
Checking polyfit
Pareto frontier in the current branch:
Complexity # MDL Loss # Expression
0.0 27.09 0.000000000000+x5
Found pretrained NN
^CTraceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 273, in run_aifeynman
PA = run_AI_all(pathdir,filename+"_train",BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA=PA)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 179, in run_AI_all
PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 241, in run_AI_all
PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 179, in run_AI_all
PA1 = run_AI_all(new_pathdir,new_filename,BF_try_time,BF_ops_file_type, polyfit_deg, NN_epochs, PA1_)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_run_aifeynman.py", line 71, in run_AI_all
model_feynman = NN_train(pathdir,filename,NN_epochs/2,lrs=1e-3,N_red_lr=3,pretrained_path="results/NN_trained_models/models/" + filename + "_pretrained.h5")
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_NN_train.py", line 130, in NN_train
loss = rmse_loss(model_feynman(fct),prd)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/aifeynman/S_NN_train.py", line 96, in forward
x = F.tanh(self.linear1(x))
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/home/.../Desktop/AI-Feynman/feyn/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
If you need more data just ask.
Thanks for documenting this. The package could be useful if we can work through these issues.
Hi Sirisian, were you able to work around this issue? I am getting the same error message:
File "/home/ubuntu/anaconda3/envs/feyn/lib/python3.9/site-packages/torch/_tensor.py", line 1149, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
I'm new to PyTorch, so any help you can offer will be appreciated.
Thank you
I just created a fresh Ubuntu 20.04.3 LTS install and installed drivers and checked that pytorch was using CUDA and everything seems fine.
It's in S_run_aifeynman.py line 85: idx_min = np.argmin(np.array([symmetry_plus_result.......
This error occurs after all the brute force lines. I'm not familiar with numpy or pytorch, so hopefully this is an obvious error on my part? This is the command I used to get pytorch
Then I followed that notebook on the linked site on the README. I'm using my own data that is similar to their example. Did something perhaps change recently that would cause this error with the code? Do I need to use an old pytorch? (I have a 3090 that I'm using for reference since I believe I need to use CUDA 11.1 or higher).
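On the question of which PyTorch build matches the GPU, a rough diagnostic sketch follows. The function name describe_cuda_support is made up for illustration. It only reports what the installed build says about itself and does not touch aifeynman. If device_arch (sm_86 for an RTX 3090) does not appear in compiled_archs, the installed build likely has no kernels for that card, which is one common cause of "no kernel image is available" errors.

```python
def describe_cuda_support():
    """Collect facts that help diagnose CUDA build/GPU mismatches."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Compute capability of the first visible GPU, e.g. (8, 6).
        major, minor = torch.cuda.get_device_capability(0)
        info["device_name"] = torch.cuda.get_device_name(0)
        info["device_arch"] = f"sm_{major}{minor}"
        # Architectures this PyTorch binary was compiled for.
        info["compiled_archs"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_support().items():
        print(f"{key}: {value}")
```

Pasting this output into the issue alongside the traceback should make it easier to tell a build mismatch apart from a bug in the package.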