alexeedm / pytorch-fortran

Pytorch bindings for Fortran
MIT License
85 stars 12 forks source link

torch_tensor_to_array: pointer needs to be unassociated! #8

Closed sebk26 closed 1 year ago

sebk26 commented 1 year ago

Thank you @alexeedm for this nice repository.

I've build you're code with GCC 11.3.0 without container using the make_gnu.sh script on an AMD EPYC CPU 7352.

I did not used conda but pip but this should'nt be an issue here, i guess.

Environment

GCC 11.3.0 CUDA 11.8.0 NCCL 2.15.5 Python 3.10.4 PyTorch 2.0

Build script modifications to run with pip

-PYPATH=$(find /opt/conda/lib/ -maxdepth 1 -name 'python?.*' -type d)
+VIRTUAL_ENV="path/to/your/virtualenv/myenv"
+PYPATH=${VIRTUAL_ENV}/lib/python3.10

Known Issue with gnu compiler

As mentioned by @ch21d012 in #5 adding the proposed attribute target to value fixes the error. (I also tried using pointer attribute instead, but this fails to compile) So I edited this in the template file ./src/f90_bindings/torch_ftn.f90.templ

-        {dtype.fortran_id} ({dtype.fortran_prec}), intent(in)    :: value
+        {dtype.fortran_id} ({dtype.fortran_prec}), intent(in), target    :: value

Error

Apparently, I get a similar error as @ch21d012 mentioned in #6.

> ./install/bin/resnet_forward  ../examples/resnet_forward/traced_model.pt                                                                                                 
terminate called after throwing an instance of 'std::runtime_error'                                                                         
  what():  torch_tensor_to_array: pointer needs to be unassociated!                                                                         

Program received signal SIGABRT: Process abort signal.                                                                                      

Backtrace for this error:                                                                                                                   
#0  0x2af1b94ff3ff in ???                                                                                                                   
#1  0x2af1b94ff387 in ???                                                                                                                   
#2  0x2af1b9500a77 in ???                                                                                                                   
#3  0x2af1ff0e3879 in _ZN9__gnu_cxx27__verbose_terminate_handlerEv                                                                          
        at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95                                                                              
#4  0x2af1ff0ef2e9 in _ZN10__cxxabiv111__terminateEPFvvE                                                                                    
        at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48                                                                            
#5  0x2af1ff0ef354 in _ZSt9terminatev                                                                                                       
        at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:58                                                                            #6  0x2af1ff0ef5a8 in __cxa_throw                                                                                                           
        at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:95                                                                                
#7  0x2af1b8d382c0 in torch_throw_cpp 
        at pytorch-fortran/src/proxy_lib/torch_proxy.cpp:169                          
#8  0x2af1b8d20b5d in __torch_ftn_MOD_torch_tensor_to_2_fp32                                                                                
        at pytorch-fortran/gnu/build_fortproxy/torch_ftn.f90:729                      
#9  0x4014b0 in resnet_forward                                                                                                              
        at pytorch-fortran/examples/resnet_forward/resnet_forward.f90:53              
#10  0x40115c in main                                                                                                                       
        at pytorch-fortran/examples/resnet_forward/resnet_forward.f90:23              
Aborted                                                                                        

This happens for the resnet_forward, polynomial and python_training example every time an array pointer is passed to out_tensor%to_array(output).

My (probably dirty) workaround

Calling a nullify(output) somewhere before call out_tensor%to_array(output) solves the problem and the tensor is able to be converted to an array and printed.


I justed wanted to leave this here, in case someone else is struggling. Let me know, if there are other fixes that I missed and with which I can also compile the pytroch-fortran bindings with the GNU GCC compiler.

alexeedm commented 1 year ago

Hi @sebk26 Seems like I've wrongly treated a pointer which has only been declared. Apparently the standard says that the pointer is undefined (rather than unassociated). My bad. The check here was mainly to prevent the unnoticed memory leaks, when the pointer that was allocated elsewhere is passed to to_array function. It's probably an overkill to check that, I'll just have to remove this check

sebk26 commented 1 year ago

Hi @alexeedm But if I understood the way fortran is handling pointers right, your check is correct. If a pointer is associated with a target and then reassigned by e.g. the to_array func but the target is not deallocated, this could lead to a memory leak indeed. I just don't know what Best-Practise is in terms of handling declared-but-not-assigned pointers.

I would suggest just fixing the examples and calling nullify after declaration of output.

alexeedm commented 1 year ago

@sebk26 I thought exactly the same, but seems we're wrong. At least up to Fortran 2008 the standard says:

16.5.2.2 Pointer association status
A pointer may have a pointer association status of associated, disassociated, or undefined. Its association status
may change during execution of a program. Unless a pointer is initialized (explicitly or by default), it has an
initial association status of undefined. A pointer may be initialized to have an association status of disassociated
or associated

And then:

13.7.16 ASSOCIATED (POINTER [, TARGET])
...
3 Arguments
POINTER shall be a pointer. It may be of any type or may be a procedure pointer. Its pointer association
status shall not be undefined.

So that's my mistake actually

sebk26 commented 1 year ago

But this means, that inserting a nullify is the proposed fix?

alexeedm commented 1 year ago

Correct, I just don't want to force users to do so. I'm removing the check instead

alexeedm commented 1 year ago

It's ok, I'll do it myself, I need to patch the issue with target anyways. Thanks for your help!

sebk26 commented 1 year ago

I had a second pull request with that ready as well but I guess this is obsolete now.

Closing this issue as soon as your changes are pushed.

alexeedm commented 1 year ago

I've just updated the repo, please check it out. Your issues should be fixed now

sebk26 commented 1 year ago

Checked out and can confirm that #5 and this issue are fixed now. Thank you @alexeedm !