Segfault when using timeloop-python with Tensorflow

suyashbakshi commented 2 years ago

Hello, I'm using timeloop-python bindings from a python project that involves using TensorFlow.

However, importing tensorflow while using timeloop-python bindings results in a seg fault. It appears that there is a conflict when using timeloop-python bindings in a program that also uses tensorflow/keras. My guess is that this is due to tensorflow and timeloop-python packages both having submodules named "Model". The issue can be reproduced by simply importing tensorflow into the "examples/model.py" example.

I understand this is a non-trivial issue, and it probably does not concern normal usage of this package. But I would appreciate any suggestions to solve this problem.

Otherwise, one possible solution I could think of is to first write the mappings into a YAML file, and then use an "os.system()" call to invoke an example script similar to "examples/model.py" that reads the YAML and evaluates the mapping. But obviously, this defeats the whole purpose of having these Python bindings.
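A minimal sketch of that out-of-process workaround, under stated assumptions: it uses `subprocess.run` rather than `os.system()` so that the child's output and exit status can be captured, and the `script` argument stands for a hypothetical stand-alone script (similar to "examples/model.py") that takes the path of a mapping YAML file as its only argument. Neither the function name nor that command-line convention comes from timeloop-python.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def evaluate_in_subprocess(script, mapping_yaml_text, timeout=600):
    """Write the mapping to a temporary YAML file and run `script` on it in a
    fresh interpreter, so the evaluation never shares a process with TensorFlow.

    `script` is a hypothetical evaluation script that accepts the mapping
    file path as its first command-line argument.
    """
    with tempfile.TemporaryDirectory() as tmp:
        mapping_path = Path(tmp) / "mapping.yaml"
        mapping_path.write_text(mapping_yaml_text)
        result = subprocess.run(
            [sys.executable, str(script), str(mapping_path)],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # raise CalledProcessError if the child fails or crashes
        )
    # Whatever the script printed (e.g. statistics) is returned as text.
    return result.stdout
```

Each call pays the cost of starting a new interpreter and re-importing everything, which is why this approach is slow when it is invoked once per mapping.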

I'd appreciate any help.

Thank you

suyashbakshi commented 2 years ago

Hello, is there any possible solution to this? I would really appreciate any help.

Thanks

angshuman-parashar commented 2 years ago

Hi @suyashbakshi thank you for bringing this to our attention. I agree with you that this is probably a non-trivial issue, but I disagree that this is not an instance of "normal" use of this package. We definitely want to support this use case. You are also correct that going via a YAML file defeats the entire purpose of this approach.

We apologize for the delay and would truly appreciate your help in debugging this. Does renaming "Model" to something else, e.g., "TimeloopModel" resolve the issue?

suyashbakshi commented 2 years ago

Thank you for the response. I'm guessing that renaming would solve the problem, given that the simple act of importing TensorFlow while using timeloop-python causes the program to crash. Whether there are any other dependencies or issues causing the crash, I'm not entirely sure.

angshuman-parashar commented 2 years ago

Since you have everything set up could you test your hypothesis on the renaming (maybe rename it to TimeloopModel) and let us know? Or preferably even submit a patch?

gilbertmike commented 2 years ago

I'm also going to try to reproduce. I'll get back with my results.

gilbertmike commented 2 years ago

Update: I couldn't reproduce the error. I don't see a module named Model in PyTimeloop. I see ModelApp from pytimeloop.app and the module pytimeloop.model. I also can't find a Tensorflow module named model.

@suyashbakshi Could you give me more information about how you produced the error?

I agree with @angshuman-parashar that this is not an uncommon use of this package. It's quite an important issue.

suyashbakshi commented 2 years ago

@angshuman-parashar I will try and get back.

@gilbertmike You can reproduce the error/crash by importing TensorFlow in the "examples/model.py" script.

The same error/crash occurs for my own project, where if I have the following imports in my script:

from pytimeloop.app import model
from pytimeloop import Config

and at the same time I import Tensorflow module as: import tensorflow as tf

gilbertmike commented 2 years ago

Ok, I see the error. Thank you.

gilbertmike commented 2 years ago

I haven't found the root cause of the issue yet, but I think you might be right. I backtraced the crash with gdb and saw that TensorFlow was calling pybind routines. PyTimeloop also uses pybind. Renaming doesn't help.

Anyway, I found that if I import Tensorflow before PyTimeloop, I don't see the crash.
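One way to compare import orders without crashing the session doing the comparison is to run each order in a child interpreter and inspect how it exited. The sketch below is only an illustration: `try_imports` is a hypothetical helper, and the module names in the usage comment are the ones mentioned in this thread.

```python
import signal
import subprocess
import sys


def try_imports(modules):
    """Import `modules` in the given order inside a child interpreter and
    report how the child exited, so a crash cannot take down the caller."""
    code = "; ".join(f"import {name}" for name in modules)
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return "ok" if proc.returncode == 0 else f"exit code {proc.returncode}"


# Hypothetical usage for the two orders discussed here:
# print(try_imports(["pytimeloop", "tensorflow"]))
# print(try_imports(["tensorflow", "pytimeloop"]))
```

Because the child exits right after the imports, this only catches crashes that happen at import time or at interpreter shutdown, not ones triggered later by actual use of either package.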

suyashbakshi commented 2 years ago

@gilbertmike yes I had tried moving the import before/after each other and you're right that putting Tensorflow import before PyTimeloop doesn't cause the crash. But then the error reported is:

Cannot find 8 under root key: variables
ERROR: bad conversion, at line:

To be clear, the error message is not truncated: that is the full error as reported, with no line number or source file name to indicate where it originated.

For the same arch, prob and map (YAML) files, the native timeloop-model tool does report the expected statistics. The expected statistics are also reported if I use the PyTimeloop "examples/model.py" without the TensorFlow import. So there's no problem with the input files.

gilbertmike commented 2 years ago

That's interesting. I'll have a look at it when I get back from work today.

Update: I haven't found the root cause for this, but here are my findings. I figured out the sequence of events leading up to the error:

1. YAML files are parsed by PyYAML.
2. PyTimeloop config objects are created.
3. A YAML string dump of the "architecture" section of the config object is generated.
4. The YAML string is passed into Timeloop's config object ctor.
5. Error.

Interestingly, this error goes away when Tensorflow is not imported. This is a tricky one.

Based on this, I'm not sure if naming conflict is the cause of the error since the modules are already loaded at this point.

I'm going to add this to the testing suite.

@suyashbakshi Unfortunately, the bug is tricky enough that I'm not sure I can get a fix for this out soon. You might want to try the alternative method of using os.system if this is a roadblock in your project. Sorry about that.

gilbertmike commented 2 years ago

I took another look at this bug. I think I'm closer to the root cause.

Copy of backtrace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737350250496) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737350250496) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737350250496, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7c8e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c747f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cd56f6 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7e27b8c "%s\n")
    at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007ffff7cecd7c in malloc_printerr (str=str@entry=0x7ffff7e25764 "free(): invalid pointer") at ./malloc/malloc.c:5664
#7  0x00007ffff7ceeac4 in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at ./malloc/malloc.c:4439
#8  0x00007ffff7cf14d3 in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#9  0x00007ffff66dbc1a in std::locale::_Impl::~_Impl() () from /home/gilbertm/timeloop-project/timeloop/env/../lib/libtimeloop-mapper.so
#10 0x00007ffff5a8cb26 in std::locale::~locale() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#11 0x00007fffc929352b in tensorflow::python_op_gen_internal::AttrValueToPython(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tensorflow::AttrValue const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /home/gilbertm/timeloop-project/venv/lib/python3.10/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

@angshuman-parashar Do you know where std::locale::_Impl::~_Impl() (frame #9 in the backtrace) would be called in Timeloop? I think it's double-freed. I think yaml-cpp might use std::locale, which would also explain the parse error.
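The backtrace above shows signo=6, which on Linux is SIGABRT, raised by glibc after the "free(): invalid pointer" check. Python's standard `faulthandler` module can print the Python-level stack when such a fatal signal arrives, which complements the gdb view by showing which Python statement was executing. A minimal sketch, where `enable_crash_traceback` is a hypothetical wrapper name:

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None):
    """Ask CPython to dump the Python stack of all threads to `stream`
    (default: sys.stderr) if the process receives a fatal signal such as
    SIGSEGV or SIGABRT. The stream must be backed by a real file descriptor."""
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage at the very top of the failing script, before the
# imports under investigation:
# enable_crash_traceback()
# from pytimeloop.app import model
# import tensorflow as tf
```

The same handler can be switched on without editing the script by running it as `python3 -X faulthandler model.py`.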

suyashbakshi commented 2 years ago

Thanks @gilbertmike, unfortunately that is quite a big roadblock. For my project, an execution that should take on the order of 2-3 minutes takes more than an hour to complete with os.system. I will keep trying alternate ways to invoke timeloop-model from my project.

gilbertmike commented 2 years ago

> Thanks @gilbertmike , unfortunately that is quite a big roadblock. For my project, an execution that should take order of 2-3 minutes, takes more than an hour to complete. I will keep trying alternate ways to invoke the timeloop-model from my project.

Do you mean using PyTimeloop takes more than an hour compared to 2-3 minutes native? (Never mind, I misunderstood)

Could you provide more detail about what you're trying to do?

angshuman-parashar commented 2 years ago

@gilbertmike we used to use std::locale explicitly, but as far as I can tell we aren't doing that any more. It's possible that yaml-cpp (or boost?) is using it, though. @nellie-wu do you have any idea?

suyashbakshi commented 2 years ago

No, that's not what I meant. My project essentially requires invoking timeloop-model in a loop, and within each iteration a mapping needs to be evaluated. Up till now, I've been using the cost model from the "ZigZag" framework, and when I use that framework to evaluate mappings, my project takes 2-3 minutes to execute. But "ZigZag" does not support mappings for 3D CNNs. My reason for using Timeloop is that it supports 3D CNNs.

It is part of my research work, which I would be happy to chat about in detail via email.

gilbertmike commented 2 years ago

> No, that's not what I meant. My project, essentially requires invoking timeloop-model in a loop, and within each iteration, a mapping needs to be evaluated. Up till now, I've been using the cost model from "ZigZag" framework, and using their framework to evaluate mappings, my project takes 2-3 minutes to execute. But "ZigZag" does not support mappings for 3D CNNs. My reason to use timeloop is because it supports 3D CNNs.
>
> It is part of my research work, which I would be happy to chat about in detail via email.

@suyashbakshi Sure, you can reach me at gilbertm@mit.edu. Maybe we can work something out.

suyashbakshi commented 1 year ago

Hello, I managed to circumvent this issue. I am attaching my solution and a brief explanation below:

I modified the "app.Model.ModelApp()" class slightly and have its modified version in the "model.py" script (in the ZIP). The ZIP also has input files needed to run the script.

From my program, I collect all the potential mappings that I want to evaluate in a list and dump them into a pickle file (map/map.pickle). I load this pickle file in the model.py script (along with arch/arch.yaml and prob/prob.yaml) and use Python's multiprocessing module to spawn multiple processes, distribute the mappings in the list among them, and evaluate them.
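The pickle hand-off described above can be sketched roughly as follows. This is not the code from the attached ZIP: the helper names are hypothetical, and mappings are assumed to be plain picklable Python objects such as dicts.

```python
import pickle
from pathlib import Path


def dump_mappings(mappings, path):
    """Producer side: store the list of candidate mappings in one pickle file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(list(mappings), f)


def load_mappings(path):
    """Consumer side: read the list back. Only unpickle files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


def split_round_robin(items, n):
    """Split `items` into `n` interleaved chunks, one per worker process."""
    return [items[i::n] for i in range(n)]
```

Writing all mappings once and evaluating them in a single separate run avoids paying interpreter start-up cost per mapping, which is the main overhead of the os.system approach.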

Further, I modified the ModelApp class (https://github.com/Accelergy-Project/timeloop-python/blob/ca64f6c314c020e7498f0d102940748475db153b/pytimeloop/app/model.py) to not expect a "mapping" while creating the ModelApp class' object. Rather, the mapping is provided as part of the config in run(). Since the mappings belong to the same arch and prob, creating a new object for every new mapping does not make sense.

There is only one thing to keep in mind. The call to create a "ModelApp()" object (line #104 in model.py) does not seem to be multiprocessing friendly. By that I mean: if I do not put a lock around that call, then whichever process calls it first, is the process that will proceed to perform its work, and the rest of the processes will get killed and won't perform any evaluation.
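A rough sketch of that locking pattern, assuming a `multiprocessing.Pool` whose workers each build one evaluator under a shared lock and then reuse it for every mapping they receive. `StubModelApp` is a hypothetical stand-in so the example runs without PyTimeloop; it is not the real ModelApp API, and the other function names are likewise illustrative.

```python
import multiprocessing as mp


class StubModelApp:
    """Hypothetical stand-in for the modified ModelApp: built once from the
    arch and prob, then given a mapping on each run() call."""

    def __init__(self, arch, prob):
        self.arch = arch
        self.prob = prob

    def run(self, mapping):
        # Placeholder "evaluation": just report the size of the mapping.
        return len(mapping)


_app = None  # one evaluator per worker process


def init_worker(lock, arch, prob):
    """Pool initializer: serialize construction across processes."""
    global _app
    with lock:
        _app = StubModelApp(arch, prob)


def evaluate(mapping):
    """Evaluate one mapping with this worker's evaluator."""
    return _app.run(mapping)


def evaluate_all(mappings, arch, prob, processes=4):
    """Distribute `mappings` over worker processes and collect the results."""
    lock = mp.Lock()  # shared with workers through the initializer arguments
    with mp.Pool(processes, initializer=init_worker,
                 initargs=(lock, arch, prob)) as pool:
        return pool.map(evaluate, mappings)
```

When called from a script, `evaluate_all` should be invoked under an `if __name__ == "__main__":` guard so that it also works with start methods that re-import the main module.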

Here's the description of files:

model.py: Contains driver method to load arch, prob and mapping files and initiate Multiprocessing. Also contains modified version of "ModelApp" class.
arch/arch.yaml: a sample architecture
prob/prob.yaml: a sample problem
map/map.pickle: Pickle file containing a list of mappings belonging to the arch and prob.

To run, simply execute: python3 model.py

For now, this solves my problem with using PyTimeloop. Although it would be nice to know what's wrong with using this package with TensorFlow. Thank you for all the help, greatly appreciate it :)

pytimeloop_model_example.zip

gilbertmike commented 10 months ago

I merged a large PR recently. As part of the PR, the front end of PyTimeloop was significantly upgraded. It should be much more stable now. That might fix the issue.

If you don't mind trying it again, that would be greatly appreciated.

Accelergy-Project / timeloop-python

Segfault when using timeloop-python with Tensorflow #18