exasol / pyexasol

Exasol Python driver with low overhead, fast HTTP transport and compression
MIT License

Question regarding the subprocess for http_transport and non default module search paths #67

Closed tkilias closed 3 years ago

tkilias commented 3 years ago

Hi @wildraid,

We saw somewhat strange behavior from pyexasol in AWS Lambda. Everything except the http_transport works. We saw the following error message:

/var/lang/bin/python3.7: Error while finding module specification for 'pyexasol_utils.http_transport' (ModuleNotFoundError: No module named

With the following stacktrace:

df = C.export_to_pandas(stmt)                                         # Get the result (prefix/folders to be imported) of the stmt-query into a dataframe

File "/opt/python/pyexasol/connection.py", line 271, in export_to_pandas
    return self.export_to_callback(cb.export_to_pandas, None, query_or_table, query_params, callback_params, export_params)
File "/opt/python/pyexasol/connection.py", line 335, in export_to_callback
    raise sql_thread.exc
File "/opt/python/pyexasol/http_transport.py", line 34, in run
    self.run_sql()
File "/opt/python/pyexasol/http_transport.py", line 153, in run_sql
    self.connection.execute("\n".join(parts))
File "/opt/python/pyexasol/connection.py", line 186, in execute
    return self.cls_statement(self, query, query_params)
File "/opt/python/pyexasol/statement.py", line 55, in __init__
    self._execute()
File "/opt/python/pyexasol/statement.py", line 159, in _execute
    'sqlText': self.query,
File "/opt/python/pyexasol/connection.py", line 572, in req
    raise cls_err(self, req['sqlText'], ret['exception']['sqlCode'], ret['exception']['text'])
END

My guess is that starting the subprocess for the http_transport fails (see the source reference below) because the pyexasol_utils module is not in the default module search path, and the search path of the parent process was modified.

https://github.com/badoo/pyexasol/blob/3b5211fa78e4d83ea16e11532048f6cdcaeab43d/pyexasol/http_transport.py#L244

Next, I would try to get additional information about the environment with:

import sys
import sysconfig
import os

print(sys.flags)
print(sys.path)
print(sysconfig.get_paths())
print(os.environ)
print(sysconfig.get_config_vars())

Any thoughts?

littleK0i commented 3 years ago

Can you set multiple directories in PYTHONPATH, like this: https://stackoverflow.com/questions/39682688/how-to-set-pythonpath-to-multiple-folders ?

Include the standard directory AND the custom directory at the same time.

littleK0i commented 3 years ago

In theory, it is possible to rewrite it from subprocess to multiprocessing (fork), but it would lead to loss of compatibility with Windows OS.

Windows is a bit lame for big data, but a lot of Exasol users still rely on it.

tkilias commented 3 years ago

I am actually not sure what AWS Lambda does to the environment, so I first need to check this and maybe try to reproduce the error locally.

tkilias commented 3 years ago

I have the feeling that subprocess is fine, but we might need to start it with the right environment and Python interpreter command-line options. I will come back as soon as I have further information.

littleK0i commented 3 years ago

subprocess.Popen() should inherit the environment variables from the parent process.

Try dumping os.environ and see how it goes.

tkilias commented 3 years ago

OK, yes, PYTHONPATH would work; I tested it locally. I didn't think it would be this, but better to be sure. However, there are a few other ways to modify the module search path. For example, modifying sys.path is not uncommon. This is the next thing I will try.

tkilias commented 3 years ago

@wildraid OK, I also tried modifying sys.path before importing pyexasol, and this reproduces the error. I packaged my small test and attached it here: pyexasol_subprocess.tar.gz. Extract the tar, fix the DSN in the Python files, and run the following:

docker build -t pyexasol_test  .
docker run --net=db_network_test pyexasol_test bash run_tests.sh

The first test uses PYTHONPATH and is successful; the second uses sys.path and fails.

Tomorrow I am going to check what other mechanisms might modify the module search path and which one AWS Lambda uses.

Have a good evening.

littleK0i commented 3 years ago

Well, it is easy to explain such results.

Changing sys.path affects the current Python interpreter only. But PYTHONPATH is an environment variable, which is automatically inherited by the subprocess and applied to the newly started Python interpreter.
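The difference can be reproduced without pyexasol or a database. The following is a minimal sketch using only the standard library; the temporary directory and the helper name `child_sees` are made up for illustration and are not part of pyexasol.

```python
import os
import subprocess
import sys
import tempfile


def child_sees(path, env=None):
    """Start a fresh interpreter and report whether `path` is on its sys.path."""
    code = (
        "import os, sys; "
        "target = os.path.realpath(sys.argv[1]); "
        "print(any(os.path.realpath(p) == target for p in sys.path))"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, path],
        env=env,  # None means: inherit the parent's environment unchanged
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


# Hypothetical directory standing in for a custom package location.
extra_dir = tempfile.mkdtemp()

# 1) Modify sys.path in the parent only: this lives in the parent's memory,
#    so a newly started interpreter knows nothing about it.
sys.path.append(extra_dir)
via_sys_path = child_sees(extra_dir)

# 2) Pass the directory through PYTHONPATH: the child reads it at startup.
child_env = dict(os.environ, PYTHONPATH=extra_dir)
via_pythonpath = child_sees(extra_dir, env=child_env)

print("visible via sys.path:", via_sys_path)
print("visible via PYTHONPATH:", via_pythonpath)
```

On a typical CPython installation the first check should report that the directory is not visible to the child, and the second that it is, which matches the two test results described above.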

tkilias commented 3 years ago

@wildraid Yep, you are completely right, and I think AWS Lambda does that for some reason to allow the use of additional Python packages. If I read the Python documentation correctly, sys.path is the only other way to manipulate the module search path besides PYTHONPATH. So a workaround could be to append sys.path to the PYTHONPATH of the current process before calling export_to_pandas, as in the following example:

import os
import sys

# Export the parent's sys.path via PYTHONPATH so that subprocesses inherit it
additional_python_path = os.pathsep.join(sys.path)
if "PYTHONPATH" not in os.environ:
    new_python_path = additional_python_path
else:
    current_python_path = os.environ["PYTHONPATH"]
    # os.pathsep is ":" on POSIX and ";" on Windows
    new_python_path = current_python_path + os.pathsep + additional_python_path
os.environ["PYTHONPATH"] = new_python_path

import pyexasol

c=pyexasol.connect(dsn="172.18.0.2:8888", user="sys", password="exasol")
df=c.export_to_pandas("select * from test.comp1")

A cleaner solution could be to add sys.path to the PYTHONPATH variable only in the environment of the subprocess, using the env argument. What do you think?
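A rough sketch of that idea, under the assumption that the subprocess is started with `subprocess.Popen`. The helper `env_with_sys_path` is hypothetical, and the child command here is only a stand-in; it is not the command line pyexasol actually uses.

```python
import os
import subprocess
import sys


def env_with_sys_path(base_env=None):
    """Return a copy of the environment whose PYTHONPATH also contains sys.path.

    The parent's os.environ is left untouched; only the copy is modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Skip '' entries, which stand for the current working directory.
    entries = [p for p in sys.path if p]
    existing = env.get("PYTHONPATH")
    if existing:
        # Keep whatever was already configured at the front.
        entries.insert(0, existing)
    env["PYTHONPATH"] = os.pathsep.join(entries)
    return env


# Illustrative child process only (an assumption, not pyexasol's real command).
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; print(len(sys.path))"],
    env=env_with_sys_path(),
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()
```

Because only a copy of the environment is changed, the calling application's own PYTHONPATH and sys.path stay as they were.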

littleK0i commented 3 years ago

Can we just set a global PYTHONPATH for the specific use case involving AWS Lambda? And call it a day.

We definitely cannot touch sys.path or PYTHONPATH inside the library code, since it can mess up the higher level applications using pyexasol.

tkilias commented 3 years ago

Maybe, not sure. At the moment, I can't test it with AWS Lambda. One problem I can think of is that the path to the module code might not be static in AWS Lambda. In that case, the Lambda code would need to change the PYTHONPATH variable, as in my example.
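If the calling code has to change PYTHONPATH itself, one way to limit the side effects is to change it only for the duration of the call and restore it afterwards. This is a sketch under that assumption; the context manager name `pythonpath_from_sys_path` is made up for illustration.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def pythonpath_from_sys_path():
    """Temporarily export sys.path through PYTHONPATH, then restore the old value."""
    previous = os.environ.get("PYTHONPATH")
    # Skip '' entries, which stand for the current working directory.
    entries = [p for p in sys.path if p]
    if previous:
        entries.insert(0, previous)
    os.environ["PYTHONPATH"] = os.pathsep.join(entries)
    try:
        yield
    finally:
        # Put the environment back exactly as it was found.
        if previous is None:
            os.environ.pop("PYTHONPATH", None)
        else:
            os.environ["PYTHONPATH"] = previous
```

A call such as export_to_pandas would then be placed inside a `with pythonpath_from_sys_path():` block, so that any subprocess started during the call inherits the extended variable. Note that os.environ is process-wide, so other threads would see the temporary value as well.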

littleK0i commented 3 years ago

@tkilias , I suspect you've managed to resolve this issue using PYTHONPATH.

Is it the case? :)

tkilias commented 3 years ago

@wildraid I think so. The AWS Lambda environment is quite specific, so a general fix in the code is probably not productive and might cause more problems than it solves. Adding sys.path to the PYTHONPATH environment variable within the same Python process is at least a workaround. For that reason, I am going to close the ticket. Thanks for your help.