cloudera / impyla

Python DB API 2.0 client for Impala and Hive (HiveServer2 protocol)
Apache License 2.0

Error in running python scripts to initiate subprocess #63

Closed abhinavmishra590 closed 9 years ago

abhinavmishra590 commented 9 years ago

I have two python scripts which I am using to initiate subprocess. Following is the structure of my scripts:

MAIN.py

This script does nothing except start the whole process by calling another Python script through subprocess.

paramet = ['parameter']
result = subprocess.Popen([sys.executable, "./sub_process.py"] + paramet)
result.wait()

sub_process.py

This is the script that first executes a bunch of SQL statements and then initiates another subprocess by calling itself again. The number of subprocesses spawned depends on the parameter passed with each subprocess call. The SQL statements are mostly INSERT statements.

conn = connect(host=host_ip, port=21050, timeout=3600)
cursor = conn.cursor()

sql = "SQL statement"
cursor.execute(sql)
result_set = cursor.fetchall()
for col1, col2 in result_set:
    sql = "SQL statement 2"
    cursor.execute(sql)

    sql = "SQL statement 3"
    cursor.execute(sql)

    if paramet == 'para':
        result = subprocess.Popen([sys.executable, "./sub_process.py"] + paramet)
result.wait()

This setup works fine on my local machine, but when I run the same setup on the server it throws the following error:

  File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 151, in execute
    self._execute_sync(op)
  File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 159, in _execute_sync
    self._wait_to_finish()  # make execute synchronous
  File "/usr/local/lib/python2.7/site-packages/impala/dbapi/hiveserver2.py", line 181, in _wait_to_finish
    raise OperationalError("Operation is in ERROR_STATE")
OperationalError: Operation is in ERROR_STATE

Even if I execute only one SQL statement instead of a bunch of them, the same error occurs on the server but not on my local machine. I tried to find a reason for this but couldn't find anything that explains the error.

I have CentOS 6.6 on my server

How can I find the reason for this and resolve it? Also, if it cannot be resolved, what would be a workaround to get the same behavior? What I basically want is that once a process reaches a certain point in its execution, it starts another process and keeps executing alongside it. Once it is done, it should wait for the subprocess to finish before exiting.
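
For reference, the spawn, continue, then wait pattern described above can be sketched with `subprocess` alone. This is a minimal illustration under stated assumptions, not the poster's script: `spawn_child`, `do_own_work`, `run`, and the inline child command are hypothetical stand-ins, and no Impala connection is involved.

```python
import subprocess
import sys


def spawn_child(args):
    """Start a child Python process without blocking the parent."""
    # Hypothetical stand-in for "./sub_process.py": the child echoes its arguments.
    child_code = "import sys; print('child got', sys.argv[1:])"
    return subprocess.Popen([sys.executable, "-c", child_code] + list(args))


def do_own_work():
    """Placeholder for the parent's remaining SQL statements."""
    return sum(range(1000))


def run(spawn):
    child = None  # stays None when no child is started
    if spawn:
        child = spawn_child(["parameter"])
    result = do_own_work()  # the parent keeps running while the child executes
    # Only wait when a child was actually started.
    exit_code = child.wait() if child is not None else None
    return result, exit_code


if __name__ == "__main__":
    print(run(spawn=True))
```

Initialising `child` to `None` keeps the final `wait()` safe when the spawn condition is never met. Checking the child's exit code is one way to notice that a spawned script failed.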

laserson commented 9 years ago

Sorry you're having trouble, @abhinavmishra590. It looks like you're trying to do something pretty tricky. First off, have you considered the Python threading or multiprocessing modules?

Also, what version of CDH, Impala, and Impyla are you using?

Also, if you look at the Impala web UI, is there any additional information given on the queries that fail? It's strange to me that your scripts would work from your local machine but not from a server on the cluster.

laserson commented 9 years ago

Also, is it possible for you to implement the recursion from a single script, rather than having a script spawn a new shell and call itself?

abhinavmishra590 commented 9 years ago

@laserson I have CDH 5.2.0, Impala 2.0 and impyla 0.9.

abhinavmishra590 commented 9 years ago

@laserson I wanted to have your opinion on this. What I basically want is to start a process, and when this process reaches a certain point in its execution (say, a counter value), it should start another process that goes through the same execution and, after a certain time, spawns yet another process. This goes on until we hit a stopping criterion, after which no new process is spawned and the processes start to unwind.

When a process spawns another one, it should keep executing concurrently with the new process. Once the parent process finishes its own work, it should wait for the spawned process to complete, which in turn waits for the process it spawned.

Which of Python threading and multiprocessing would be the better way of achieving this? If you could provide some pseudocode, that would be helpful.

laserson commented 9 years ago

I think the threading module will simply give you another way to describe the concurrent programming you want to accomplish.
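
As a rough illustration of how the threading module could describe this recursion from a single script, here is a minimal sketch. The names `run_level` and `MAX_DEPTH` are hypothetical, the event log stands in for the real SQL work, and it assumes each level would open its own connection and cursor rather than share one.

```python
import threading

MAX_DEPTH = 3  # hypothetical stopping criterion


def run_level(depth, log, lock):
    """Do part of the work, start the next level, finish, then wait for it."""
    with lock:
        log.append(("start", depth))

    child = None
    if depth < MAX_DEPTH:
        # Reached the "certain point": start the next level concurrently.
        child = threading.Thread(target=run_level, args=(depth + 1, log, lock))
        child.start()

    # ... the rest of this level's work would run here, alongside the child ...

    if child is not None:
        child.join()  # wait for the spawned level, which waits for its own child
    with lock:
        log.append(("done", depth))


if __name__ == "__main__":
    events = []
    run_level(1, events, threading.Lock())
    print(events)
```

Because every level joins its child before recording that it is done, the levels finish from the deepest one outward. That is the "roll back down" behaviour described earlier in the thread.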

Either way, that still doesn't solve your problem. For the query that errors out, does it even make it to Impala? If so, what does the query profile look like?

abhinavmishra590 commented 9 years ago

@laserson the error it throws is the one I mentioned in my post above. It does not provide any other information. I am not sure how to check whether the query reaches Impala, but the queries are INSERT statements and nothing gets inserted.

laserson commented 9 years ago

So if the impalad hostname you're connecting to is: my.impala.host.com, you should be able to open a web browser at my.impala.host.com:25000 (the default port, though it may be configured differently for you).

abhinavmishra590 commented 9 years ago

@laserson that way it works. I have been running the same queries on the same server manually, but when I tried to automate them through subprocess they started to throw errors.

laserson commented 9 years ago

Which way does it work? What I am suggesting is to run your scripts in a way that generates errors, and then go see if there is more information on the errors on the Impala web UI.

abhinavmishra590 commented 9 years ago

@laserson I tried Python threading and still got the same errors as with subprocess.

abhinavmishra590 commented 9 years ago

@laserson also, I see those queries in Impala, but their state is shown as exception. When I click on the profile I see a lot of information. Which information is worth looking into?

abhinavmishra590 commented 9 years ago

@laserson the profile of the queries shown as exception says "Query Status: Memory limit exceeded".

laserson commented 9 years ago

That certainly helps clarify things. When you run it locally versus on your server, are you running it exactly the same? Perhaps there are more queries being generated on the server?
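
If concurrently running queries do turn out to be a factor (the thread does not confirm this), one option in the single-script, threaded variant is to cap how many statements run at once. The sketch below is only an illustration: `run_query` and `MAX_CONCURRENT` are hypothetical names, and `execute` is any callable such as a DB API cursor's `execute` method. A `threading` semaphore does not coordinate separate processes, so it would not help the `subprocess` variant.

```python
import threading

MAX_CONCURRENT = 2  # hypothetical cap; tune to the cluster's memory headroom
_slots = threading.BoundedSemaphore(MAX_CONCURRENT)


def run_query(execute, sql):
    """Call execute(sql) while holding one of the limited slots."""
    with _slots:  # blocks while MAX_CONCURRENT statements are already running
        return execute(sql)


# Example call from a worker thread, assuming that thread owns `cursor`:
#     run_query(cursor.execute, "SQL statement 2")
```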

abhinavmishra590 commented 9 years ago

@laserson I execute the same script on my local machine as well

laserson commented 9 years ago

Hmm, this is certainly strange. My guess is that this is probably not impyla-related. Perhaps try re-asking this question on the impala-user mailing list?

laserson commented 9 years ago

I'm going to close this for now. Feel free to open another issue if there are other problems.