Open vkuznet opened 5 years ago
Jenkins results:
Jenkins results:
@yuyiguo just to give you a context. The changes Valentin is proposing here are meant to fix a curl problem that we hit every day in production. Error reported here: https://github.com/dmwm/WMCore/issues/9221
Thanks, Valentin!
@vkuznet @amaltaro We are not going to deploy this For June, right? I will come back this late..
Given the high impact of this change, I'd recommend against pushing it now and having only a couple of days to validate it.
I need to do the code review and testing before cmsweb-testbed. It would be really hard to make it since production is next Tuesday. Will try, but not guarantee.
@yuyiguo what I meant is that you should skip it from this cycle and wait for the July upgrade. It's too dangerous to insert such changes at this stage.
@amaltaro Thanks for correcting me. I read your message too quick . yes, put it in July release was my plan too.
BTW, this will of course happen more frequently now that we have DBS running with twice more threads. Which I can confirm by scanning one of the agents logs, it happens 10s of times per day.
@amaltaro How many concurrent DBS connections wmagents need? How often these connections shared/reused?
@vkuznet "setting curl options independently from request execution" does not mean to make the connection pool. Curl options can be set on single connection too? How to understand that the original issue has to be solved by connection pool?
Yuyi, DBS is used in a few different places in central production and most of them are just sequential, so a single thread handling DBS requests. For the DBS3Upload component, which uploads/inserts data into DBS, we seem to use 4 processes to go through the pool of requests to be made.
Problem reported is actually happening on the sequential application (WorkQueueManager); where a DBSReader (and DbsApi) instance is created: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33
and then reused through all the datasets to be handled (in different http requests though). I don't see where the cache gets deleted as well, so the same DBS object is used among different component cycles.
Alan, If we use the pool connection, then we will have a limit on the pool size of 4 or 5. However, I still don't see how the problem will be solved by introduce the pool. I reuse my connections with different APIs, but I might not do as much times as workQueueManager. Any ideas on how frequent this error happened? or after how many time the error showed up? can we reproduce the error in a simple code?
Yuyi, I haven't looked at the internals of pycurl. What I understood is that each http request would use a different (idle) connection, such that the same object isn't shared among different client requests. If that's correct, then I assume the connection pool should have at least the same length of the number of cherrypy threads.
About reproducing this error, I unfortunately don't have any recipe. It happens a few 10s of times per day. Grepping the global workqueue logs, you can see how many times it happened over the last 3 days:
amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190613.log | wc -l
28
amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190612.log | wc -l
18
amaltaro@vocms055:/build/amaltaro/WMCore $ grep 'cannot invoke setop' /build/srv-logs/vocms0740/workqueue/workqueue-20190611.log | wc -l
62
Alan, What I understood was that each client held their own pool. So you and I would have different pools if we both connect to DBS at the same time. Then if we both held a pool with the length is the number of cherrypy threads, we will exceed the total number of threads?
This is how WorkQueueManager make DBS connection. Could you pointed me where the code invoke setop?
I think you're mixing different pool concepts. One pool can be on server side and another on client side. My changes create the later, i.e. the client who will get RestApi
class instance will get a pool of connection. The pool is dynamic in size, i.e. it will grow to a number of concurrent/independent/parallel invocation of execute API. In other words, the new curl object is added to pool within execute
API if it does not exists. Therefore, in your client if you call execute
from different threads/processes the pool will grow up to that number of threads/processes. I think it is desired behavior. Of course, we can constraint pool to a certain size, but then in order to get connection from it client code will need to wait when it will be available.
@vkuznet Could you add pool size ? And address the question: why connection pool could fix the problem ? The error was from the sequential application (WorkQueueManager).
Yuyi, I added curl pool size as an external parameter and I changed curl retrieval from the pool to be random. Valentin.
On 0, Yuyi Guo notifications@github.com wrote:
@vkuznet Could you add pool size ? And address the question: why connection pool could fix the problem ? The error was from the sequential application (WorkQueueManager).
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/pull/610#issuecomment-502119087
@vkuznet Valentin, Thanks for the code updating. I still did not get how this will solve the problem? Yuyi
Yuyi, I think the error message is clear: ERROR:CherryPyPeriodicTask:Periodic Thread ERROR pycurl.error cannot invoke setopt() - perform() is currently running
It means that in existing code when dbs client creates a new pycurl object there is another one in another thread. The first object tries to apply setopt, while object in another thread execute perform(). Since DBS code creates all the time pycurl object the pool will ensure that this will not happen, i.e. object will be created up to the pool size and then reused, meaning that there will be no need to call setopt() again on them.
V.
On 0, Yuyi Guo notifications@github.com wrote:
@vkuznet Valentin, Thanks for the code updating. I still did not get how this will solve the problem? Yuyi
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/pull/610#issuecomment-503598425
@vkuznet Valentin, Please see below Alan's comment. It is a single thread/process. " Problem reported is actually happening on the sequential application (WorkQueueManager); where a DBSReader (and DbsApi) instance is created: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33 " You will find his entire comment in this thread. He made the comment about a week ago.
Yuyi
Yuyi, the code you point is telling nothing about usage of dbsClient, while Alan explicitly said quoting: "in different http requests though". This is where you need to dig since it is how http requests are done. If you provide pointer to the code in WorkQueue which do http requests then we/you can investigate.
@vkuznet @amaltaro What I understood "in different http requests though" meant that Alan used the same DBS client connection to call different DBS APIs sequentially. Alan, Could you please point out the code in WorkQueue which do http requests then we/you can investigate as Valentine asked? Yuyi
The traceback is avaialble in the issue opened against WMCore, for full information.
And here is the place where the actual DBS request is made (a simple serverinfo call to evaluate which DBS instance we're talking to): https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataLocationMapper.py#L38
For completeness of what I said before, these DBS calls are made in a sequential mode, but I figured that both local and global workqueue are multithreaded. My original report was based on the global workqueue, which runs mostly 2 cherrypy threads performing distinct operations, e.g.: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/ReqMgrInteractionTask.py#L28 https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/LocationUpdateTask.py#L23
but diving in the code, they seem to use the same DBSReader and DbsApi object, using this utilitarian function that does caching: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33
If I got all this right, making sure that each thread has its own DBSReader/DbsApi instance would solve this problem. Did I miss anything?
I think each thread has its own DBSApi object should solve the problem. Could you please test in one of the workqueue?
If app is multithreaded then it explain the issue. And for that I should tweak further the DBS RestApi to make it thread safe. So far it is not since RestApi invokes HTTPRequest and this is where things may go through race conditions in threads, see this function https://github.com/dmwm/DBS/blob/master/PycurlClient/src/python/RestClient/RequestHandling/HTTPRequest.py#L49
The call setup new headers for every request, that means that if we use the same object in different threads we may end-up in situation where one thread will call perform, while another will invoke setop for passed curl_object.
So, in RestApi I should introduce pool which will be thread safe, e.g. it will be a dict which hold one curl object per thread.
Then, in each thread this object will be only called sequentially.
I can do changes next week.
On 0, Alan Malta Rodrigues notifications@github.com wrote:
The traceback is avaialble in the issue opened against WMCore, for full information.
And here is the place where the actual DBS request is made (a simple serverinfo call to evaluate which DBS instance we're talking to): https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataLocationMapper.py#L38
For completeness of what I said before, these DBS calls are made in a sequential mode, but I figured that both local and global workqueue are multithreaded. My original report was based on the global workqueue, which runs mostly 2 cherrypy threads performing distinct operations, e.g.: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/ReqMgrInteractionTask.py#L28 https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/GlobalWorkQueue/CherryPyThreads/LocationUpdateTask.py#L23
but diving in the code, they seem to use the same DBSReader and DbsApi object, using this utilitarian function that does caching: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueueUtils.py#L33
If I got all this right, making sure that each thread has its own DBSReader/DbsApi instance would solve this problem. Did I miss anything?
-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/dmwm/DBS/pull/610#issuecomment-503823129
I agree with Valentin. Having DbsApi thread-safe will be important for future larger activities. I'm also going to reopen the WMCore issue such that I can address it once I get back from vacation.
Sounds a good plan, Valentin and Alan. I am very curious that if we still see the same error with the current version when each thread has its own DBS connection object. Alan, could give this a try? Yuyi
Valentin, I came back to this PR. Could you answer the question I wrote in the code comment? How big the pool is going to be? Each requester has its own pool? In other words, a pool is not shared between different requesters?
Yuyi,
the pool is a member data of RestApi
class, it means that each instance of RestApi
class will have their own pool. It also means that of this instance will be used in different threads it will not overlap with other instances since it will carry its own curl pool. Therefore other calls to curl object will not affect each other since curl object within pool belong to concrete instance.
So, it is not about requester, it is about instances of the class. A single requestor may create multiple instance of RestApi
and each instance will carry its own pool. The pool will not be shared among different instances of RestApi
class.
Thanks Valentin ! Could you limit the max pool size for each instances? I will have the NTAS PR today and if it goes well, I may get this PR in for the HG2002 too if Alan can test it. Yuyi
Yuyi, I made it configurable already, please see
https://github.com/dmwm/DBS/blob/cbe3f691f487738ed7bc67eb5343dcb7bf373a15/PycurlClient/src/python/RestClient/RestApi.py#L11
the RestApi contains curl_pool_size
parameter which others can use to setup whatever they feel necessary for their use case. Do you want to change the default?
I put 10 as default value of curl_pool_size
.
Thanks, Valentin !
@vkuznet
I am picking up this again and plan to put it in the May release. I have two questions for your PR.
Q1
def getCurl(self): "Fetch one curl object form the pool or create a new one" if len(self.curl_pool): idx = random.randint(0, len(self.curl_pool)-1) curl = self.curl_pool[idx] else: curl = self.newCurl() return curl
How this getCurl will be thread safe? curl_pool starts with 0 object, then when new ones created and used by execute function, they will be put in the curl_pool. Let's say if the curl_pool has two objects after a while. Then two threads try to get objects from the curl_pool. What will make them to get a different object instead of same one? This is 50% to get one of them.
Q2:
def execute(self, http_request): "Execute given http_request with available curl instance" curl = self.getCurl() res = http_request(curl) if self.curl_pool_size: if len(self.curl_pool) < self.curl_pool_size: self.curl_pool.append(curl) else: self.curl_pool.append(curl) return res
What is the difference curl_pool_size none or 10? if None, the pool size is unlimited?
Thanks, Yuyi
Yuyi,
you are mixing two different concepts here, the pool
within the class and class thread-safeness. The curl pool
here refers to class RestApi
, i.e. this class holds the pool. Therefore if you use one instance of this class per thread then there is no issues with the pool, while if you use RestApi
instance of the class across many threads then the pool (as the class itself) is not thread-safe. Therefore within a thread, the RestApi instance has its own pool and it will speed up things as curl objects within the pool will be reused.
If you want to make RestApi
class thread-safe then additional changes will be required, e.g. add a Lock to curl pool.
The same applies to the execute
method of the RestApi
class. It is only thread-safe if there is one instance of the class.
The point is that a single RestApi
class instance can serve multiple requests though. Therefore the pool will be used for them, i.e. all your requests served by the RestApi
class instance will go through its own curl pool.
And curl_pool_size
controls how large is your pool. If you will not providecurl_pool_size
your pool will be unlimited, i.e. the pool will grow with each request.
@vkuznet Valentin,
I understood your explanation on the thread safe issue, but how this PR will solve @amaltaro 's original problem. He was running multiple threads and inside a thread the operation was sequential. So 1. thread safe is required. 2. pool is nice to have, but it will not help him since the operation is sequential. Am I right?
If curl_pool_size is none, may the unlimited pool cause memory leaks on the client side? I think if curl_pool_size is none, we make it no pool just one curl object or pool size 1.
Do you think what I suggested make sense? If so, could you please update the PR?
Yuyi, I don't have time now to work on this.
As I wrote you should discuss with Alan if thread-safe code is required. This is up to application which uses RestApi
, i.e. if one instance of the class is used across many threads than we need to have thread-safe code, if not that it is not required. As I wrote to make code thread safe we need to add Locks to the pool such that curl object from the pool will not be modified between threads. But thread-safeness has nothing to do with sequential/concurrent issue you're talking about. You can still use threads and be sequential or concurrent. In latter case more changes are required to run concurrent requests.
Regarding pool, I provided sensitive default 10. If you want to change code when no pool size is provided then you should modify code accordingly, I don't have time for that anymore.
@vkuznet Thanks for your reply. @amaltaro Is thread safe are required for this PR? Thanks, Yuyi
Jenkins results:
Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-DBS-PR-test/113/artifact/artifacts/PullRequestReport.html