dmwm / PHEDEX

CMS data-placement suite

subscription api timing out on large dataset #1079

Open vlimant opened 7 years ago

vlimant commented 7 years ago

I cannot get https://cmsweb.cern.ch/phedex/datasvc/json/prod/subscriptions?block=/Neutrino_E-10_gun/RunIISpring15PrePremix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v2-v2/GEN-SIM-DIGI-RAW%23*&node=T2_DE_DESY&collapse=n to complete without timing out, so Unified cannot programmatically identify the location of the pileup. Is there a way to break the request down further so that it completes?

FYI @areinsvo @sidnarayanan

nataliaratnikova commented 7 years ago

The Oracle DBAs reported a query plan instability issue for the subscription query, which may hinder the performance in unpredictable ways.

nataliaratnikova commented 7 years ago

Hi Jean-Roch, I can reproduce the 502 error after a ~5 min wait with the wildcard query.
It works fine if the full block name is specified. It also returns almost immediately when the block is replaced by the dataset in the query. If pileup samples are subscribed at dataset level, then perhaps a dataset-based query like the one below [1] would be good for your check? You can get all blocks in the dataset from the data API [2], or all of the dataset's blocks at a given node from the blockreplicas API [3].

Meanwhile I will follow up with Kate on the performance issue. She mentioned at yesterday's compops meeting that she saw about 100 concurrent sessions for the subscriptions query. If this is initiated by the Unified scripts, could you point me to the corresponding code? I'd like to see if there is a way this could be optimized. Thanks, Natalia.

[1] https://cmsweb.cern.ch/phedex/datasvc/json/prod/subscriptions?dataset=/Neutrino_E-10_gun/RunIISpring15PrePremix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v2-v2/GEN-SIM-DIGI-RAW&node=T2_DE_DESY&collapse=n
[2] https://cmsweb.cern.ch/phedex/datasvc/doc/data
[3] https://cmsweb.cern.ch/phedex/datasvc/doc/blockreplicas
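For reference, the difference between the wildcard block query and the dataset-level query [1] can be sketched as below. This is only an illustration: the helper name `subscriptions_url` is hypothetical and not part of Unified or PhEDEx, and the only query parameters used (`block`, `dataset`, `node`, `collapse`) are the ones that appear in the URLs in this thread.

```python
from urllib.parse import urlencode

# Base URL taken from the links in this thread.
BASE = "https://cmsweb.cern.ch/phedex/datasvc/json"


def subscriptions_url(node, dataset=None, block=None, collapse="n", instance="prod"):
    """Build a subscriptions query URL (hypothetical helper, for illustration)."""
    params = {}
    if dataset is not None:
        params["dataset"] = dataset
    if block is not None:
        params["block"] = block
    params["node"] = node
    params["collapse"] = collapse
    # Keep "/" and "*" literal as in the URLs above; "#" is encoded as %23
    # so that it is not interpreted as a URL fragment.
    return f"{BASE}/{instance}/subscriptions?" + urlencode(params, safe="/*")


DATASET = (
    "/Neutrino_E-10_gun/"
    "RunIISpring15PrePremix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v2-v2/"
    "GEN-SIM-DIGI-RAW"
)

# Wildcard over all blocks: the form reported to time out.
wildcard_url = subscriptions_url("T2_DE_DESY", block=DATASET + "#*")

# Dataset-level query: the form reported to return quickly.
dataset_url = subscriptions_url("T2_DE_DESY", dataset=DATASET)
```

The two URLs differ only in using `dataset=...` instead of `block=...%23*`, so switching a caller from one to the other should be a small change if the pileup is subscribed at dataset level.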

vlimant commented 7 years ago

Hi @nataliaratnikova, are you suggesting that instead of the wildcard search I first list the blocks and then make one PhEDEx call per block? I can do that of course, no problem, but I am unsure how much load this will put on datasvc.

Unified does not make concurrent calls to the subscriptions API. @sidnarayanan might be able to say more about the transfer team, and @yiiyama about Dynamo. Is there a way you can trace the IP from which the numerous concurrent calls are coming?

sidnarayanan commented 7 years ago

AFAIK the transfer team should not be making 100 concurrent calls to the subscriptions (or any) API.

yiiyama commented 7 years ago

Dynamo can issue up to 64 concurrent blockreplicas queries, but it shouldn't be using subscriptions. I will double-check, but it would indeed be great to know the IP.

nataliaratnikova commented 7 years ago

I found ~40K hits coming from the MIT server, which constitute the majority of all calls to the subscriptions API. However, this is not necessarily the cause of the problem with large datasets that Jean-Roch reported here. I will investigate further.

vlimant commented 7 years ago

The MIT server is indeed Dynamo, @yiiyama. @nataliaratnikova, should I switch to two-stage queries (data => subscriptions)?
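A minimal sketch of what such a two-stage query could look like, under several assumptions that should be checked against the data API documentation [2]: that the data API accepts `dataset` and `level=block`, and that its JSON response nests block names as `phedex -> dbs -> dataset -> block -> name`. The function names are hypothetical, and any authentication that cmsweb may require is left out. Calls are made sequentially to keep the load on datasvc low.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://cmsweb.cern.ch/phedex/datasvc/json/prod"


def fetch_json(url, timeout=60):
    """Fetch a URL and parse the JSON body (no authentication handling)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def block_names(data_response):
    """Extract block names from a data API response.

    ASSUMPTION: layout is phedex -> dbs[] -> dataset[] -> block[] -> name.
    """
    names = []
    for dbs in data_response.get("phedex", {}).get("dbs", []):
        for dataset in dbs.get("dataset", []):
            for block in dataset.get("block", []):
                names.append(block["name"])
    return names


def subscriptions_per_block(dataset, node, fetch=fetch_json):
    """Stage 1: list blocks via the data API. Stage 2: one subscriptions call per block.

    Returns a dict mapping each full block name to the raw parsed response,
    so no assumption is made here about the subscriptions response layout.
    """
    # ASSUMPTION: "level=block" limits the data API output to block granularity.
    data_url = f"{BASE}/data?" + urlencode({"dataset": dataset, "level": "block"}, safe="/")
    result = {}
    for name in block_names(fetch(data_url)):
        # Full block name, no wildcard; "#" is encoded as %23.
        sub_url = f"{BASE}/subscriptions?" + urlencode(
            {"block": name, "node": node, "collapse": "n"}, safe="/"
        )
        result[name] = fetch(sub_url)  # sequential on purpose
    return result
```

The `fetch` argument exists so that the logic can be exercised with canned responses without contacting cmsweb. For a dataset with many blocks this issues one request per block, so the single dataset-level query [1] is likely the cheaper option whenever the subscription was made at dataset level.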