fwyzard closed this issue 7 years ago.
By the way, I'm trying to run the equivalent aggregation offline, with something like
for RUN in 259721; do
    for DS in /ZeroBias{1..4}/Run2015D-v1/RAW; do
        for FILE in $(das_client.py --limit 0 --query "file dataset=$DS run=$RUN"); do
            N=$(das_client.py --limit 0 --query "file=$FILE | grep file.nevents")
            printf "%16d%16d %s\n" "$RUN" "$N" "$FILE"
        done
    done
done
but this is extremely slow, order of 10 seconds or more per file.
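One client-side way to cut the wall-clock time of that loop is to run the per-file lookups concurrently instead of serially. A minimal sketch follows; `lookup_nevents` and `list_counts` are hypothetical helper names, and the stub body must be replaced by the real `das_client.py` call shown in the comment:

```shell
# Hypothetical stub: replace the body with the real lookup, e.g.
#   das_client.py --limit 0 --query "file=$1 | grep file.nevents"
lookup_nevents() {
    echo 7000
}

# Fan the per-file lookups out as background jobs instead of running
# them one at a time, then wait for all of them to finish.
list_counts() {
    for FILE in "$@"; do
        printf '%16d %s\n' "$(lookup_nevents "$FILE")" "$FILE" &
    done
    wait
}

list_counts /store/a.root /store/b.root /store/c.root
```

Output order is nondeterministic because the jobs run in parallel; pipe through `sort -k2` if a stable listing matters.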
Andrea, you need to use this query
summary dataset=/ZeroBias1/Run2015D-v1/RAW run=259721
and it will provide you with a summary of the dataset:
Number of blocks: 2, Number of events: 2290740, Number of files: 326, Number of lumis: 414, Sum(file_size): 855.7GB
The reason we ended up with a dedicated query is the structure of the DBS APIs. The one you were using does not look up file attributes; it only provides file names. The DBS folks created another API, which I use in DAS to provide summaries for datasets. Please review and close the ticket. Best, Valentin.
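For scripting, the event count can be extracted from that human-readable summary line with a little sed. This is only a parsing sketch over the exact line shown above; `nevents_from_summary` is a name I made up:

```shell
# The summary line as DAS prints it (copied from the output above).
summary='Number of blocks: 2, Number of events: 2290740, Number of files: 326, Number of lumis: 414, Sum(file_size): 855.7GB'

# Pull out the integer that follows "Number of events:".
nevents_from_summary() {
    printf '%s\n' "$1" | sed -n 's/.*Number of events: \([0-9][0-9]*\).*/\1/p'
}

nevents_from_summary "$summary"   # prints 2290740
```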
Andrea Bocci (notifications@github.com) wrote:
Hi, I would like to be able to run a query to get the number of events in one or more datasets, for a given run. I would guess something like this
file dataset=/ZeroBias1/Run2015D-v1/RAW run=259721 | sum(file.nevents)
but it clearly does not work, as I get
sum(file.nevents)=N/A
Hi Valentin, thanks; indeed
summary dataset=/ZeroBias1/Run2015D-v1/RAW run=259721 | grep summary.nevents
works as expected.
Would it be possible to also support something like
summary dataset=/*/Run2015D-v1/RAW run=259721 | grep summary.nevents
?
Andrea, no, patterns are not accepted for summary queries. This is DBS policy, since patterns lead to full table scans on the most populated tables (dataset, block, file), and users usually provide very poor patterns. Your example shows exactly this: the query would force a scan over all datasets, blocks, and files. Best, Valentin.
But then I have to do exactly the same scan "by hand" in a script, which will result in the same (or higher) load on DBS, no?
No, the load is not the same. The full table scan needs to merge in memory O(100K) datasets, O(10-100M) files, and O(1M) blocks in order to answer your query, while if you provide full dataset paths, DBS only handles 1 dataset + N blocks + N files. As you can see, these are WAY different orders of magnitude. Sending multiple queries to DBS is not the same as sending 1 query which needs to do all the merging. And from your example you only need 4 datasets, /ZeroBias{1..4}/Run2015D-v1/RAW, and maybe some runs. Instead of your original 3 for loops, you now only need 2 loops, one over runs and one for the summary. No files involved.
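The two-loop structure described above can be sketched as follows. This is a sketch only: `das_summary` is a hypothetical stub, and the real invocation it stands in for is shown in the comment:

```shell
# Hypothetical stub; the real call would be something like
#   das_client.py --limit 0 --query "summary dataset=$1 run=$2 | grep summary.nevents"
das_summary() {
    echo 2290740
}

# One loop over runs, one over explicitly named datasets.
# No per-file queries are involved at all.
for RUN in 259721; do
    for DS in /ZeroBias1/Run2015D-v1/RAW /ZeroBias2/Run2015D-v1/RAW \
              /ZeroBias3/Run2015D-v1/RAW /ZeroBias4/Run2015D-v1/RAW; do
        printf '%16d%16d %s\n' "$RUN" "$(das_summary "$DS" "$RUN")" "$DS"
    done
done
```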
Thanks, I always appreciate it when people answer my feature requests by telling me what I need instead.
DAS could be made exactly as smart as it needs to be to do the "limited" query. If I query DAS with
summary dataset=/ZeroBias*/Run2015D-v1/RAW run in [258425, 259626, 259721, 260627]
it has the exact same information I have, and can use it to:
- first make a query to get the list of matching datasets
- then make a query to restrict the datasets to those available for those runs
- finally, make the "summary" request
Actually, DAS could probably combine the first two steps, if it makes the overall query faster.
Yes, I can implement the same logic in a client-side script. But please do not try to tell me that doing it client side is faster and poses less load on the system.
.Andrea
Andrea, I already made DAS smart enough about fetching data from different DBs and data-services, and about handling different data formats, notations, etc. And yet users always want more :) But what you outline runs into a simple problem: time. Since your client uses an HTTP connection, it can simply time out while DAS does your smart thing. Doing this in a client is equivalent to doing it in the DAS server, but the client is not constrained by an HTTP timeout.
But DAS does support summary queries with run lists, e.g. summary dataset=/SingleMu/Run2011B-WMu-19Nov2011-v1/RAW-RECO run in [177718, 177053] works just fine.
The only missing part is the pattern in the dataset name, which basically means you only need one loop, not even two.
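Building the `run in [...]` clause from a shell list of runs is mechanical; a small sketch (the helper name `runs_clause` is mine, not part of DAS):

```shell
# Join the given run numbers with ", " and wrap them in "run in [...]".
runs_clause() {
    printf 'run in [%s]' "$(printf '%s, ' "$@" | sed 's/, $//')"
}

runs_clause 177718 177053   # prints: run in [177718, 177053]
# e.g. usage:
#   das_client.py --query "summary dataset=/SingleMu/Run2011B-WMu-19Nov2011-v1/RAW-RECO $(runs_clause 177718 177053)"
```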
Hi guys, just wanted to add my 2ct: this is indeed a problem that I've come across more than once: getting a list of datasets along with the number of events for a pattern. I would already be happy if there was a section in the FAQ or in the twiki with a minimal example. Some little script that I can paste into my shell. How about that?
EDIT: Thanks, of course, for the DAS development, it's an awesome tool! =)
Heiner, it really depends on the end-user; I don't know how you organize your work. Do you work with bash, tcsh, python, the web, or something else?
I gave you a recipe:
1. Perform a query to find the list of datasets, e.g. dataset=/your_pattern_here
   das_client.py --query="dataset=/your_pattern_here"
2. Place another set of queries with your favorite runs, e.g. summary dataset=/a/b/c run in [1,2,3], summary dataset=/c/d/e run in [2,3,4]
   das_client.py --query="summary dataset=/a/b/c run in [1,2,3]"
Please note that for step #2 you can either use das_client.py as is, or use the --JSON option to get the data back in JSON format.
You can write these steps in bash, tcsh, or python. You may be interested in looking up data from different instances as well, in which case your query will need to be modified to use an instance=XXX clause.
So your mileage may vary, and I don't think it is up to me to write your scripts for you.
But you can even bypass DAS entirely, to avoid its latency, and perform all steps against DBS directly using the DBS APIs, for instance with curl.
It really depends on what you want to do and how you want to do it.
Resolution has been provided in terms of a multiple-query look-up.
which is just a way of saying to the users "you are on your own"
Andrea, you can draw your own conclusion, but implementing the full stack of steps in DAS is not reliable for a simple reason: latency. If I take your example:
summary dataset=/ZeroBias*/Run2015D-v1/RAW run in [258425, 259626, 259721, 260627]
and implement all steps, such a query will most likely take minutes and time out. The problem is that a user may supply a very loose pattern in the first step, e.g. /*/*/RAW. This means that DAS would need to get N datasets from DBS (where N is large), then loop over them and send as many as N requests again to get the summaries. As much as I want to implement this, I can't, to avoid a DDoS attack on DBS just because users are LAZY to know their datasets.
Maybe users are not lazy; they would simply like to know how many events are available in a given list of datasets for a given list of runs?
Yes, I can wrap access to DAS in some kind of script:
DS="/ZeroBias*/Run2015D-v1/RAW"
RUNS="258425, 259626, 259721, 260627"
for D in $(das_client.py --query "dataset dataset=$DS run in [$RUNS]" --limit 0); do
    printf "%s\t%s\n" "$D" "$(das_client.py --limit 0 --query "summary dataset=$D run in [$RUNS] | grep summary.nevents")"
done
to get my answer:
/ZeroBias/Run2015D-v1/RAW 3746654
/ZeroBias1/Run2015D-v1/RAW 9768630
/ZeroBias2/Run2015D-v1/RAW 9769884
/ZeroBias3/Run2015D-v1/RAW 9770423
/ZeroBias4/Run2015D-v1/RAW 9770209
While I still don't see why it is OK if I do this, and it is a problem if DAS does it for me, there is probably some deep issue I don't understand.
(yes, it is rather slow, though I guess it could be done much faster if I ran all but the first DAS query in parallel)
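The parallel variant hinted at here can be sketched with background jobs. Again, `das_summary` and `parallel_summaries` are hypothetical names; the stub stands in for the real `das_client.py` summary query shown in the comment:

```shell
# Hypothetical stub; swap in the real call, e.g.
#   das_client.py --limit 0 --query "summary dataset=$1 run in [$RUNS] | grep summary.nevents"
das_summary() {
    echo 9768630
}

# Run the first (dataset-list) query serially, then issue all the
# per-dataset summary queries concurrently and wait for them.
parallel_summaries() {
    for D in "$@"; do
        printf '%s\t%s\n' "$D" "$(das_summary "$D")" &
    done
    wait
}

parallel_summaries /ZeroBias1/Run2015D-v1/RAW /ZeroBias2/Run2015D-v1/RAW
```

As with any fan-out, the output order is nondeterministic, and a real script should cap the number of concurrent queries to stay polite to the server.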
Andrea,
what if a user does this: summary dataset=/*/*/*RAW run in [RUNS]? This is the same type of query you have, i.e. it contains a pattern, but the pattern IS VERY LOOSE. In your case we need to look up 5 datasets, but in my example we would have 19131 datasets instead of 5. Now, if your query is slow when you run your script, imagine how slow it would be to look up N thousand datasets.
The point is that wild-cards are hard to support in the general case, and from a support point of view we should either support them or not.
Since I can't guarantee that users won't use loose patterns, I can't permit such queries on the server.
That's why I'm resistant to implementing the logic in the server. You may argue, or even assure me that you'll be careful, etc., but we can't guarantee that in some instance a user won't place a loose query which causes the server to get stuck. Therefore I still think it should be done on the client side rather than the server. Clients know explicitly what they are doing, and we can trace down such clients and help fix issues.
Best, Valentin.
Actually, the first check, das_client.py --query "dataset dataset=$DS run in [$RUNS]" --limit 0,
would still limit the following queries to just 54 datasets, but I see your point.
I'm glad you understand the issue. So far I don't see how I can resolve this on the server side without putting it at risk from wild-card queries. I rather think we need to start developing a series of tools to address such issues on the client side. I'm working in this direction, using a DBS snapshot on HDFS where such aggregation can be done. And I have put this use case on my project list (https://gist.github.com/vkuznet/2372055e3a4b72731ddfa06882efd366).
Here, I'm closing the ticket with the output I already provided, i.e. it can be done as a series of queries on the client side.
The new project I'm aiming to delegate this issue to is titled "DBS/PhEDEx aggregation using Hadoop+Spark platform" on the gist.