fwyzard closed this issue 7 years ago.
By the way, I'm trying to run the equivalent aggregation offline, with something like
for RUN in 259721; do
    for DS in /ZeroBias{1..4}/Run2015D-v1/RAW; do
        for FILE in $(das_client.py --limit 0 --query "file dataset=$DS run=$RUN"); do
            N=$(das_client.py --limit 0 --query "file=$FILE | grep file.nevents")
            printf "%16d%16d %s\n" "$RUN" "$N" "$FILE"
        done
    done
done
but this is extremely slow, order of 10 seconds or more per file.
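One client-side way to cut the wall-clock time of that loop is to run the per-file lookups concurrently instead of serially. A minimal sketch follows; `lookup_nevents` and `list_counts` are hypothetical helper names, and the stub body must be replaced by the real `das_client.py` call shown in the comment:

```shell
# Hypothetical stub: replace the body with the real lookup, e.g.
#   das_client.py --limit 0 --query "file=$1 | grep file.nevents"
lookup_nevents() {
    echo 7000
}

# Fan the per-file lookups out as background jobs instead of running
# them one at a time, then wait for all of them to finish.
list_counts() {
    for FILE in "$@"; do
        printf '%16d %s\n' "$(lookup_nevents "$FILE")" "$FILE" &
    done
    wait
}

list_counts /store/a.root /store/b.root /store/c.root
```

Output order is nondeterministic because the jobs run in parallel; pipe through `sort -k2` if a stable listing matters.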
Andrea, you need to use this query
summary dataset=/ZeroBias1/Run2015D-v1/RAW run=259721
and it will provide you with a summary of the dataset:
Number of blocks: 2, Number of events: 2290740, Number of files: 326, Number of lumis: 414, Sum(file_size): 855.7GB
The reason we ended up with a dedicated query is the structure of the DBS APIs. The one you were using does not look up file attributes; it only provides file names. The DBS folks created another API, which I use in DAS to provide summaries for datasets. Please review and close the ticket. Best, Valentin.
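For scripting, the event count can be extracted from that human-readable summary line with a little sed. This is only a parsing sketch over the exact line shown above; `nevents_from_summary` is a name I made up:

```shell
# The summary line as DAS prints it (copied from the output above).
summary='Number of blocks: 2, Number of events: 2290740, Number of files: 326, Number of lumis: 414, Sum(file_size): 855.7GB'

# Pull out the integer that follows "Number of events:".
nevents_from_summary() {
    printf '%s\n' "$1" | sed -n 's/.*Number of events: \([0-9][0-9]*\).*/\1/p'
}

nevents_from_summary "$summary"   # prints 2290740
```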
Andrea Bocci (notifications@github.com) wrote:
Hi, I would like to be able to run a query to get the number of events in one or more datasets, for a given run. I would guess something like this
file dataset=/ZeroBias1/Run2015D-v1/RAW run=259721 | sum(file.nevents)
but it clearly does not work, as I get
sum(file.nevents)=N/A
Hi Valentin, thanks; indeed
summary dataset=/ZeroBias1/Run2015D-v1/RAW run=259721 | grep summary.nevents
works as expected.
Would it be possible to also support something like
summary dataset=/*/Run2015D-v1/RAW run=259721 | grep summary.nevents
?
Andrea, no, patterns are not accepted for summary queries. This is DBS policy, since patterns lead to full table scans on the most populated tables (dataset, block, file), and users usually provide very poor patterns. Your example shows exactly this: the query would force a scan over all datasets, blocks, and files. Best, Valentin.
But then I have to do exactly the same scan "by hand" in a script, which will result in the same (or higher) load on DBS, no?
No, the load is not the same. The full table scan needs to merge in memory O(100K) datasets, O(10-100M) files, and O(1M) blocks in order to answer your query, while if you provide full dataset paths, DBS only handles 1 dataset + N blocks + N files. As you can see, these are WAY different orders of magnitude. Sending multiple queries to DBS is not the same as sending 1 query which needs to do all the merging. And from your example you only need 4 datasets, /ZeroBias{1..4}/Run2015D-v1/RAW, and maybe some runs. Instead of your original 3 for loops, you now only need 2 loops, one over runs and one for the summary. No files involved.
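The two-loop structure described above can be sketched as follows. This is a sketch only: `das_summary` is a hypothetical stub, and the real invocation it stands in for is shown in the comment:

```shell
# Hypothetical stub; the real call would be something like
#   das_client.py --limit 0 --query "summary dataset=$1 run=$2 | grep summary.nevents"
das_summary() {
    echo 2290740
}

# One loop over runs, one over explicitly named datasets.
# No per-file queries are involved at all.
for RUN in 259721; do
    for DS in /ZeroBias1/Run2015D-v1/RAW /ZeroBias2/Run2015D-v1/RAW \
              /ZeroBias3/Run2015D-v1/RAW /ZeroBias4/Run2015D-v1/RAW; do
        printf '%16d%16d %s\n' "$RUN" "$(das_summary "$DS" "$RUN")" "$DS"
    done
done
```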
Thanks, I always appreciate it when people answer my feature requests by telling me what I need instead.
DAS could be made exactly as smart as it needs to be to do the "limited" query. If I query DAS with
summary dataset=/ZeroBias*/Run2015D-v1/RAW run in [258425, 259626, 259721, 260627]
it has the exact same information I have, and can use it to:
- first make a query to get the list of matching datasets
- then make a query to restrict the datasets to those available for those runs
- finally, make the "summary" request
Actually, DAS could probably combine the first two steps, if it makes the overall query faster.
Yes, I can implement the same logic in a client-side script. But please do not try to tell me that doing it client side is faster and poses less load on the system.
.Andrea
Andrea, I already made DAS smart enough about fetching data from different DBs and data-services, and about handling different data formats, notations, etc. And yet users always want more :) But what you outline runs into a simple problem: time. Since your client uses an HTTP connection, it can simply time out while DAS does your smart thing. Doing this in a client is equivalent to doing it in the DAS server, but the client is not constrained by an HTTP timeout.
But DAS does support summary queries with run lists, e.g. summary dataset=/SingleMu/Run2011B-WMu-19Nov2011-v1/RAW-RECO run in [177718, 177053] works just fine.
The only missing part is the pattern in the dataset name, which basically means you only need one loop, not even two.
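Building the `run in [...]` clause from a shell list of runs is mechanical; a small sketch (the helper name `runs_clause` is mine, not part of DAS):

```shell
# Join the given run numbers with ", " and wrap them in "run in [...]".
runs_clause() {
    printf 'run in [%s]' "$(printf '%s, ' "$@" | sed 's/, $//')"
}

runs_clause 177718 177053   # prints: run in [177718, 177053]
# e.g. usage:
#   das_client.py --query "summary dataset=/SingleMu/Run2011B-WMu-19Nov2011-v1/RAW-RECO $(runs_clause 177718 177053)"
```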
Hi guys, just wanted to add my 2ct: this is indeed a problem that I've come across more than once: getting a list of datasets along with the number of events for a pattern. I would already be happy if there was a section in the FAQ or in the twiki with a minimal example. Some little script that I can paste into my shell. How about that?
EDIT: Thanks, of course, for the DAS development, it's an awesome tool! =)
Heiner, it really depends on the end-user; I don't know how you organize your work. Do you work with bash, tcsh, python, the web, or something else?
I gave you a recipe:
1. Perform a query to find the list of datasets, e.g. dataset=/your_pattern_here
   das_client.py --query="dataset=/your_pattern_here"
2. Place another set of queries with your favorite runs, e.g. summary dataset=/a/b/c run in [1,2,3], summary dataset=/c/d/e run in [2,3,4]
   das_client.py --query="summary dataset=/a/b/c run in [1,2,3]"
Please note that for step #2 you can either use das_client.py as is, or use the --JSON option to get the data back in JSON format.
You can write these steps in bash, tcsh, or python. You may be interested in looking up data from different instances as well, in which case your query will need to be modified to use an instance=XXX clause.
So your mileage may vary, and I don't think it is up to me to write your scripts for you.
But you can even bypass DAS entirely, to avoid its latency, and perform all steps against DBS directly using the DBS APIs, for instance with curl.
It really depends on what you want to do and how you want to do it.
Resolution has been provided in terms of a multiple-query look-up.
which is just a way of saying to the users "you are on your own"
Andrea, you can draw your own conclusion, but implementing the full stack of steps in DAS is not reliable for a simple reason: latency. If I take your example:
summary dataset=/ZeroBias*/Run2015D-v1/RAW run in [258425, 259626, 259721, 260627]
and implement all steps, such a query will most likely take minutes and time out. The problem is that a user may supply a very loose pattern in the first step, e.g. /*/*/RAW. This means that DAS would need to get N datasets from DBS (where N is large), then loop over them and send as many as N requests again to get the summaries. As much as I want to implement this, I can't, to avoid a DDoS attack on DBS just because users are LAZY to know their datasets.
Maybe users are not lazy; they would simply like to know how many events are available in a given list of datasets for a given list of runs?
Yes, I can wrap access to DAS in some kind of script:
DS="/ZeroBias*/Run2015D-v1/RAW"
RUNS="258425, 259626, 259721, 260627"
for D in $(das_client.py --query "dataset dataset=$DS run in [$RUNS]" --limit 0); do
    printf "%s\t%s\n" "$D" "$(das_client.py --limit 0 --query "summary dataset=$D run in [$RUNS] | grep summary.nevents")"
done
to get my answer:
/ZeroBias/Run2015D-v1/RAW 3746654
/ZeroBias1/Run2015D-v1/RAW 9768630
/ZeroBias2/Run2015D-v1/RAW 9769884
/ZeroBias3/Run2015D-v1/RAW 9770423
/ZeroBias4/Run2015D-v1/RAW 9770209
While I still don't see why it is OK if I do this, and it is a problem if DAS does it for me, there is probably some deep issue I don't understand.
(yes, it is rather slow, though I guess it could be done much faster if I ran all but the first DAS query in parallel)
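The parallel variant hinted at here can be sketched with background jobs. Again, `das_summary` and `parallel_summaries` are hypothetical names; the stub stands in for the real `das_client.py` summary query shown in the comment:

```shell
# Hypothetical stub; swap in the real call, e.g.
#   das_client.py --limit 0 --query "summary dataset=$1 run in [$RUNS] | grep summary.nevents"
das_summary() {
    echo 9768630
}

# Run the first (dataset-list) query serially, then issue all the
# per-dataset summary queries concurrently and wait for them.
parallel_summaries() {
    for D in "$@"; do
        printf '%s\t%s\n' "$D" "$(das_summary "$D")" &
    done
    wait
}

parallel_summaries /ZeroBias1/Run2015D-v1/RAW /ZeroBias2/Run2015D-v1/RAW
```

As with any fan-out, the output order is nondeterministic, and a real script should cap the number of concurrent queries to stay polite to the server.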
Andrea,
what if a user does this: summary dataset=/*/*/*RAW run in [RUNS]? This is the same type of query you have, i.e. it contains a pattern, but the pattern IS VERY LOOSE. In your case we need to look up 5 datasets, but in my example we would have 19131 datasets instead of 5. Now, if your query is slow when you run your script, imagine how slow it would be to look up N thousand datasets.
The point is that wild-cards are hard to support in the general case, and from a support point of view we should either support them or not.
Since I can't guarantee that users won't use loose patterns, I can't permit such queries on the server.
That's why I'm resistant to implementing the logic in the server. You may argue, or even assure me that you'll be careful, etc., but we can't guarantee that in some instance a user won't place a loose query which causes the server to get stuck. Therefore I still think it should be done on the client side rather than the server. Clients know explicitly what they are doing, and we can trace down such clients and help fix issues.
Best, Valentin.
Actually, the first check, das_client.py --query "dataset dataset=$DS run in [$RUNS]" --limit 0,
would still limit the following queries to just 54 datasets, but I see your point.
I'm glad you understand the issue. So far I don't see how I can resolve this on the server side without putting it at risk from wild-card queries. I rather think we need to start developing a series of tools to address such issues on the client side. I'm working in this direction, using a DBS snapshot on HDFS where such aggregation can be done. And I have put this use case on my project list (https://gist.github.com/vkuznet/2372055e3a4b72731ddfa06882efd366).
Here, I'm closing the ticket with the output I already provided, i.e. it can be done as a series of queries on the client side.
The new project I'm aiming to delegate this issue to is titled "DBS/PhEDEx aggregation using Hadoop+Spark platform" on the gist.