Many thanks for this.
I've just tried and I got:
a_q2 = tf.math.log(aa_q2, name="logq2")
File "/home/carrazza/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 5244, in log
_ops.raise_from_not_ok_status(e, name)
File "/home/carrazza/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: Could not find valid device for node.
Node:{{node Log}}
so I imagine a missing cast around aa_q2 is needed.
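Something like this might be the fix, a minimal sketch assuming aa_q2 arrives as an integer tensor (tf.math.log has no integer kernel, hence the NotFoundError):

```python
import tensorflow as tf

aa_q2 = tf.constant([2])                  # integer input, as in the failing call
aa_q2 = tf.cast(aa_q2, tf.float64)        # cast before the log
a_q2 = tf.math.log(aa_q2, name="logq2")   # now dispatches to a float kernel
```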
After fixing that I get:
pdf.xfxQ2(0, np.array([0.1]), np.array([2]))
File "/home/carrazza/repo/n3pdf/pdfflow/src/pdfflow/pflow.py", line 123, in xfxQ2
f_f = self._xfxQ2(u, a_x, a_q2)
...
tensorflow.python.framework.errors_impl.InvalidArgumentError: Indices and updates specified for empty output shape
[[{{node cond_1/else/_21/cond_12/else/_315/ScatterNd}}]] [Op:__inference_first_subgrid_26649]
Function call stack:
first_subgrid
I think you branched out from some PR before it was merged into master? I'd suggest taking the latest 3 commits as a patch, going back to master and branching from there.
Update: I am looking at the commit tree and I'm confused.
Anyway, @scarrazza where do you get the error? The singletop_lo.py is working for me.
I think so. I created a new branch from origin/realexample. Is it wrong doing this?
Also, I got the same error a couple of times in the past. I don't know why, and after re-running, the error fixed itself. I recall being really confused about that.
> I think so. I created a new branch from origin/realexample. Is it wrong doing this?
Probably there were some changes done to that branch before it was merged to master and you didn't have them locally. Or some of those changes didn't go into master. In any case, you branched from a past that no longer exists: very sci-fi, but a pain with git :P
Try merging master into this branch; maybe the differences are small enough not to be a problem and there are no conflicts.
Let me show you my local tree so I'm sure I understand. Am I missing any commits in the tree here? If not, can't we just rebase master onto my last commit?
Oh, wait, I think I'm starting to understand what's happening. One moment.
Ok, so I think there was a mix-up where realexample was merged to master before fixed was merged to realexample. Now everything is as it should be.
Ok, I have retried now, here are some highlights:
these are the numbers I am getting for the FK evaluation:
loading from file time (s): 74.13009548187256
dry run time (s): 96.14635920524597
total FK evaluation time (s): 15.626785039901733
The dry run is "perfect", dropping from 2k seconds to 96 s; however, the evaluation time (15.6 s) is now slower than LHAPDF (13.7 s).
Another small issue is that I cannot call a single flavour any more:
a.xfxQ2([0], np.array([0.1]), np.array([2.0]))
# or
a.xfxQ2(0, np.array([0.1]), np.array([2.0]))
The minimum setup needs a list with at least 2 flavours.
I have the feeling that retracing now takes much more time than before, e.g. the single top example now takes 183.27 s instead of 16.8 s.
Regarding the speed of the algorithm I have an idea: each PDF is processed sequentially by subgrid. This is done in pflow.py from line 81 to line 107. Although these steps must be kept separate because the algorithm is slightly different for each, they are fully independent of one another. Is there a way to run them together, like spawning different subprocesses and executing them independently? This would make everything faster, provided we have enough memory for it.
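To sketch what I have in mind (purely illustrative, not the pflow.py code; interpolate_subgrid and its stub body are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def interpolate_subgrid(subgrid, a_x, a_q2):
    # placeholder standing in for the per-subgrid interpolation in pflow.py
    return np.zeros_like(a_x)

def query_all_subgrids(subgrids, a_x, a_q2):
    # the subgrid queries are independent, so they could be dispatched concurrently
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(interpolate_subgrid, sg, a_x, a_q2) for sg in subgrids]
    return [f.result() for f in futures]
```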
I don't know why you are having the second issue: I run compare_accuracy_lhapdf.py, which takes just pid=21 as input, and it works properly.
The code could certainly be made cleaner with fewer operations, particularly by ensuring that we are not wasting time/resources in tf.cond, as @scarlehoff already pointed out.
Concerning point 2, the problem is only with pid=0 (21 works for me) and only on this branch; master works fine.
Ok, I wasn't understanding: it's just an interface problem. pdfflow looks up 0 in the flavour scheme given in the PDF file, can't find it, and then returns an empty tensor.
Should I allow querying pid=0 with a line such as:
if pid == 0: pid = 21
inside the pdfflow algorithm? A sketch of what I mean is below.
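Something along these lines (a sketch only; remap_gluon is just an illustrative name, and I'm assuming pid can be either a scalar or a list):

```python
def remap_gluon(pid):
    # LHAPDF labels the gluon as 21, but 0 is a common alias
    if isinstance(pid, (list, tuple)):
        return [21 if p == 0 else p for p in pid]
    return 21 if pid == 0 else pid
```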
Yes, I think this is a good idea; otherwise many applications that use 0 for the gluon will crash.
Fine, I'll implement it and commit soon. Do you have any comments about my idea in point one?
Concerning point one, I understand the idea, but it's still not clear to me why this PR is slower than master; both branches query the PDFs sequentially for the moment, right?
Yes, they do. I think then that points 1 and 3 are the same issue. Regarding the FK evaluation, we got roughly a 22x improvement in the dry run time, which is fine, but a 7x worsening in the evaluation itself. For the single top example we have roughly a 10x worsening, which is more or less consistent with the 7x above. I am struggling to see how such small changes in the algorithm can cause such large effects, but I'll try to figure it out.
I realized that if an input signature is not specified for every nested function called by the algorithm, tf.function keeps retracing them. However, I couldn't provide an input signature for the act_on_empty function, since its arguments are callables, not tensors. @scarlehoff, do you have any hints on how to prevent it from retracing so many times there?
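For illustration, this is the retracing behaviour I mean (a toy example, not the pdfflow code):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float64)])
def with_signature(x):
    return tf.math.log(x)

@tf.function
def without_signature(x):
    return tf.math.log(x)

x2 = tf.constant([1.0, 2.0], dtype=tf.float64)
x3 = tf.constant([1.0, 2.0, 3.0], dtype=tf.float64)

with_signature(x2)
with_signature(x3)       # shape [None] already covers this: no retrace
without_signature(x2)
without_signature(x3)    # new input shape -> another retrace
```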
In conclusion, this should speed up the dry run; please give it a try. No improvements on the execution side, though.
Sorry for the delay. The 0 is now working perfectly.
Here are the numbers for the latest commit:
loading time (s): 76.667320728302
dry run time (s): 85.03288006782532
total time (s): 14.76063323020935
The dry run is now slightly faster.
@marcorossi5, @scarlehoff so, summarizing: we need to bring the evaluation time back down to the original values. Do you agree?
Yes, but it's difficult to say exactly where we are losing time. The code is really too similar between the two versions to justify such a worsening.
I made a little improvement. I found that using tf.where performs better than manually selecting the two branches of an if conditional and using the act_on_empty function. I'll try to apply this method wherever possible to speed things up.
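A minimal illustration of the difference (toy code, not the pdfflow kernels):

```python
import tensorflow as tf

x = tf.constant([0.5, 2.0, 8.0], dtype=tf.float64)
inside = x < 1.0

# branchless selection: both expressions are computed element-wise and
# tf.where picks the right one, instead of routing the (possibly empty)
# pieces of the input through two separate conditional branches
result = tf.where(inside, tf.sqrt(x), tf.math.log(x))
```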
Unfortunately I see that the checks for this version have failed, but I don't know why. It works on my laptop.
The pytest error seems to be an issue with packages and imports. I also saw that during the test conda tries to install tf==2.1.0, but version 2.2.0 has been released and is much faster: on my laptop it cut the compare_accuracy_lhapdf.py execution time by 30%. How can we fix this error?
I guess the problem is tensorflow-probability, let me do a quick test.
I made a little change. @scarrazza can you try rerunning the FK evaluation please?
Thanks, now it takes 2s less:
loading time (s): 69.30477404594421
dry run time (s): 71.28993844985962
total time (s): 12.264929056167603
How much does lhapdf take in comparison?
I changed the algorithm. Can you try running the FK evaluation and see if it works better now? Thanks
Thanks, I have updated tf to 2.2 and now the dry run is really fast, but the evaluation time is still similar to the previous one:
loading time (s): 89.9633617401123
dry run time (s): 16.857575178146362
total time (s): 11.823127031326294
Is it fine for me to go through the code? (I don't want to change anything that could create a ton of conflicts if you have incoming commits!)
Yes, please.
The problem with single_top seems to be due to xfxQ2, which needs to be tf.function-compilable (and wasn't). I guess this will introduce some overhead, but it's a start. I'll have a look through the rest of the code this afternoon after teaching.
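For reference, a toy sketch of what "tf.function-compilable" means here (the body is a placeholder, not the real interpolation in pflow.py):

```python
import tensorflow as tf

@tf.function
def xfxQ2_graph(a_x, a_q2):
    # placeholder body; decorating the hot path makes it run as a compiled
    # graph instead of op-by-op in eager mode
    return tf.math.log(a_x) + tf.math.log(a_q2)

xfxQ2_graph(tf.constant([0.1], dtype=tf.float64), tf.constant([2.0], dtype=tf.float64))
```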
Now compilation takes longer I guess, but it was expected.
For now I did some benchmarks on my computer with the state of the latest commit.
VEGAS MC, ncalls=5000000:
Result for iteration 0: 55.5637 +/- 0.1984(took 24.69133 s)
Result for iteration 1: 55.5595 +/- 0.0255(took 1.09133 s)
Result for iteration 2: 55.5271 +/- 0.0167(took 1.25155 s)
Result for iteration 3: 55.5253 +/- 0.0156(took 1.23854 s)
Result for iteration 4: 55.5240 +/- 0.0154(took 1.23973 s)
> Final results: 55.5293 +/- 0.00859928
Memory: 4.5 GB
I also did a check substituting the PDF with a random number generator to see what the overhead of pdfflow was.
VEGAS MC, ncalls=5000000:
Result for iteration 0: 2252.3390 +/- 87.2739(took 11.89053 s)
Result for iteration 1: 2247.0104 +/- 6.9788(took 0.63306 s)
Result for iteration 2: 2246.6553 +/- 1.6158(took 0.65649 s)
Result for iteration 3: 2243.1900 +/- 1.2843(took 0.62217 s)
Result for iteration 4: 2245.6375 +/- 1.2404(took 0.60715 s)
> Final results: 2245 +/- 0.776164
Memory: 3.0 GB
The overhead in time seems to grow with the number of events (I guess it's memory driven).
The PDF computation takes more or less as long as the single_top LO itself, which I'd say is not that bad (if it is truly memory driven and we get the memory down, it means you can do even more points at once; right now my GPU suffers above 1e6 points at once).
It is hard to compare with LHAPDF; we would need to benchmark against the C++ code. @scarrazza, if you can upload some benchmark of the single top LO with your C++ code (which I guess uses LHAPDF), it would be very helpful.
My uninformed guess would be that the numbers we have are not bad: 5e6 events in 1 second seems good. The fairest comparison I can do on my computer is against some Fortran code for Vector Boson Fusion at LO, which takes ~40 seconds per iteration for the same number of events. Being Fortran, the memory usage is ~50 MB, but meh.
Thanks, here are my numbers for C++ with LHAPDF and GSL, single thread:
GSL-VEGAS stage:0 ncalls:1000000 = ( 5.560226e+01 +- 1.998375e-01 = 0.35941% ) pb
GSL-VEGAS stage:1 ncalls:1000000 = ( 5.551466e+01 +- 8.194880e-03 = 0.01476% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551506e+01 +- 4.166962e-03 = 0.00751% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551443e+01 +- 3.059669e-03 = 0.00551% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551723e+01 +- 2.516116e-03 = 0.00453% ) pb
[timer::stop] success: elapsed time 57.700657 seconds
If you remember the aMC@NLO numbers, 1 s is for sure ultra fast.
Out of curiosity, @marcorossi5 @scarrazza, do the function signatures change the performance of the code much in some situations?
I'd say the fewer arguments to a @tf.function-decorated function, the better. But it depends on the price you pay for passing fewer arguments: if it leads to a lot of extra ops to recompute what could have been passed as an argument, it may not be worth it. It's a tradeoff. This is just what I found empirically with pdfflow, nothing theoretically rigorous of course.
Mm, in principle it should only matter for the first calculation (which does compile). If retracing is triggered beyond that there are bigger problems.
I'm asking because on GPU I am actually seeing better performance if I don't assign the signature (in exchange for a few extra seconds of compilation). I'd be partial to removing them because of that, and because I think they make the code more prone to future problems (also, it seems tf is moving towards flexible shapes, which might mean 2.2 is better than before at figuring out the clever signatures on its own).
They should ensure retracing won't be triggered after the first iteration. In principle I think that they could be taken away.
I made a little test. The signature is in fact necessary: if I pass a tensor of shape [x] as an argument, a trace is triggered the first time. If no input_signature is specified and I then pass a tensor of shape [x+1], another retrace is triggered. Instead, if I specify an input_signature with shape [None], then no retrace happens whatever shape is passed.
Mm, it's tricky. I would argue that, as long as there is memory to spare, two retracings are better than one there if you are regularly calling the functions with two different (but recurring) tensor shapes.
More generally I guess the right thing to do is isolate the parts that can be called with more than one shape and ensure it doesn't happen. In any case let's leave them in for now.
Now, for organizational purposes, I would say we should merge this PR (the code does what it is supposed to do and it seems fast enough in the LO test). I think short-term tasks (optimization, implementation of extra features, fixing bugs or whatever) are better suited to separate PRs.
Ok for merging
I changed the algorithm to prevent retracing. Now the graph is built just once and for all, both across multiple replicas within a PDF set and across different PDF sets. Just to give an example: I loaded in the same program first NNPDF31_nlo_as_0118/0 and then MSTW2008nnlo90cl/0. On my laptop the first run took 61 s and the second one 8 s. Can you try this implementation with the FK script? If you don't find any bugs, I propose to merge. Let me know, thanks.
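For reference, this is the kind of test I ran (a sketch; I'm assuming a loader along the lines of mkPDF, so check the actual constructor in pflow.py before copying):

```python
import numpy as np
from pdfflow.pflow import mkPDF  # assumed entry point

x = np.array([0.1])
q2 = np.array([2.0])

# first run: ~61 s on my laptop (the graph is built here)
pdf1 = mkPDF("NNPDF31_nlo_as_0118/0")
print(pdf1.xfxQ2([21], x, q2))

# second run with a different set: ~8 s, the graph is reused
pdf2 = mkPDF("MSTW2008nnlo90cl/0")
print(pdf2.xfxQ2([21], x, q2))
```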