Many thanks for this.
I've just tried and I got:
a_q2 = tf.math.log(aa_q2, name="logq2")
File "/home/carrazza/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 5244, in log
_ops.raise_from_not_ok_status(e, name)
File "/home/carrazza/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.NotFoundError: Could not find valid device for node.
Node:{{node Log}}
so I imagine a missing cast around aa_q2 is needed.
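Something like this might be the fix, a minimal sketch assuming aa_q2 arrives as an integer tensor (tf.math.log has no integer kernel, hence the NotFoundError):

```python
import tensorflow as tf

aa_q2 = tf.constant([2])                  # integer input, as in the failing call
aa_q2 = tf.cast(aa_q2, tf.float64)        # cast before the log
a_q2 = tf.math.log(aa_q2, name="logq2")   # now dispatches to a float kernel
```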
After fixing that I get:
pdf.xfxQ2(0, np.array([0.1]), np.array([2]))
File "/home/carrazza/repo/n3pdf/pdfflow/src/pdfflow/pflow.py", line 123, in xfxQ2
f_f = self._xfxQ2(u, a_x, a_q2)
...
tensorflow.python.framework.errors_impl.InvalidArgumentError: Indices and updates specified for empty output shape
[[{{node cond_1/else/_21/cond_12/else/_315/ScatterNd}}]] [Op:__inference_first_subgrid_26649]
Function call stack:
first_subgrid
I think you branched out from some PR before it was merged into master? I'd suggest taking the latest 3 commits as a patch, going back to master and branching from there.
Update: I am looking at the commit tree and I'm confused.
Anyway, @scarrazza where do you get the error? The singletop_lo.py is working for me.
I think so. I created a new branch from origin/realexample. Is it wrong doing this?
Also, I got the same error a couple of times in the past. I don't know why, and after re-running, the error fixed itself. I recall being really confused about that.
> I think so. I created a new branch from origin/realexample. Is it wrong doing this?
Probably there were some changes done to that branch before it was merged to master and you didn't have them locally. Or some of those changes didn't go into master. In any case, you branched from a past that no longer exists: very sci-fi, but a pain with git :P
Try merging master into this branch; maybe the differences are small enough not to be a problem and there are no conflicts.
Let me show you my local tree so I'm sure I understand. Am I missing any commits in the tree here? If not, can't we just rebase master onto my last commit?
Oh, wait, I think I'm starting to understand what's happening. One moment.
Ok, so I think there was a mix-up where realexample was merged to master before fixed was merged to realexample. Now everything is as it should be.
Ok, I have retried now, here are some highlights:
these are the numbers I am getting for the FK evaluation:
loading from file time (s): 74.13009548187256
dry run time (s): 96.14635920524597
total FK evaluation time (s): 15.626785039901733
The dry run is "perfect", dropping from 2k seconds to 96 s; however, the evaluation time (15.6 s) is now slower than LHAPDF (13.7 s).
Another small issue is that I cannot call a single flavour any more:
a.xfxQ2([0], np.array([0.1]), np.array([2.0]))
# or
a.xfxQ2(0, np.array([0.1]), np.array([2.0]))
The minimum setup needs a list with at least 2 flavours.
I have the feeling that retracing now takes much more time than before, e.g. the single top example now takes 183.27 s instead of 16.8 s.
Regarding the speed of the algorithm I have an idea: each PDF is processed sequentially by subgrid. This is done in pflow.py from line 81 to line 107. Although these steps must be kept separate because the algorithm is slightly different for each, they are fully independent of one another. Is there a way to run them together, like spawning different subprocesses and executing them independently? This would make everything faster, provided we have enough memory for it.
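To sketch what I have in mind (purely illustrative, not the pflow.py code; interpolate_subgrid and its stub body are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def interpolate_subgrid(subgrid, a_x, a_q2):
    # placeholder standing in for the per-subgrid interpolation in pflow.py
    return np.zeros_like(a_x)

def query_all_subgrids(subgrids, a_x, a_q2):
    # the subgrid queries are independent, so they could be dispatched concurrently
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(interpolate_subgrid, sg, a_x, a_q2) for sg in subgrids]
    return [f.result() for f in futures]
```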
I don't know why you are having the second issue: I run compare_accuracy_lhapdf.py, which takes just pid=21 as input, and it works properly.
The code could certainly be made cleaner with fewer operations, particularly by ensuring that we are not wasting time/resources in tf.cond, as @scarlehoff already pointed out.
Concerning point 2, the problem is only with pid=0 (21 works for me) and only on this branch; master works fine.
Ok, I wasn't understanding: it's just an interface problem. pdfflow looks up 0 in the flavour scheme given in the PDF file, can't find it, and then returns an empty tensor.
Should I allow querying pid=0 with a line such as:
if pid == 0: pid = 21
inside the pdfflow algorithm? A sketch of what I mean is below.
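Something along these lines (a sketch only; remap_gluon is just an illustrative name, and I'm assuming pid can be either a scalar or a list):

```python
def remap_gluon(pid):
    # LHAPDF labels the gluon as 21, but 0 is a common alias
    if isinstance(pid, (list, tuple)):
        return [21 if p == 0 else p for p in pid]
    return 21 if pid == 0 else pid
```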
Yes, I think this is a good idea; otherwise many applications that use 0 for the gluon will crash.
Fine, I'll implement it and commit soon. Do you have any comments about my idea in point one?
Concerning point one, I understand the idea, but it's still not clear to me why this PR is slower than master; both branches query the PDFs sequentially for the moment, right?
Yes, they do. I think then that points 1 and 3 are the same issue. Regarding the FK evaluation, we got roughly a 22x improvement in the dry run time, which is fine, but a 7x worsening in the evaluation itself. For the single top example we have roughly a 10x worsening, which is more or less consistent with the 7x above. I am struggling to see how such small changes in the algorithm can cause such large effects, but I'll try to figure it out.
I realized that if an input signature is not specified for every nested function called by the algorithm, tf.function keeps retracing them. However, I couldn't provide an input signature for the act_on_empty function, since its arguments are callables, not tensors. @scarlehoff, do you have any hints on how to prevent it from retracing so many times there?
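For illustration, this is the retracing behaviour I mean (a toy example, not the pdfflow code):

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float64)])
def with_signature(x):
    return tf.math.log(x)

@tf.function
def without_signature(x):
    return tf.math.log(x)

x2 = tf.constant([1.0, 2.0], dtype=tf.float64)
x3 = tf.constant([1.0, 2.0, 3.0], dtype=tf.float64)

with_signature(x2)
with_signature(x3)       # shape [None] already covers this: no retrace
without_signature(x2)
without_signature(x3)    # new input shape -> another retrace
```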
In conclusion, this should speed up the dry run; please give it a try. No improvements on the execution side, though.
Sorry for the delay. The 0 is now working perfectly.
Here are the numbers for the latest commit:
loading time (s): 76.667320728302
dry run time (s): 85.03288006782532
total time (s): 14.76063323020935
The dry run is now slightly faster.
@marcorossi5, @scarlehoff so, summarizing: we need to bring the evaluation time back down to the original values. Do you agree?
Yes, but it's difficult to say exactly where we are losing time. The code is really too similar between the two versions to justify such a worsening.
I made a little improvement. I found that using tf.where performs better than manually selecting the two branches of an if conditional and using the act_on_empty function. I'll try to apply this method wherever possible to speed things up.
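A minimal illustration of the difference (toy code, not the pdfflow kernels):

```python
import tensorflow as tf

x = tf.constant([0.5, 2.0, 8.0], dtype=tf.float64)
inside = x < 1.0

# branchless selection: both expressions are computed element-wise and
# tf.where picks the right one, instead of routing the (possibly empty)
# pieces of the input through two separate conditional branches
result = tf.where(inside, tf.sqrt(x), tf.math.log(x))
```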
Unfortunately I see that the checks for this version have failed, but I don't know why. It works on my laptop.
The pytest error seems to be an issue with packages and imports. I also saw that during the test conda tries to install tf==2.1.0, but version 2.2.0 has been released and is much faster: on my laptop it cut the compare_accuracy_lhapdf.py execution time by 30%. How can we fix this error?
I guess the problem is tensorflow-probability, let me do a quick test.
I made a little change. @scarrazza can you try rerunning the FK evaluation please?
Thanks, now it takes 2s less:
loading time (s): 69.30477404594421
dry run time (s): 71.28993844985962
total time (s): 12.264929056167603
How much does lhapdf take in comparison?
I changed the algorithm. Can you try running the FK evaluation and see if it works better now? Thanks
Thanks, I have updated tf to 2.2 and now the dry run is really fast, but the evaluation time is still similar to the previous one:
loading time (s): 89.9633617401123
dry run time (s): 16.857575178146362
total time (s): 11.823127031326294
Is it fine for me to go through the code? (I don't want to change anything that could create a ton of conflicts if you have incoming commits!)
Yes, please.
The problem with single_top seems to be due to xfxQ2, which needs to be tf.function-compilable (and wasn't). I guess this will introduce some overhead, but it's a start. I'll have a look through the rest of the code this afternoon after teaching.
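For reference, a toy sketch of what "tf.function-compilable" means here (the body is a placeholder, not the real interpolation in pflow.py):

```python
import tensorflow as tf

@tf.function
def xfxQ2_graph(a_x, a_q2):
    # placeholder body; decorating the hot path makes it run as a compiled
    # graph instead of op-by-op in eager mode
    return tf.math.log(a_x) + tf.math.log(a_q2)

xfxQ2_graph(tf.constant([0.1], dtype=tf.float64), tf.constant([2.0], dtype=tf.float64))
```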
Now compilation takes longer I guess, but it was expected.
For now I did some benchmarks on my computer with the state of the latest commit.
VEGAS MC, ncalls=5000000:
Result for iteration 0: 55.5637 +/- 0.1984(took 24.69133 s)
Result for iteration 1: 55.5595 +/- 0.0255(took 1.09133 s)
Result for iteration 2: 55.5271 +/- 0.0167(took 1.25155 s)
Result for iteration 3: 55.5253 +/- 0.0156(took 1.23854 s)
Result for iteration 4: 55.5240 +/- 0.0154(took 1.23973 s)
> Final results: 55.5293 +/- 0.00859928
Memory: 4.5 GB
I also did a check substituting the PDF with a random number generator to see what the overhead of pdfflow was.
VEGAS MC, ncalls=5000000:
Result for iteration 0: 2252.3390 +/- 87.2739(took 11.89053 s)
Result for iteration 1: 2247.0104 +/- 6.9788(took 0.63306 s)
Result for iteration 2: 2246.6553 +/- 1.6158(took 0.65649 s)
Result for iteration 3: 2243.1900 +/- 1.2843(took 0.62217 s)
Result for iteration 4: 2245.6375 +/- 1.2404(took 0.60715 s)
> Final results: 2245 +/- 0.776164
Memory: 3.0 GB
The overhead in time seems to grow with the number of events (I guess it's memory driven).
The PDF computation takes more or less as long as the single_top LO itself, which I'd say is not that bad (if it is truly memory driven and we get the memory down, it means you can do even more points at once; right now my GPU suffers above 1e6 points at once).
It is hard to compare with LHAPDF; we would need to benchmark against the C++ code. @scarrazza, if you can upload some benchmark of the single top LO with your C++ code (which I guess uses LHAPDF), it would be very helpful.
My uninformed guess would be that the numbers we have are not bad: 5e6 events in 1 second seems good. The fairest comparison I can do on my computer is against some Fortran code for Vector Boson Fusion at LO, which takes ~40 seconds per iteration for the same number of events. Being Fortran, the memory usage is ~50 MB, but meh.
Thanks, here are my numbers for C++ with LHAPDF and GSL, single thread:
GSL-VEGAS stage:0 ncalls:1000000 = ( 5.560226e+01 +- 1.998375e-01 = 0.35941% ) pb
GSL-VEGAS stage:1 ncalls:1000000 = ( 5.551466e+01 +- 8.194880e-03 = 0.01476% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551506e+01 +- 4.166962e-03 = 0.00751% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551443e+01 +- 3.059669e-03 = 0.00551% ) pb
GSL-VEGAS stage:2 ncalls:1000000 = ( 5.551723e+01 +- 2.516116e-03 = 0.00453% ) pb
[timer::stop] success: elapsed time 57.700657 seconds
If you remember the aMC@NLO numbers, 1 s is for sure ultra fast.
Out of curiosity, @marcorossi5 @scarrazza, do the function signatures change the performance of the code much in some situations?
I'd say the fewer arguments to a @tf.function-decorated function, the better. But it depends on the price you pay for passing fewer arguments: if it leads to a lot of extra ops to recompute what could have been passed as an argument, it may not be worth it. It's a tradeoff. This is just what I found empirically with pdfflow, nothing theoretically rigorous of course.
Mm, in principle it should only matter for the first calculation (which does compile). If retracing is triggered beyond that there are bigger problems.
I'm asking because on GPU I am actually seeing better performance if I don't assign the signature (in exchange for a few extra seconds of compilation). I'd be partial to removing them because of that, and because I think they make the code more prone to future problems (also, it seems tf is moving towards flexible shapes, which might mean 2.2 is better than before at figuring out the clever signatures on its own).
They should ensure retracing won't be triggered after the first iteration. In principle I think that they could be taken away.
I made a little test. The signature is in fact necessary: if I pass a tensor of shape [x] as an argument, a trace is triggered the first time. If no input_signature is specified and I then pass a tensor of shape [x+1], another retrace is triggered. Instead, if I specify an input_signature with shape [None], then no retrace happens whatever shape is passed.
Mm, it's tricky. I would argue that, as long as there is memory to spare, two retracings are better than one there if you are regularly calling the functions with two different (but recurring) tensor shapes.
More generally I guess the right thing to do is isolate the parts that can be called with more than one shape and ensure it doesn't happen. In any case let's leave them in for now.
Now, for organizational purposes, I would say we should merge this PR (the code does what it is supposed to do and it seems fast enough in the LO test). I think short-term tasks (optimization, implementation of extra features, fixing bugs or whatever) are better suited to separate PRs.
Ok for merging
I changed the algorithm to prevent retracing. Now the graph is built just once and for all, both across multiple replicas within a PDF set and across different PDF sets. Just to give an example: I loaded in the same program first NNPDF31_nlo_as_0118/0 and then MSTW2008nnlo90cl/0. On my laptop the first run took 61 s and the second one 8 s. Can you try this implementation with the FK script? If you don't find any bugs, I propose to merge. Let me know, thanks.
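For reference, this is the kind of test I ran (a sketch; I'm assuming a loader along the lines of mkPDF, so check the actual constructor in pflow.py before copying):

```python
import numpy as np
from pdfflow.pflow import mkPDF  # assumed entry point

x = np.array([0.1])
q2 = np.array([2.0])

# first run: ~61 s on my laptop (the graph is built here)
pdf1 = mkPDF("NNPDF31_nlo_as_0118/0")
print(pdf1.xfxQ2([21], x, q2))

# second run with a different set: ~8 s, the graph is reused
pdf2 = mkPDF("MSTW2008nnlo90cl/0")
print(pdf2.xfxQ2([21], x, q2))
```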