etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. This is mainly a HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions and inverter for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0
32 stars 47 forks source link

fine tune the hierarchichal timing infrastructure #506

Closed kostrzewa closed 2 years ago

kostrzewa commented 2 years ago

replaces #505

kostrzewa commented 2 years ago

I'm working on resolving the conflicts and adjusting some of the output, I hope that's okay.

kostrzewa commented 2 years ago

@Marcogarofalo after some thought and some exploration of the data.tree package I've made a change which may or may not be useful. It significantly worsens the "human readability" of the output but it allows the output to be fed into a tree-based data structure without any additional logic. In the analysis one can do:

raw_data <- system(paste("grep \"Time for\"",
                         args[1],
                         "| grep level",
                         "| awk '{print $2 \" \" $5 \" \" $6 \" \" $9 \" \" $12}'"),
                   intern = TRUE)

data <- read.table(text = raw_data, stringsAsFactors = FALSE)
colnames(data) <- c("prefix", "name", "time", "level", "pathString")

data <- dplyr::mutate(data, pathString = sprintf("HMC%s", pathString)) %>%
        dplyr::group_by(pathString) %>%
        dplyr::summarise(time = sum(time),
                         prefix = unique(prefix),
                         name = unique(name),
                         level = unique(level)) %>%
        dplyr::ungroup()

t_tree <- data.tree::as.Node(data)

which produces the following kind of representation:

> print(t_tree, "time", "prefix")
                                           levelName         time   prefix
1   HMC                                                        NA         
2    ¦--cloverdetlight:cloverdet_heatbath             0.276630200        :
3    ¦   ¦--Qp                                        0.039110120        :
4    ¦   ¦--random_energy0                            0.066529490        :
5    ¦   ¦--sw_invert                                 0.069037080        :
6    ¦   °--sw_term                                   0.101852300        :
7    ¦--cloverdetratio1light:cloverdetratio_heatbath 34.091170000        :
8    ¦   ¦--Qp_zero_pf                                0.031062650        :
9    ¦   ¦--random_energy0                            0.069898270        :
10   ¦   ¦--solve_degenerate                         33.826040000        :
11   ¦   ¦   ¦--gamma5                                0.000388851        :
12   ¦   ¦   °--invert_eo_degenerate_quda            33.825010000 TM_QUDA:
13   ¦   ¦       ¦--invertQuda                       13.625890000 TM_QUDA:
14   ¦   ¦       ¦--MG_Preconditioner_Setup          20.195500000 TM_QUDA:
15   ¦   ¦       ¦--reorder_spinor_eo_fromQuda        0.001913765 TM_QUDA:
16   ¦   ¦       °--reorder_spinor_eo_toQuda          0.001588084 TM_QUDA:
17   ¦   ¦--sw_invert                                 0.063206370        :
18   ¦   °--sw_term                                   0.100872100        :
19   ¦--cloverdetratio2light:cloverdetratio_heatbath  2.513363000        :
20   ¦   ¦--Qp_zero_pf                                0.036231330        :
21   ¦   ¦--random_energy0                            0.067319410        :
22   ¦   ¦--solve_degenerate                          2.242713000        :
23   ¦   ¦   ¦--gamma5                                0.000419401        :
24   ¦   ¦   °--invert_eo_degenerate_quda             2.241631000 TM_QUDA:
25   ¦   ¦       ¦--invertQuda                        2.238565000 TM_QUDA:
26   ¦   ¦       ¦--reorder_spinor_eo_fromQuda        0.002052576 TM_QUDA:
27   ¦   ¦       °--reorder_spinor_eo_toQuda          0.000918682 TM_QUDA:
28   ¦   ¦--sw_invert                                 0.064552320        :
29   ¦   °--sw_term                                   0.102456200        :
30   ¦--CLOVERNDTRLOG:clovernd_trlog_heatbath         0.127413100        :
31   ¦   ¦--sw_term                                   0.102629100        :
32   ¦   °--sw_trace_nd                               0.024746180        :
33   ¦--CLOVERTRLOG:clover_trlog_heatbath             0.120279500        :
34   ¦   ¦--sw_term                                   0.103729000        :
35   ¦   °--sw_trace                                  0.013745540        :
36   ¦--GAUGE:gauge_heatbath                          0.067569350        :
37   ¦--ndcloverrat_0_3:init_ndrat_monomial           0.000018670        :
38   ¦--ndcloverrat_0_3:ndrat_heatbath                2.065281000        :
39   ¦   ¦--random_energy0                            0.135540100        :
40   ¦   ¦--solve_mms_nd_plus                         1.748963000        :
41   ¦   ¦   ¦--f_assign                              0.252822900        :
42   ¦   ¦   °--solve_mms_nd                          1.496015000        :
43   ¦   ¦       ¦--gamma5                            0.002436816        :
44   ¦   ¦       ¦--invert_eo_quda_twoflavour_mshift  1.487874000 TM_QUDA:
45   ¦   ¦       ¦   ¦--invertMultiShiftQuda          0.416139500 TM_QUDA:
46   ¦   ¦       ¦   ¦--loadCloverQuda                0.021220480 TM_QUDA:
47   ¦   ¦       ¦   ¦--loadGaugeQuda                 0.017785190 TM_QUDA:
48   ¦   ¦       ¦   ¦--reorder_gauge_toQuda          0.010681870 TM_QUDA:
49   ¦   ¦       ¦   ¦--reorder_spinor_eo_fromQuda    0.015871140 TM_QUDA:
50   ¦   ¦       ¦   ¦--reorder_spinor_eo_toQuda      0.004953463 TM_QUDA:
51   ¦   ¦       ¦   ¦--twoflavour_input_overhead     0.003285359 TM_QUDA:
52   ¦   ¦       ¦   °--twoflavour_output_overhead    0.011988443 TM_QUDA:
53   ¦   ¦       °--mul_r_gamma5                      0.004574722        :
54   ¦   ¦--sw_invert_nd                              0.043876160        :
55   ¦   °--sw_term                                   0.100460900        :
56   ¦--ndcloverrat_4_6:init_ndrat_monomial           0.000009320        :
57   ¦--ndcloverrat_4_6:ndrat_heatbath                2.265935000        :
58   ¦   ¦--random_energy0                            0.132979700        :
59   ¦   ¦--solve_mms_nd_plus                         1.955709000        :
60   ¦   ¦   ¦--f_assign                              0.204246400        :
61   ¦   ¦   °--solve_mms_nd                          1.748212000        :
62   ¦   ¦       ¦--gamma5                            0.002139136        :
63   ¦   ¦       ¦--invert_eo_quda_twoflavour_mshift  1.742732000 TM_QUDA:
64   ¦   ¦       ¦   ¦--invertMultiShiftQuda          1.715788000 TM_QUDA:
65   ¦   ¦       ¦   ¦--reorder_spinor_eo_fromQuda    0.012970324 TM_QUDA:
66   ¦   ¦       ¦   ¦--reorder_spinor_eo_toQuda      0.003167139 TM_QUDA:
67   ¦   ¦       ¦   ¦--twoflavour_input_overhead     0.001384784 TM_QUDA:
68   ¦   ¦       ¦   °--twoflavour_output_overhead    0.005865898 TM_QUDA:
69   ¦   ¦       °--mul_r_gamma5                      0.003186799        :
70   ¦   ¦--sw_invert_nd                              0.043966710        :
71   ¦   °--sw_term                                   0.108778100        :

which can in turn be rather easily plotted as a multi-level barplot (I'll prepare an actual example hopefully by the end of the day). The idea is to be able to plot something like:

image

What do you think?

Marcogarofalo commented 2 years ago

I should be able to produce your plots already because I had all the information in the structure I create in my script. However, your approach should solve the problem I had in examining logs of killed jobs and can easily print the tree in a readable format.

kostrzewa commented 2 years ago

Thanks a lot for all the help @Marcogarofalo. Based on this and the "new" timing infrastructure, I've introduced the hmc_mk2 profile (profiling/hmc_mk2) which produces this kind of output: profile.pdf

Marcogarofalo commented 2 years ago

I think we can remove profiling/hmc since it will not work with the current log.

Marcogarofalo commented 2 years ago

in the log there is no information on the gauges subfunctions, it look like:

# : Time for ndrat_derivative 1.016409e+01 s level: 1 proc_id: 0 /HMC/ndcloverrat1:ndrat_derivative
# : Time for gauge_derivative 2.770116e+00 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_derivative
# : Time for update_gauge 4.501816e-01 s level: 1 proc_id: 0 /HMC/update_gauge

these with

# QUDA: QUDA 1.0.0 (git v0.9.0-4675-g16b06db2d-sm_80)

and tmLQCD

Version 5.2.0, commit 20e85af0400ad5f9d7da7aa150caf096b119add9
kostrzewa commented 2 years ago

If you had UseExternalLibrary = quda set in the GAUGE monomial, you should see something like:

# TM_QUDA: Time for reorder_gauge_toQuda 9.961213e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_gauge_toQuda
# TM_QUDA: Time for computeGaugeForceQuda 1.246274e-01 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/computeGaugeForceQuda
# TM_QUDA: Time for reorder_mom_fromQuda 3.761868e-02 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_mom_fromQuda
# TM_QUDA: Time for add_mom_to_derivative 2.790029e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/add_mom_to_derivative
# TM_QUDA: Time for compute_gauge_derivative_quda 1.016333e+00 s level: 2 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda
# : Time for gauge_derivative 1.016342e+00 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_derivative
# : Time for update_gauge 5.516283e-02 s level: 1 proc_id: 0 /HMC/update_gauge
# TM_QUDA: Time for reorder_gauge_toQuda 8.106367e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_gauge_toQuda
# TM_QUDA: Time for computeGaugeForceQuda 1.143318e-01 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/computeGaugeForceQuda
# TM_QUDA: Time for reorder_mom_fromQuda 3.010765e-02 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_mom_fromQuda
# TM_QUDA: Time for add_mom_to_derivative 2.823760e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/add_mom_to_derivative
# TM_QUDA: Time for compute_gauge_derivative_quda 1.577096e-01 s level: 2 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda
# : Time for gauge_derivative 1.577177e-01 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_derivative
# TM_QUDA: Time for reorder_gauge_toQuda 7.356934e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_gauge_toQuda
# TM_QUDA: Time for computeGaugeForceQuda 1.130398e-01 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/computeGaugeForceQuda
# TM_QUDA: Time for reorder_mom_fromQuda 1.770647e-02 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/reorder_mom_fromQuda
# TM_QUDA: Time for add_mom_to_derivative 2.842720e-03 s level: 3 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda/add_mom_to_derivative
# TM_QUDA: Time for compute_gauge_derivative_quda 1.431690e-01 s level: 2 proc_id: 0 /HMC/GAUGE:gauge_derivative/compute_gauge_derivative_quda
# : Time for gauge_derivative 1.431770e-01 s level: 1 proc_id: 0 /HMC/GAUGE:gauge_derivative

and so on.

kostrzewa commented 2 years ago

I think we can remove profiling/hmc since it will not work with the current log.

I wanted to keep the option available to perhaps make it work with the current output as it is in principle more flexible than my implementation.

Marcogarofalo commented 2 years ago

If you had UseExternalLibrary = quda set in the GAUGE monomial, you should see something like:

Sorry I compile the version that support quda in the GAUGE monomial but I forgot to add it in the input. Should I rerun the benchmark?

Marcogarofalo commented 2 years ago

I think we can remove profiling/hmc since it will not work with the current log.

I wanted to keep the option available to perhaps make it work with the current output as it is in principle more flexible than my implementation.

ok, I will do the necesssary modifications.

kostrzewa commented 2 years ago

ok, I will do the necesssary modifications.

I was thinking more in terms of "let's keep them around if we perhaps need them". If you see an urgent need, please go ahead, but don't feel obliged to invest time into this now.