ComputationalRadiationPhysics / haseongpu

HASEonGPU: High performance Amplified Spontaneous Emission on GPU
http://www.hzdr.de/crp

Issue17 GrayBat #58

Closed erikzenker closed 9 years ago

erikzenker commented 9 years ago

A first try at using GrayBat as the communication framework in haseongpu. The number of lines of code was reduced compared to the MPI version, and I think it is more intuitive. Some code review would be nice, regarding:

IMPORTANT: This will not compile as-is, since GrayBat needs to be cloned into the include directory:

cd include
git clone -b topic-haseongpu https://github.com/erikzenker/GrayBat.git

slizzered commented 9 years ago

About the CI-testing: you could edit the .drone.yml to clone the graybat-repo, I think? Not sure if drone supports additional git clone operations, but it would be nice during testing

About your questions:

How can GrayBat be integrated into haseongpu best (submodule etc.)

That one is pretty difficult. As a first step, I would suggest a subtree. It is easier to manage than a submodule (and there is no need to initialize submodules when first cloning the git repo). A real submodule has more features, but I don't think we need that many features, since GrayBat is used only as a dependency here.

Is it necessary to encapsulate master and slave code into functions

Maybe not necessary, but it would make the loop-structure much more readable.

Do we need non-blocking in this first version (so the master could also do computations)

I don't think so, we have an issue for that (#13). The real question is: do we need to throw away all the blocking code if we want to do it properly later? Because that would be a lot of unnecessary work, and in that case we might as well do it non-blocking from the start.

Can the code readability be improved somehow

What I like about the readability:
What could be improved:
for(Vertex v : cage.hostedVertices){
    if(v == master){
        // loops internally over all samples and also sends abort-messages in the end
        masterFunction(v, samples);
    }else{
        // internally loops forever until abort-message received
        slaveFunction(v);
    }
}
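
The structure suggested above can be sketched without any communication library. The following is a minimal, single-process illustration and not GrayBat code: `Message`, `Mailbox`, `distributeSamples`, and `workerLoop` are hypothetical names, and the per-worker mailboxes are plain `std::deque`s standing in for real message passing. The master hands out sample indices round-robin and finishes with one abort message per worker; each worker loops until it sees that abort message.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical message type: either a sample index to process or an abort signal.
struct Message {
    enum class Type { Sample, Abort };
    Type type;
    unsigned sample;  // only meaningful when type == Type::Sample
};

// Stand-in for a communication channel to one worker (assumption).
using Mailbox = std::deque<Message>;

// Master side: distribute all samples round-robin, then send one abort
// message to every worker so that each worker loop terminates.
std::vector<Mailbox> distributeSamples(const std::vector<unsigned>& samples,
                                       std::size_t nWorkers) {
    std::vector<Mailbox> mailboxes(nWorkers);
    if (nWorkers == 0) {
        return mailboxes;  // nobody to send to
    }
    for (std::size_t i = 0; i < samples.size(); ++i) {
        mailboxes[i % nWorkers].push_back({Message::Type::Sample, samples[i]});
    }
    for (Mailbox& box : mailboxes) {
        box.push_back({Message::Type::Abort, 0u});
    }
    return mailboxes;
}

// Worker side: loop until the abort message arrives and return the
// indices of the samples that were processed.
std::vector<unsigned> workerLoop(Mailbox& inbox) {
    std::vector<unsigned> processed;
    while (!inbox.empty()) {
        const Message msg = inbox.front();
        inbox.pop_front();
        if (msg.type == Message::Type::Abort) {
            break;
        }
        processed.push_back(msg.sample);  // placeholder for the real computation
    }
    return processed;
}
```

Keeping the termination protocol inside the two functions leaves the outer loop over hosted vertices as short as in the snippet above.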
erikzenker commented 9 years ago

I have finished addressing your comments, but to conclude this pull request I first need to turn GrayBat into a clean library structure and add it to haseongpu as a subtree dependency.

I will also try to fix issue #9 by grouping all interface data into some ExperimentDescription type (this calcPhiAse interface is a mess!!! :cry: )
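
One way to read the plan above is as a parameter-object refactoring: instead of passing many loose arguments to calcPhiAse, related inputs travel together in one type. The sketch below only illustrates that idea. Every field name, as well as the helper `isConsistent`, is a hypothetical placeholder and not taken from the haseongpu sources.

```cpp
#include <vector>

// Hypothetical grouping of the physical inputs of one simulation run.
// All field names are illustrative assumptions.
struct ExperimentDescription {
    std::vector<double> absorptionCrossSections;  // one entry per wavelength
    std::vector<double> emissionCrossSections;    // one entry per wavelength
    unsigned minRaysPerSample;
    unsigned maxRaysPerSample;
};

// Validates the bundle in one place instead of at every call site.
bool isConsistent(const ExperimentDescription& e) {
    return !e.absorptionCrossSections.empty()
        && e.absorptionCrossSections.size() == e.emissionCrossSections.size()
        && e.minRaysPerSample > 0
        && e.minRaysPerSample <= e.maxRaysPerSample;
}
```

A function taking `const ExperimentDescription&` then needs one argument where it previously needed one per field, and new parameters can be added without touching every signature.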

Furthermore, GrayBat is able to solve issue #13. I need to further investigate if this issue is totally covered by the current source state.

With GrayBat, issue #12 would become obsolete, since the MPI code would no longer be needed!

slizzered commented 9 years ago

so many nice things!

erikzenker commented 9 years ago

Oh yeah I like them too :balloon:

slizzered commented 9 years ago

:whale:

slizzered commented 9 years ago

I just noticed: when you work on #9, you could also prepare for #8, by adding this parameter to your ExperimentDescription

erikzenker commented 9 years ago

A lot of commits later:

bussmann commented 9 years ago

a(we)s(om)e

erikzenker commented 9 years ago

Git subtree and rebase are not good friends, so I will leave it at these five commits, but this is the main commit ... we om?

slizzered commented 9 years ago

My first thoughts for now:

There will be a more detailed review of code lines tomorrow. Probably also a full simulation with MATLAB to see if results are correct :)

slizzered commented 9 years ago

Unfortunately, testing on Hypnos uncovered some problems and/or smaller annoyances.

build + environment

This is probably more a list of TODO stuff (might need an issue)

setenv('LD_LIBRARY_PATH', ':/opt/pkg/devel/boost/1.56.0/gnu/4.8.2/64/opt/lib:/opt/pkg/mpi/openmpi/1.7.4/gnu/4.8.2/64/opt/lib:/opt/pkg/infiniband/psm/3.1/lib64:/opt/pkg/devel/cuda/6.5/lib64:/opt/pkg/devel/cuda/6.5/lib:/opt/pkg/numlib/mpfr/3.1.2/lib:/opt/pkg/numlib/mpc/1.0.1/lib:/opt/pkg/numlib/gmp/5.1.1/lib:/opt/torque/lib');

code

This is a more serious list of problems

</home/eckert62/haseongpu/src/importance_sampling.cu>:176 [CUDA] Error: an illegal memory access was encountered
terminate called after throwing an instance of 'thrust::system::system_error'
terminate called recursively
  what():  an illegal memory access was encountered
[1]    7994 abort      ./bin/calcPhiASE --input=input/cylindrical --parallel-mode=threaded --ngpus=2
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
[kepler010:09561] *** Process received signal ***
[kepler010:09561] Signal: Aborted (6)
[kepler010:09561] Signal code:  (-6)
[kepler010:09561] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f6a532b4340]
[kepler010:09561] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f6a526f2cc9]
[kepler010:09561] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f6a526f60d8]
[kepler010:09561] [ 3] /opt/pkg/compiler/gnu/gcc/4.8.2/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x155) [0x7f6a52ffd1f5]
[kepler010:09561] [ 4] /opt/pkg/compiler/gnu/gcc/4.8.2/lib64/libstdc++.so.6(+0x5e256) [0x7f6a52ffb256]
[kepler010:09561] [ 5] /opt/pkg/compiler/gnu/gcc/4.8.2/lib64/libstdc++.so.6(+0x5e283) [0x7f6a52ffb283]
[kepler010:09561] [ 6] /opt/pkg/compiler/gnu/gcc/4.8.2/lib64/libstdc++.so.6(+0x5e4de) [0x7f6a52ffb4de]
[kepler010:09561] [ 7] /opt/pkg/compiler/gnu/gcc/4.8.2/lib64/libstdc++.so.6(_ZSt20__throw_out_of_rangePKc+0x67) [0x7f6a5304fda7]
[kepler010:09561] [ 8] ./bin/calcPhiASE(_ZNKSt6vectorIdSaIdEE14_M_range_checkEm+0x31) [0x499595]
[kepler010:09561] [ 9] ./bin/calcPhiASE(_ZNSt6vectorIdSaIdEE2atEm+0x23) [0x499351]
[kepler010:09561] [10] ./bin/calcPhiASE(_Z10calcPhiAseRK20ExperimentParametersRK17ComputeParametersRK4MeshR6ResultjjRf+0x12c5) [0x49f00f]
[kepler010:09561] [11] ./bin/calcPhiASE(_Z14processSamplesI19CommunicationVertexIN7graybat4CageINS1_19communicationPolicy4BMPIENS1_11graphPolicy3BGLINS5_14SimplePropertyES7_EEEEES9_EvT_SB_RT0_RK20ExperimentParametersRK17ComputeParametersRK4MeshR6Result+0x414) [0x48c5c4]
[kepler010:09561] [12] ./bin/calcPhiASE(_Z17calcPhiAseGrayBatRK20ExperimentParametersRK17ComputeParametersRK4MeshR6Result+0x5ef) [0x489e2f]
[kepler010:09561] [13] ./bin/calcPhiASE(main+0x5e2) [0x4d3d60]
[kepler010:09561] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f6a526ddec5]
[kepler010:09561] [15] ./bin/calcPhiASE() [0x484cb3]
[kepler010:09561] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 9561 on node kepler010 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
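
Frames 8 and 9 of the backtrace above demangle to `std::vector<double>::_M_range_check` and `std::vector<double>::at`, called from calcPhiAse, so the abort looks like an index past the end of a `std::vector<double>`. Which vector and which index are involved is not visible from the log. The following minimal sketch only reproduces that failure mode in isolation; `checkedRead` is a hypothetical helper, not part of haseongpu.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Reads values.at(index) and reports whether the access was in range.
// On failure, `error` receives the exception text instead of the
// exception escaping and terminating the process.
bool checkedRead(const std::vector<double>& values, std::size_t index,
                 double& out, std::string& error) {
    try {
        out = values.at(index);  // at() is bounds-checked, operator[] is not
        return true;
    } catch (const std::out_of_range& e) {
        error = e.what();
        return false;
    }
}
```

Left uncaught, the same exception reaches `std::terminate`, which matches the "terminate called after throwing an instance of 'std::out_of_range'" line in the log.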
slizzered commented 9 years ago

Extensive testing is finished. Results:

bussmann commented 9 years ago

Could you please document your test results?

slizzered commented 9 years ago

I will gladly do so. Is there a preferred form of documentation? Would the calling parameter + runtime + the experimental results (140 floating point numbers) as a comment here be sufficient? Or is there already an established (and more sophisticated) way of doing this?

bussmann commented 9 years ago

I think posting those numbers here would be o.k., but please also link to the code and version that produced these results.

slizzered commented 9 years ago

The following tests were used to verify that switching to GrayBat did not introduce any new errors in the computed values.

Setup

The baseline used plain MPI for communication and the previous dev tip (268b6cdac988b0cbabf2). The new code used GrayBat as introduced by this pull request (9734a6d7b0b0077).

Both programs executed the MATLAB example (in the respective commits, see the folder examples/matlab_example). In the file laserPumpCladdingExample, the parameter parallel_mode was set to 'mpi' or 'graybat', respectively.

Each of the examples was launched on the Hypnos cluster, using a submit-file of the following form:

#!/bin/bash
#PBS -q k20
#PBS -l nodes=2:ppn=8
#PBS -l walltime=96:00:00
#PBS -N HASE_GB_integration_testing

## ENVIRONMENT ############################
. /opt/modules-3.2.6/Modules/3.2.6/init/bash
export MODULES_NO_OUTPUT=1
module load ~/own.modules.kepler
export -n MODULES_NO_OUTPUT
uname -a
echo " "
cd ~/haseongpu_integration_GrayBat

matlab -r laserPumpCladdingExample

Run with MPI:

Run with GrayBat

Results

To obtain gain values for a plot of gain over time in the context of the simulated experiment, the values in the table below need to be processed with f(x) = (x*x)*1.0263. An element-wise comparison showed that the maximum absolute difference between the two runs at the same timestep is about 1.1959 * 10^-4.
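
The post-processing and the comparison described above can be sketched as follows. This is an illustrative helper rather than the script that was actually used; the function names `gain` and `maxAbsDifference` are assumptions, and only the formula f(x) = (x*x)*1.0263 comes from the text.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Converts one raw table value into a gain value: f(x) = (x*x)*1.0263.
double gain(double x) {
    return x * x * 1.0263;
}

// Largest absolute element-wise difference between two runs.
// Both runs must cover the same timesteps.
double maxAbsDifference(const std::vector<double>& a,
                        const std::vector<double>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("runs differ in length");
    }
    double maxDiff = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        maxDiff = std::max(maxDiff, std::fabs(a[i] - b[i]));
    }
    return maxDiff;
}
```

Applied to the two columns of the table, `maxAbsDifference` should give the raw-value difference quoted above; the largest gap appears to be at timestep 62.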

The results of both runs are in strong agreement with the experimental measurements.

Timestep parallel_mode='mpi' parallel_mode='graybat'
0 0.79924105727549743516391345110605470836162567138672 0.79924105727549743516391345110605470836162567138672
1 0.82967902324631426225209906988311558961868286132812 0.82967902324631426225209906988311558961868286132812
2 0.86065456031345011211897144676186144351959228515625 0.86065450313067914933640167873818427324295043945312
3 0.89211832511361943698346976816537790000438690185547 0.89211810728248064350509594078175723552703857421875
4 0.92401615502554501624388194613857194781303405761719 0.92401623826757317559099647041875869035720825195312
5 0.95629206890433193777312226302456110715866088867188 0.95629192411461994005605902202660217881202697753906
6 0.98888668754145281347689433459891006350517272949219 0.98888653707702722783778881421312689781188964843750
7 1.02173808570663582351301101880380883812904357910156 1.02173817873233740982641393202356994152069091796875
8 1.05478200451977599527708662208169698715209960937500 1.05478251922912025229095434042392298579216003417969
9 1.08795304120930591551541510852985084056854248046875 1.08795398637353391002591251890407875180244445800781
10 1.12118754143175847204361161857377737760543823242188 1.12118730012255429784318039310164749622344970703125
11 1.15441376778894189136792647332185879349708557128906 1.15441616775567257313639402127591893076896667480469
12 1.18757060236735023650567200093064457178115844726562 1.18757327950239011116195797512773424386978149414062
13 1.22059087125025222952956482913577929139137268066406 1.22059518124641308567390751704806461930274963378906
14 1.25341253721188450320767060475191101431846618652344 1.25341547345259174406351121433544903993606567382812
15 1.28596870203849400482454257144127041101455688476562 1.28596996069458757716574837104417383670806884765625
16 1.31820267654264688950149775337195023894309997558594 1.31820305872252041545777956343954429030418395996094
17 1.35005190906503891312695486703887581825256347656250 1.35005048202490529618557957292068749666213989257812
18 1.38146797525670961270805037202080711722373962402344 1.38146363532074967217511130002094432711601257324219
19 1.41239445052566869875931843125727027654647827148438 1.41238532440068964568524734204402193427085876464844
20 1.44278030816677871328579385590273886919021606445312 1.44276663101022339930068483226932585239410400390625
21 1.47257918396038078867604781407862901687622070312500 1.47256029770383189969606974045746028423309326171875
22 1.50175200337506309367086032580118626356124877929688 1.50173824651050935585772094782441854476928710937500
23 1.53026062628951020627710022381506860256195068359375 1.53024695623880702122221464378526434302330017089844
24 1.55807495601162626641666975046973675489425659179688 1.55806794009806237610860080167185515165328979492188
25 1.58516826653503306587822407891508191823959350585938 1.58516661223194255114776751725003123283386230468750
26 1.61150446813444436777729151799576357007026672363281 1.61149635999899976113169941527303308248519897460938
27 1.63705129419724437767058589088264852762222290039062 1.63705701197019637405105640937108546495437622070312
28 1.66181242039397636389708168280776590108871459960938 1.66180615954696841995996692276094108819961547851562
29 1.68574895477104513830113319272641092538833618164062 1.68579391296374359043852564354892820119857788085938
30 1.70891654394968206531757459742948412895202636718750 1.70890417744886047302088627475313842296600341796875
31 1.73123826144268466720177457318641245365142822265625 1.73124617480678377745562102063558995723724365234375
32 1.75271816879267117172958023729734122753143310546875 1.75273022793506827454734775528777390718460083007812
33 1.77337309509880070024223641667049378156661987304688 1.77340266314167260830458872078452259302139282226562
34 1.79322672738207966602885790052823722362518310546875 1.79326263011179221074087308807065710425376892089844
35 1.81228652243421883838436770020052790641784667968750 1.81230867797559858090039597300346940755844116210938
36 1.83051731644692705636146001779707148671150207519531 1.83058824865950908744594016752671450376510620117188
37 1.84797563587853019839712942484766244888305664062500 1.84807696638669982647229517169762402772903442382812
38 1.86468727618973373338917554065119475126266479492188 1.86476669543899920000740166869945824146270751953125
39 1.88061527122078064877541692112572491168975830078125 1.88071105449894759864548632322112098336219787597656
40 1.89585068886644791952278410462895408272743225097656 1.89594059335585174430605093220947310328483581542969
41 1.91036076525293307959429967013420537114143371582031 1.91042439603979463669247707002796232700347900390625
42 1.92416093602389981498390625347383320331573486328125 1.92422297132776209949156509537715464830398559570312
43 1.93731707782169948472983378451317548751831054687500 1.93736111751595108110279852553503587841987609863281
44 1.94983881781624801554642090195557102560997009277344 1.94983790907548693027706576685886830091476440429688
45 1.96174018371838720931066291086608543992042541503906 1.96167749028284887913287093397229909896850585937500
46 1.97299087412253282280971689033322036266326904296875 1.97295693422107198955472995294257998466491699218750
47 1.98363227114736195844102439878042787313461303710938 1.98359814618550811538000289147021248936653137207031
48 1.99372853188743692776085936202434822916984558105469 1.99375725961057170820822648238390684127807617187500
49 2.00327890913641759595975599950179457664489746093750 2.00330538803419777593717299168929457664489746093750
50 2.01233923114299706469410011777654290199279785156250 2.01234183287668599859898677095770835876464843750000
51 1.95023160016343144462780401227064430713653564453125 1.95027297351027040228643727459711953997611999511719
52 1.89322795374638919163601258333073928952217102050781 1.89329912543583933626223370083607733249664306640625
53 1.84073301501177066796799408621154725551605224609375 1.84079619648676784393614980217535048723220825195312
54 1.79220296041852877877431637898553162813186645507812 1.79230480779378442690585870877839624881744384765625
55 1.74729334909795053221159832901321351528167724609375 1.74736590926347279406627421849407255649566650390625
56 1.70558599870012406185537656710948795080184936523438 1.70560105463898992184113012626767158508300781250000
57 1.66672427792216581998729907354572787880897521972656 1.66673651575352632647764039575122296810150146484375
58 1.63041204732545264022292030858807265758514404296875 1.63046909773576076396750522690126672387123107910156
59 1.59646467185172946656734893622342497110366821289062 1.59655740282186764389393829333130270242691040039062
60 1.56467260036847877202603740443009883165359497070312 1.56474757966014044185953935084398835897445678710938
61 1.53477530242396653648029314354062080383300781250000 1.53487134101274058650687948102131485939025878906250
62 1.50667034020087386991804123681504279375076293945312 1.50678993018102280743164556042756885290145874023438
63 1.48020445962831814767923788167536258697509765625000 1.48030881059096275365050132677424699068069458007812
64 1.45522976530226211266949576383922249078750610351562 1.45531760722638581206922481214860454201698303222656
65 1.43161965112932065835593675728887319564819335937500 1.43170879745828605322799376153852790594100952148438
66 1.40927320834398450699609384173527359962463378906250 1.40934442994500530588197761971969157457351684570312
67 1.38809272305428121896397897216957062482833862304688 1.38815833863763371525124057370703667402267456054688
68 1.36798600640117018478747468179790303111076354980469 1.36805450147048524023318805120652541518211364746094
69 1.34887759522140981971460860222578048706054687500000 1.34893652252569218319422361673787236213684082031250
70 1.33070528648555974626788156456314027309417724609375 1.33075947426518115257465524337021633982658386230469
71 1.31339704230966991538309684983687475323677062988281 1.31345503551990727686415993957780301570892333984375
72 1.29689729692917143921704337117262184619903564453125 1.29694967787497961175802174693671986460685729980469
73 1.28115896859266986673731025803135707974433898925781 1.28120552528991993312956765294075012207031250000000
74 1.26612672693373728982635384454624727368354797363281 1.26617532994802894918962010706309229135513305664062
75 1.25176233234970069041480655869236215949058532714844 1.25181085517648416072233885643072426319122314453125
76 1.23802417658835173241982374747749418020248413085938 1.23806479000507030363564808794762939214706420898438
77 1.22486707719197340793471084907650947570800781250000 1.22490790678459893214835574326571077108383178710938
78 1.21226240896487413856164039316354319453239440917969 1.21230300273921764997453465184662491083145141601562
79 1.20017786193752384882316164294024929404258728027344 1.20021803724651343614482357224915176630020141601562
80 1.18858477153435693196570355212315917015075683593750 1.18861959982906850719075464439811185002326965332031
81 1.17745167816981544106624824053142219781875610351562 1.17748466745651336751166127214673906564712524414062
82 1.16675576782464118785753726115217432379722595214844 1.16678824401733294280347763560712337493896484375000
83 1.15647070451125677514880862872814759612083435058594 1.15650379192581875820167169877095147967338562011719
84 1.14658267452833340094286995736183598637580871582031 1.14661223735440587212508489756146445870399475097656
85 1.13706241542127473032053330825874581933021545410156 1.13709104312910036505002153717214241623878479003906
86 1.12789441686193891989375970297260209918022155761719 1.12792253554450860875135731475893408060073852539062
87 1.11906167968145897617660011746920645236968994140625 1.11908891938988364067597558459965512156486511230469
88 1.11054676081391767716866070259129628539085388183594 1.11057410835162162499045734875835478305816650390625
89 1.10233320496028719404080220556352287530899047851562 1.10235987416302450014882197137922048568725585937500
90 1.09440809924834869804044501506723463535308837890625 1.09443327445908167305788083467632532119750976562500
91 1.08675682609065549222293611819623038172721862792969 1.08677954837872903226525522768497467041015625000000
92 1.07936561738552749822872556251240894198417663574219 1.07938843096769754303920763049973174929618835449219
93 1.07222369768967817904581352195236831903457641601562 1.07224574411309214738707851211074739694595336914062
94 1.06531948389703678969908651197329163551330566406250 1.06534139258319848764244852645788341760635375976562
95 1.05864301564959051304981585417408496141433715820312 1.05866419949786183529738536890363320708274841308594
96 1.05218333743465786156434660369995981454849243164062 1.05220383384603044518712522403802722692489624023438
97 1.04593118632138715184964894433505833148956298828125 1.04595187200631611901258111174684017896652221679688
98 1.03987769126161277988273923256201669573783874511719 1.03989749624459903998285881243646144866943359375000
99 1.03401443916468016581688971200492233037948608398438 1.03403382826481005096752596728038042783737182617188
100 1.02833380984387634526910915155895054340362548828125 1.02835278123493134572186136210802942514419555664062
101 1.02282781829253943683966099342796951532363891601562 1.02284632144387832575205266039120033383369445800781
102 1.01748987714621863531760936893988400697708129882812 1.01750776097920314633427096850937232375144958496094
103 1.01231265114685742290134840004611760377883911132812 1.01232976746142799839844883535988628864288330078125
104 1.00728989758798048725907392508815973997116088867188 1.00730654239090489099339720269199460744857788085938
105 1.00241560026868392618837333429837599396705627441406 1.00243205832733295324032951612025499343872070312500
106 0.99768392973211783569809085747692734003067016601562 0.99770002431477866622344663483090698719024658203125
107 0.99308948011685127532643946324242278933525085449219 0.99310521979315868890125784673728048801422119140625
108 0.98862708334435245305371608992572873830795288085938 0.98864249878933185833318475488340482115745544433594
109 0.98429185264265162125241204194026067852973937988281 0.98430674615032709429129909040057100355625152587891
110 0.98007898593704401157822303503053262829780578613281 0.98009357988187639776356263610068708658218383789062
111 0.97598386952431637197946656669955700635910034179688 0.97599808677926047639772377806366421282291412353516
112 0.97200244866920026964862699969671666622161865234375 0.97201630568644348251439168961951509118080139160156
113 0.96813062343399680642619387072045356035232543945312 0.96814402747458272457947714428883045911788940429688
114 0.96436457389193619427203429950168356299400329589844 0.96437742736616449690245644887909293174743652343750
115 0.96070032518943448973658405520836822688579559326172 0.96071291082469145727884551888564601540565490722656
116 0.95713419878504546467468117043608799576759338378906 0.95714651528037331118525798956397920846939086914062
117 0.95366350766951124562353925284696742892265319824219 0.95367487535777195617470169963780790567398071289062
118 0.95028412777753801243818543298402801156044006347656 0.95029540409137192025212925727828405797481536865234
119 0.94699319649152213784759624104481190443038940429688 0.94700468528104564391867370432009920477867126464844
120 0.94378854203383011345351860654773190617561340332031 0.94379969544477770870827271210146136581897735595703
121 0.94066698226927292214583076201961375772953033447266 0.94067737918934257734804305073339492082595825195312
122 0.93762534876472802825020380623755045235157012939453 0.93763524431943712222903286601649597287178039550781
123 0.93466120075114433873864072666037827730178833007812 0.93467072970431031020410728160641156136989593505859
124 0.93177213533205527351555019777151755988597869873047 0.93178119472694687086544718113145790994167327880859
125 0.92895562696451494666405324096558615565299987792969 0.92896429433667893071913113089976832270622253417969
126 0.92620925836542267184370302857132628560066223144531 0.92621823654002888304148655151948332786560058593750
127 0.92353096421017433215183700667694211006164550781250 0.92354005783888715175322658978984691202640533447266
128 0.92091899371932794959860757444403134286403656005859 0.92092803999458872077354953944450244307518005371094
129 0.91837150866872896415316063212230801582336425781250 0.91837987517684482874358309345552697777748107910156
130 0.91588572374136267839617175923194736242294311523438 0.91589372733883989141645542986225336790084838867188
131 0.91346047016327713841121749283047392964363098144531 0.91346780413008676902109073125757277011871337890625
132 0.91109342395911585565926316121476702392101287841797 0.91110044454146565939822721702512353658676147460938
133 0.90878306949442477780110039020655676722526550292969 0.90879029770718133818263595458120107650756835937500
134 0.90652770722140352255280504323309287428855895996094 0.90653504409004259656512658693827688694000244140625
135 0.90432618176469226067837325899745337665081024169922 0.90433265567123521933012852969113737344741821289062
136 0.90217647392867383882020249075139872729778289794922 0.90218218537580208149506688641849905252456665039062
137 0.90007668712734290039634288405068218708038330078125 0.90008234614228554981707475235452875494956970214844
138 0.89802636107652700214742935713729821145534515380859 0.89803132986832745476846184828900732100009918212891
139 0.89602317473862913566051702218828722834587097167969 0.89602768891309547694135062556597404181957244873047