brianstock / MixSIAR

A framework for Bayesian mixing models in R:
http://brianstock.github.io/MixSIAR/

run chains in parallel #146

Open AndrewLJackson opened 6 years ago

AndrewLJackson commented 6 years ago

I know we looked at jags.parallel before as a way to run chains in parallel. I have been playing around with this recently, and I found this blog post comparing different approaches: https://stephendavidgregory.github.io/statistics/Jags-in-parallel

I would favour using mclapply(), which is very close to the foreach() example in that post. The only extra step required with these two similar approaches is to rebundle the per-chain mcmc objects into a single mcmc.list; the blog post provides the code to do this. I will try to find a bit of time to explore how easy it is to implement.
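As a rough illustration of the mclapply() idea, the sketch below runs three chains in parallel and rebundles them with coda. It is not the blog post's code: `run_one_chain()` is a hypothetical stand-in that just draws random numbers, where a real version would fit one JAGS chain (for example via rjags) and return its samples.

```r
library(parallel)
library(coda)

# Hypothetical stand-in for fitting a single chain. It only simulates draws,
# so the sketch runs without JAGS installed.
run_one_chain <- function(chain_id, n_iter = 1000) {
  set.seed(1000 + chain_id)  # a different seed for each chain
  draws <- cbind(alpha = rnorm(n_iter), beta = rnorm(n_iter, mean = 2))
  coda::mcmc(draws)
}

n_chains <- 3
# mclapply() relies on forking, so more than one core is not available on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else n_chains
chains <- parallel::mclapply(seq_len(n_chains), run_one_chain, mc.cores = n_cores)

# Rebundle the per-chain mcmc objects into one mcmc.list for the usual coda tools.
samples <- coda::mcmc.list(chains)
coda::nchain(samples)
coda::niter(samples)
```

The rebundling step is the `coda::mcmc.list()` call; everything downstream (summaries, convergence diagnostics) can then treat the result as an ordinary multi-chain object.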

AndrewLJackson commented 6 years ago

I tried to code up a parallel version using mclapply() but in the end gave up, as I couldn't get it to pass data along properly for some as yet unknown reason. In the end I tried a quick jags.parallel implementation and it works! I remember that when I was visiting UCSD, Brian, Eric, Brice and I took a look at this, but it either a) simply didn't work, or b) had a bug in it we didn't like. I can't remember which, so I am venturing my now misnamed branch "parallelise-with-mclapply" with some caution. Can you guys remember?

ericward-noaa commented 6 years ago

Yes! I think we'd been working on the parallel-hotfix branch. I don't think there was a bug or anything; I just don't think we merged those changes in. I'll send the email thread about this around again. Part of the changes involved transitioning from R2jags to rjags.

edbennett commented 5 years ago

Has any progress been made on merging this branch in? I'm looking at using MixSIAR on a high-performance computing cluster and it'd be great if I could get the performance boost from running the chains in parallel.

brianstock commented 5 years ago

Sorry, no I haven't done any development with MixSIAR since adding some features in the push for the PeerJ paper (LOO/WAIC, compare_models, combine_sources). It would be great if you want to take a look... I would hope it isn't too difficult. If anyone does, make sure the branch you're working on is up-to-date with master, test the examples, and submit a pull request.

edbennett commented 5 years ago

I've taken a quick look—merged master, installed and tried to run the tests. 9 of the tests fail with the same error—"object 'mcmc' not found"—which at a guess I presume means that something isn't getting packaged up and passed into do.call properly. My R isn't strong enough to debug this tonight; if I have time at some point I'll come back to it, but if someone with more R experience than me wants to try then I'd be grateful.

AndrewLJackson commented 5 years ago

My memory is that it wasn't straightforward (at least at the time) and jags.parallel() wasn't working as we required. It might be easier to spawn parallel calls to MixSIAR via mclapply() or similar, which would achieve the same thing. The clunky way is to open lots of instances of R and run a few goes.

edbennett commented 5 years ago

I've looked into this some more and it appears at least on the surface that the bug is in R2jags::jags.parallel rather than in MixSIAR; removing "mcmc", "mcmc.list", from line 63 of https://github.com/cran/R2jags/blob/eb110395be5249112cfdbce16b0a17267a7e8677/R/jagsParallel.R allows the parallel-hotfix branch to pass all tests, and run an example problem in parallel at 3x speed (with 3 chains, so the linear speedup you'd expect). I've emailed the R2jags maintainer about this but haven't had a reply yet.

AndrewLJackson commented 5 years ago

Hi Ed

Yes, that is my memory of the issue too from when we last looked at it. I'm pretty sure I also emailed the maintainer but didn't hear back. I wondered at the time if mclapply() might be used, but we'd have to wrap all the required objects up into a new list of lists, create a wrapper function, and then write a function to stick it all together at the end into an mcmc object to enable easy analysis. This seemed like a lot of work, so I gave up! Happy if anyone else wants to give it a go.

Best wishes Andrew

edbennett commented 5 years ago

Thanks for confirming that—you're right, reinventing a lot of machinery that has already been built to fix this issue seems a lot of work.

As an alternative approach I've pulled in the jags.parallel function from R2jags in its entirety, so that it can be fixed within MixSIAR. This can be found at https://github.com/edbennett/MixSIAR/tree/parallel-hotfix. I've tested this and it works with stock R2jags. I also added the facility to choose your level of parallelism (e.g. if you want to run on one core without thrashing the CPU switching between streams). Unfortunately it relies on using one private function from R2jags; I don't know what CRAN's policy on using functions not exported in the namespace is, but it might be that this version wouldn't be able to be published.
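How the branch exposes the level of parallelism isn't shown in this thread, so the following is only a sketch of the general idea using base R's parallel package. `choose_workers()` and its arguments are hypothetical names, and the function passed to `parLapply()` is a stand-in for running one chain.

```r
library(parallel)

# Hypothetical helper: never start more workers than there are chains,
# and never more than the number of cores the user wants to use.
choose_workers <- function(n_chains, max_cores = parallel::detectCores()) {
  if (is.na(max_cores)) max_cores <- 1L  # detectCores() can return NA
  max(1L, min(as.integer(n_chains), as.integer(max_cores)))
}

# Three chains on at most two cores: two workers share the three chains.
n_workers <- choose_workers(n_chains = 3, max_cores = 2)

cl <- parallel::makeCluster(n_workers)
chain_means <- parallel::parLapply(cl, 1:3, function(chain_id) {
  mean(rnorm(100))  # stand-in for running one chain
})
parallel::stopCluster(cl)
```

With `max_cores = 1` the same code runs the chains one after another on a single worker, which is the "one core without thrashing" case mentioned above.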

AndrewLJackson commented 5 years ago

Hi Ed

Great job! I too had started to look at the inner workings of R2jags and bypassing the issue but gave up. I don’t know either about this, but CRAN have become super strict recently and this might well contravene their policy. Ultimately though being able to run it via a git install of this branch seems ok to me at least for now.

One more thing to check is that the initial values passed to each chain are truly different. This is something that I read about with jags.parallel and possibly other parallel instances of JAGS.
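One way to make this explicit, sketched here rather than taken from any of the branches discussed, is to give every chain its own RNG settings in its initial values. JAGS accepts `.RNG.name` and `.RNG.seed` entries in each chain's inits; `make_inits()` and `base_seed` are hypothetical names. Starting values for model parameters could be added to each list too, but their names depend on the model and are not shown.

```r
# Hypothetical helper: one inits list per chain, each with its own RNG seed.
make_inits <- function(n_chains, base_seed = 2019) {
  lapply(seq_len(n_chains), function(i) {
    list(.RNG.name = "base::Mersenne-Twister",
         .RNG.seed = base_seed + i)
  })
}

inits <- make_inits(3)

# Check that no two chains share a seed before handing the inits to JAGS.
seeds <- vapply(inits, function(x) x$.RNG.seed, numeric(1))
stopifnot(anyDuplicated(seeds) == 0)
```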

The other approach I started to look into was dropping R2jags altogether (I don’t really like using it) and calling rjags directly and wrapping it in dclone as per https://stackoverflow.com/questions/23790452/using-jags-parallel-from-within-a-function-r-language-error-in-getname-envir

I also found this https://rdrr.io/cran/dclone/man/jags.parfit.html which might be an easy port

More completely, this post shows various options for parallelizing JAGS, and jags.parfit() looks the most straightforward by far.

If you had a bit of time to explore this, that would be awesome :)

Best wishes

Andrew

edbennett commented 5 years ago

> One more thing to check is that the initial values passed to each chain are truly different. This is something that I read about with jags.parallel and possibly other parallel instances of JAGS.

Good question. R2jags::jags, called by jags.parallel, accepts an argument jags.seed, but this is then not used within the function. jags.parallel (both the original implementation and the copy I've made) uses parallel::clusterSetRNGStream to seed the random number generator separately for each worker, so I think that this is being correctly handled. The test runs I've done have returned different diagnostic metrics for the three chains, so there is definitely some difference between them.
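A quick way to see what parallel::clusterSetRNGStream does, independent of R2jags, is to seed a small cluster and draw a few numbers on each worker. This is only a sketch: it checks R's own random number generator on each worker, and whether JAGS's internal generators end up seeded differently as a result is not verified here.

```r
library(parallel)

cl <- parallel::makeCluster(3)

# Give each worker its own reproducible random number stream.
parallel::clusterSetRNGStream(cl, iseed = 42)

# Draw three uniform numbers on every worker.
draws <- parallel::clusterEvalQ(cl, runif(3))
parallel::stopCluster(cl)

# Each worker should have produced a different set of numbers.
stopifnot(!identical(draws[[1]], draws[[2]]),
          !identical(draws[[2]], draws[[3]]),
          !identical(draws[[1]], draws[[3]]))
```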

It'd definitely be nice to remove the dependence on an unsupported library. I'm not actually using MixSIAR myself, but am supporting a user using it on our HPC facility; currently we don't have capacity available to do a full re-parallelisation on this, but we might in the coming months. Even with parallelisation the user we're working with is running into job time limits, so we might need to look into implementing some kind of checkpoint-restart for either MixSIAR or rjags; if we do this then we'll definitely try and rework the parallelism at the same time.

galenholt commented 5 years ago

I've been looking into this a bit too, and am about to give up for the time being and use Ed's hotfix, but thought I'd throw out a bit more info in case it helps anyone in the future. It was relatively easy to change run_model to swap out R2jags::jags for either dclone::jags.parfit or jagsUI::jags (which allows parallel by passing some arguments and is under active development). Looking under the hood, it shouldn't be much harder to just go to rjags directly, although the wrappers do take care of a fair amount of organizing and error catching.

The catch with all of them is that the object R2jags produces (and MixSIAR expects) is quite different than any of these, and is created from the rjags results by the private functions Ed ran into. The object returned by jagsUI is close, and I managed to nearly rearrange it to match, but it never quite made it through MixSIAR::output_jags successfully, and I assume wouldn't work in anything else it's passed to. I'm not familiar enough with the class definitions of these objects to figure out what the issue is without a fair amount of time investment.

Given what appears to be the unique structure of the object returned by R2jags::jags, it almost seemed like it was going to be easier to change downstream MixSIAR functions to accept the outputs of one of the other packages. But that becomes a much larger project than the simple swap from R2jags::jags to R2jags::jags.parallel I think all of us were hoping for when we each started. I hope this helps someone, even if I'm going to cut and run for the time being.

edbennett commented 5 years ago

@galenholt As a warning, the user working with this has come back to me this week and pointed out that the results from the parallel version don't match the results from the serial version, and have poor convergence. When I have time I'll try and look into why this is (I suspect it is something to do with how I've mashed together old and new branches), but for now I'd recommend against using my branch for production work.

AndrewLJackson commented 5 years ago

Hi all

Thanks for working on this. Your input is very helpful. I agree - I think this is a can of worms at best, and not far off Pandora’s box of odious treasure.

This is probably one of those additions to leave to a major (MAJOR) update in the future. I think it apt to kick our can of worms down the road for now.

Best wishes and thanks again

Andrew
