Reduce vignette compilation time?

mpadge commented 1 month ago

@abigailkeller I'm trying to debug which your package won't built on our systems for https://github.com/ropensci/software-review/issues/642. One difficulty is that each change takes hours to test, mainly because your (amazingly detailed and excellent!!) vignette takes over an hour to compile. I am wondering whether you can think of clever ways to reduce this time?

I commonly use tricks such as hard-coding outputs, saving pre-generated data sets and simply loading those during compilation, or pre-generating plots. The results presented in your ~~vignette~~book are so complex that you'd need combinations of all of those, and that would, unfortunately, increase the size of your package. This will make things very difficult... I'm not sure whether you plan on submitting this package to BioC or CRAN, but it's installed size is simply breathtaking, mostly because of the /src directory, which in turn is I guess mostly because of stanheaders. It's >300MB on your rcmdcheck action (and takes over an hour there), and >900MB on my local machine.

I have no idea what to do about that, but suspect it would be impossible to get such a monster accepted on CRAN, although I don't know about BioC? Regardless, this huge size maybe gives you some leeway in the vignettes, because the couple of MB that might be added through pre-generating results or images would be nothing compared with compiled object sizes, and you could ignore that.

I'm not suggesting you should do anything specific in response to this, but someway to reduce vignette compilation times would be helpful both for overcoming this initial problem, and likely also for reviewers, who'll also want/need to do same on their local machines. Any thoughts?

mpadge commented 1 month ago

Update: I suspect the build failures on our system are a system-level timeout after one hour of trying to build the vignette. An alternative for you may be to store the vignettes somewhere else outside the package itself, as has been done for another recent submission. We're thinking about how do update our docs and recommendations for packages with books instead of vignettes. If that idea appeals to you, I'll make sure to report back on how we might update our policies.

abigailkeller commented 1 month ago

Hi @mpadge Thanks so much for the time you've spent with this!!! So very helpful, and I really appreciate it. A couple thoughts:

It definitely makes sense to save some model objects and plots within the package (as .rda files in the data folder?) that can be used to build the vignette. It would be nice to have reviewers be able to build the vignette.
An alternative (or in addition?) would be to remove the vignette. I have an identical copy of the vignette as a bookdown page (also linked in the package's README). I had made this page previously because I find them cleaner to navigate through than the html vignettes.
It seems that the size of the src folder should be addressed. I have ideas of how to do this (combining redundant code), but I'm wondering if you have a general ballpark of how much I should be trying to reduce it, especially if I am trying to put the package on Cran?

mpadge commented 1 month ago

Hi @abigailkeller, I think it would definitely be a good idea to remove the vignette from this package, and just provide loud and clear links throughout to the book. Removing it will likely make the review process easier and smoother. Note that because of this package and one other current one (sits via https://github.com/ropensci/software-review/issues/596), we are going to update our guidelines for how to both structure and review packages like these. We'll ping you from that issue to keep you informed.

And the compiled size is also due to the vignette. Removing that yields a tarball of < 1MB on my machine, so all good there too.

mpadge commented 1 month ago

BTW @abigailkeller, i was looking through your code to check compiled object sizes and realised you've got a lot of code wrangling probability distributions in there. We've also got Probability Distributions Standards which you might consider documenting compliance with? It would likely be very useful for us if you could at least think about whether or not those standards might apply, as much of your code that would come under those standards is the auto-generated rstantools stuff, so likely to also arise in other packages.

The probability distribution standards are a bit different from other categories, as explained at the outset of the chapter, as they are ancillary, rather than primary, standards. That means there is no minimal number you'd be expected to comply with. You don't do any distributional fitting, optimisation, or integration, so none of those bits would apply. Nevertheless, from a quick scan of your code, I think the standards for testing may help to refine your current test suite.

What do you think?

abigailkeller commented 1 month ago

Hi @mpadge thanks again for your help with this. A few responses:

I just pushed a version of the package without the vignette and linked the user guide in the package DESCRIPTION and in the documentation for all functions. I can still try and reduce the /src directory if needed!
Thanks for the idea of the probability distribution standards. First just to clarify how the package is structured (and this makes me think I should do a better job communicating the package): the package stores a bunch of Bayesian models written in Stan in this folder. Each file here contains a variation of a model that uses environmental DNA data and "traditional" survey data to jointly estimate parameters. These model variations are accessed based on the type of input data and/or user-defined input parameters, including distributional assumptions. And then rstantools is used to compile the models that you find in the /src folder. These compiled models are then called during MCMC sampling.
I think the following probability distribution standards may apply:

PD1.0: Some of the distributions in the models are user-defined and others are fixed and assumed. I can definitely provide documentation for why these were chosen, but I'm curious where this documentation should go? The natural location for these might be in the .Stan files themselves where the models are defined (example here), but I'm not quite sure the srr syntax would work in a non-R file.
PD4.0: In my tests, I do "parameter recovery" tests, i.e., simulate data with known parameter values, run a package function with the data, and then test if the known parameter values are within the 95% credibility interval of the parameters' posteriors in the function output. This seems like how one would test for numeric equality in a Bayesian context? Although curious if you have other advice.
PD4.1: I'm not sure this makes sense here because the output values of MCMC are derived stochastically. I use the MCMC algorithm in rstan's sampling function, rather than my own MCMC algorithm, so I'm not sure I would be able to reproduce how the output values are derived.
PD4.2: This standard definitely makes sense for this package. One input parameter allows the user to make a distributional assumption about the traditional survey data. I can definitely add a test that shows that results will differ based on this assumption.

mpadge commented 1 month ago

That sounds great, and sounds to me like it would be useful for you to comply with prob. dist. standards. The abiding aim of our whole standardisation procedures is to improve software quality, and so if you think that compliance with these extra standards will in any way improve either your software itself, or the ability of others to understand why your software is the way it is, then please extend you compliance to include these extra standards.

Specific thoughts in response to your comments:

I just pushed a version of the package without the vignette and linked the user guide in the package DESCRIPTION and in the documentation for all functions. I can still try and reduce the /src directory if needed!

No need, that should be fine now.

Thanks for the idea of the probability distribution standards. First just to clarify how the package is structured (and this makes me think I should do a better job communicating the package): the package stores a bunch of Bayesian models written in Stan in this folder. Each file here contains a variation of a model that uses environmental DNA data and "traditional" survey data to jointly estimate parameters. These model variations are accessed based on the type of input data and/or user-defined input parameters, including distributional assumptions. And then rstantools is used to compile the models that you find in the /src folder. These compiled models are then called during MCMC sampling.

One suggestion for where to document this would be in a README.md file in the /src folder itself. GitHub will directly display README files in any sub-directory, and this is a really useful way to document sub-directory specific stuff like that. I'm not sure the srr package will currently correctly parse that, but feel free to open an issue there and we can easily accommodate that.

PD1.0: Some of the distributions in the models are user-defined and others are fixed and assumed. I can definitely provide documentation for why these were chosen, but I'm curious where this documentation should go? The natural location for these might be in the .Stan files themselves where the models are defined (example here), but I'm not quite sure the srr syntax would work in a non-R file.

Yes, it works in any files in the /src directory too. As above, open an issue in srr if you encounter any problems there.

PD4.0: In my tests, I do "parameter recovery" tests, i.e., simulate data with known parameter values, run a package function with the data, and then test if the known parameter values are within the 95% credibility interval of the parameters' posteriors in the function output. This seems like how one would test for numeric equality in a Bayesian context? Although curious if you have other advice.

Sounds great, and definitely sounds like it would be useful to document that.

PD4.1: I'm not sure this makes sense here because the output values of MCMC are derived stochastically. I use the MCMC algorithm in rstan's sampling function, rather than my own MCMC algorithm, so I'm not sure I would be able to reproduce how the output values are derived.

Documenting that, including statements of inability to strictly comply, would then be all the more important.

PD4.2: This standard definitely makes sense for this package. One input parameter allows the user to make a distributional assumption about the traditional survey data. I can definitely add a test that shows that results will differ based on this assumption.

That also sounds like it would be particularly helpful to document.

Thanks for your considered responses!

Also note that your submission made me realise that we had neglected to update our submission templates to include probability distribution standards. If you do decide to also comply with those standards, could you please edit your submission template and include an additional checked item for "Probability Distributions". Thanks!

abigailkeller commented 1 month ago

@mpadge Ok, great! Thanks so much for your help, again.

I just pushed a version of the package with the probability distribution standards included. I added some standards to a .Rmd file (that generates a github_document), but it looks like srr does not recognize the inclusion of the standards in this way. I'll create an issue with that package.

Thanks!!

abigailkeller / eDNAjoint

Reduce vignette compilation time? #2