JoeyBernhardt / p-temp

temperature and nutrient effects on productivity

p-temp progress update: "p-temp_simecol_fit" #32

Open marcus-campbell opened 7 years ago

marcus-campbell commented 7 years ago

Intro

I have consolidated all of the code developed over the past few months into a single script, "p-temp_simecol_fit", which I intend to be the final version of our attempts at parameter estimation using the simecol package. This script draws on essentially all of the tools the package has to offer and is heavily optimized; I believe it represents our best shot at obtaining parameter estimates with this package.

Objectives

At the time of writing, some of the scripts in the "p-temp-marcus" folder are much better optimized than others; many contain idiosyncrasies that are the result of jury-rigged solutions, and the code within them can be difficult to read and understand. Some scripts have been essentially abandoned and are kept on GitHub only for record-keeping purposes.

My objective in producing this script was to gather all of the code necessary for parameter estimation into a single file, so that our approach can be easily shared and replicated. Simply running the script from beginning to end outputs our parameter estimates in a reproducible fashion. I also used this opportunity to improve the clarity and speed of the code wherever I could.

Outline of our Approach

1. Read the dataframes p-temp-processed.csv and march_29_cell_data.csv (the latter contains the initial algal biovolume observations). Vertically bind these dataframes together and reconfigure the result so it can be read into our model-fitting functions. Of particular note is the fact that we now scale the algal biovolume by a factor of 250: the original biovolume estimates were measured per 1 mL, but we measured Daphnia as the total number of individuals in the 250 mL beaker. Scaling the algal biovolume by 250 gives the total algal biovolume in the beaker, which is the more appropriate quantity for our model. A minimal sketch of this step follows.
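As an illustration only, here is roughly what this step looks like in R, assuming tidyverse-style tools, hypothetical file paths, and a hypothetical `biovol` column (the real script may use different names):

```r
library(dplyr)

# Hypothetical file paths and column names -- the real script may differ.
ptemp   <- read.csv("data-processed/p-temp-processed.csv")
initial <- read.csv("data-raw/march_29_cell_data.csv")  # day-0 algal biovolumes

# Vertically bind so the initial observations join the time series,
# then scale algal biovolume from per-mL units up to the whole 250 mL beaker.
combined <- bind_rows(ptemp, initial) %>%
  mutate(biovol = biovol * 250)
```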

2. Declare our consumer-resource model. As in past scripts, the model is a pair of coupled differential equations. However, in a departure from past versions, we drop the half-saturation constant, b, for a few reasons (a sketch of the model declaration follows the list below):

(1) Dropping b shrinks the parameter search space by one dimension, which substantially improves the efficiency and reliability of the fitting algorithm and also makes choosing the initial parameters easier.

(2) The dynamics in our observational data show no visible signature of a half-saturation effect.

(3) Estimating both the attack rate a and the half-saturation constant b with least-squares methods (which we are using) is not recommended, because estimates of a and b are almost always strongly correlated and hence difficult to identify jointly. See: https://www.princeton.edu/~cap/papers/KnightesPeters_BB2000.pdf
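For concreteness, here is a sketch of what the model declaration could look like in simecol. The functional forms (logistic algal growth plus a linear, type I functional response, i.e. no b) and the parameter names r, K, a, eps, and m are my assumptions based on the description above, with placeholder values:

```r
library(simecol)

# Consumer-resource model: logistic algal growth plus a linear (type I)
# functional response, i.e. no half-saturation constant b.
CRmodel <- new("odeModel",
  main = function(time, init, parms) {
    with(as.list(c(init, parms)), {
      dP <- r * P * (1 - P / K) - a * P * H   # algal biovolume
      dH <- eps * a * P * H - m * H           # daphnia density
      list(c(dP, dH))
    })
  },
  parms  = c(r = 1, K = 10, a = 0.01, eps = 0.1, m = 0.05),  # placeholder values
  init   = c(P = 100, H = 10),                               # placeholder initial state
  times  = seq(0, 36, by = 0.1),
  solver = "lsoda"
)
```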

3. The data are split into 8 dataframes corresponding to the individual combinations of temperature and phosphorus treatments. For each of these 8 dataframes, we now run the fitting algorithm 50 times using randomly selected initial parameters. Parameters are drawn from a uniform distribution with user-selected minima and maxima. I selected the bounds of the parameter distributions by looking at the parameter estimates for our "best" fit (judged both by visual inspection and by least-squares fit), and then allowing one order of magnitude of deviation in either direction.

For example: suppose we obtain a fit that looks very good (correct bifurcation; correct number, placement, and sign of inflection points) and also has a low sum of squared deviations, and that its fitted parameters are r = 1 and K = 10.

To determine the bounds of our random sampling, we would then sample r on the interval [0.1, 10] and K on the interval [1, 100] (see the sketch below).
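A sketch of the random-restart loop for a single treatment dataframe, assuming the `CRmodel` object sketched above and a hypothetical observation dataframe `obs` with columns `day`, `P`, and `H`:

```r
set.seed(1)  # for reproducibility of the random restarts

# Bounds: one order of magnitude either side of the reference estimates.
ref   <- c(r = 1, K = 10, a = 0.01, eps = 0.1, m = 0.05)
lower <- ref / 10
upper <- ref * 10

fits <- vector("list", 50)
for (i in seq_len(50)) {
  # Draw a random starting point from the uniform bounds.
  p0 <- runif(length(ref), min = lower, max = upper)
  names(p0) <- names(ref)
  parms(CRmodel) <- p0
  # (init(CRmodel) would also be set from each treatment's day-0 observations.)

  # Least-squares fit of all parameters against this treatment's time series.
  fits[[i]] <- fitOdeModel(CRmodel,
                           whichpar = names(ref),
                           obstime  = obs$day,
                           yobs     = obs[, c("P", "H")],
                           lower    = lower,
                           upper    = upper,
                           method   = "PORT")
}
```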

Using 50 replicates per dataset was an essentially arbitrary decision. Using 10 replicates often left most of the datasets with poor fits, while using 100 replicates was prohibitively time-consuming. I chose 50 as a compromise, to gauge the performance of the code and to do troubleshooting. If we stay with the least-squares approach used in this script, I recommend using at least 100 replicates. My thoughts on this are fairly nuanced and I think it's best discussed in person.

4. The best fit is then determined using the least-squares metric: the parameter set whose model output produces the minimum sum of squared deviations from the observed data is considered the "best". The "best" parameter sets for all of the datasets are then saved into a dataframe, along with information identifying the treatment, etc. A sketch of this selection step follows.
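The selection step could look like the following sketch, which recomputes the sum of squared deviations for each candidate with simecol's ssqOdeModel() and keeps the minimum (the treatment labels `trt_temp` and `trt_phos` are hypothetical):

```r
# Recompute the sum of squared deviations for each candidate parameter set.
ssqs <- sapply(fits, function(f) {
  ssqOdeModel(p = f$par, simObj = CRmodel,
              obstime = obs$day, yobs = obs[, c("P", "H")])
})

best <- fits[[which.min(ssqs)]]

# Store the winning parameters alongside treatment metadata (labels hypothetical).
results <- data.frame(temperature = trt_temp,
                      phosphorus  = trt_phos,
                      ssq         = min(ssqs),
                      t(best$par))
```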

Visual confirmation of each fit is still necessary!

Most recent commit SHA: 596554124f10e43ce6d7dff6b159630a40986e75