Arni/extract henry coeff

Surluson commented 5 years ago

Added two functions (along with docstrings and tests): extract_henry_coefficient: Takes a DataFrame, the names of the pressure and adsorption columns, and the number of points to use when extracting the Henry coefficient. I used MultivariateStats for a linear least-squares regression (llsq). One issue I found was that the llsq function doesn't like the columns in the DataFrames because their element type is usually a Union of floats and Missing.

fit_langmuir: Here I linearize N = (MKP)/(1 + KP) and solve it with llsq as well. I check whether the first pressure point is 0, because after linearizing I divide by the pressure (which blows up if P = 0).
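For readers following along, the two fits described above can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the PR's Julia code: `None` stands in for Julia's `missing`, the Henry fit is assumed to be a least-squares line through the origin, and the Langmuir linearization is assumed to be P/N = P/M + 1/(MK). The PR may use a different linear form, and the helper `linear_fit` is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def extract_henry_coefficient(pressures, loadings, n_pts):
    """Slope of the least-squares line through the origin, first n_pts complete rows."""
    # Drop rows with a missing entry before fitting (None mimics `missing`).
    rows = [(p, n) for p, n in zip(pressures, loadings)
            if p is not None and n is not None][:n_pts]
    return sum(p * n for p, n in rows) / sum(p * p for p, _ in rows)


def fit_langmuir_linearized(pressures, loadings):
    """Fit N = M*K*P / (1 + K*P) via the assumed linear form P/N = P/M + 1/(M*K)."""
    # Skip P = 0 (and N = 0): P/N is undefined there.
    rows = [(p, n) for p, n in zip(pressures, loadings) if p > 0 and n > 0]
    slope, intercept = linear_fit([p for p, _ in rows], [p / n for p, n in rows])
    M = 1.0 / slope          # slope = 1/M
    K = slope / intercept    # intercept = 1/(M*K)
    return M, K


# Synthetic, noise-free data generated from M = 4, K = 0.5 (illustrative values).
pressures = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
loadings = [4.0 * 0.5 * p / (1.0 + 0.5 * p) for p in pressures]

henry = extract_henry_coefficient([0.0, 0.1, None, 0.2], [0.0, 0.25, 0.4, 0.5], 3)
M, K = fit_langmuir_linearized(pressures, loadings)
print(henry, M, K)
```

On noise-free data the linearized fit recovers the generating M and K; with noisy data the recovered values depend on which linear form is chosen.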

SimonEnsemble commented 5 years ago

very nice and clean code!

Could you also return information about the goodness of fit, e.g. the error? This way we can assess the fit systematically. Does this package provide error bounds on the slope? So KH = 12 +/- 3.0? If so, that would be really helpful for assessing goodness of fit and automatically selecting how many points to include.
For Langmuir fitting, I typically minimize sum over i pressure points: [n_i - M * K * p_i / (1 + K * p_i)]^2 which is a non-linear optimization problem. You first linearized the Langmuir isotherm, which will not minimize the above. I haven't thought much about which is better (it is probably context dependent). Do you know the difference between the two methods? I imagine one will over-emphasize high or low pressure data points? This well-cited paper suggests that the linearized version (while easier) is not a good way to go: https://pubag.nal.usda.gov/download/5945/PDF The Optim.jl package is good for non-linear optimization.
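To make the difference between the two objectives concrete, here is a small Python sketch (an illustration, not code from the PR). It assumes the linearization P/N = P/M + 1/(MK); other linear forms weight the points differently again. The same absolute error of 0.1 is added to either the lowest-pressure or the highest-pressure loading, and both objectives are evaluated at the generating parameters.

```python
def langmuir(p, M, K):
    return M * K * p / (1.0 + K * p)


def direct_sse(pressures, loadings, M, K):
    """Sum over data points of [n_i - M*K*p_i / (1 + K*p_i)]^2."""
    return sum((n - langmuir(p, M, K)) ** 2 for p, n in zip(pressures, loadings))


def linearized_sse(pressures, loadings, M, K):
    """Squared residuals of the assumed linear form P/N = P/M + 1/(M*K)."""
    return sum((p / n - (p / M + 1.0 / (M * K))) ** 2
               for p, n in zip(pressures, loadings))


M, K = 4.0, 0.5  # illustrative values
pressures = [0.5, 1.0, 2.0, 5.0, 10.0]
exact = [langmuir(p, M, K) for p in pressures]

# Add the same absolute error (0.1) to the lowest- or the highest-pressure point.
low_shift = [exact[0] + 0.1] + exact[1:]
high_shift = exact[:-1] + [exact[-1] + 0.1]

direct_low = direct_sse(pressures, low_shift, M, K)
direct_high = direct_sse(pressures, high_shift, M, K)
linear_low = linearized_sse(pressures, low_shift, M, K)
linear_high = linearized_sse(pressures, high_shift, M, K)
print(direct_low, direct_high, linear_low, linear_high)
```

The direct objective penalizes both perturbations equally, while the linearized objective does not, so the two fits generally do not share a minimizer once the data are noisy.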

Surluson commented 5 years ago

I added an RMSE between the fitted points and the fitted line for now. Error bounds for the slope would require us to assume some error for the data points, no?

SimonEnsemble commented 5 years ago

The RMSE is smallest if you include two points, so based on RMSE, you would always choose two points to estimate the Henry coefficient. There is a tradeoff here though: including more points might mean more confidence in the slope estimate, but it might also mean less confidence if you are outside of the Henry regime. So how do we systematically choose the number of points to include in a Henry coefficient calculation?

The error estimate in the slope is based on Gaussian distributed noise, but the variance of that Gaussian is inferred from the data. See here: https://www.chem.utoronto.ca/coursenotes/analsci/stats/ErrRegr.html I'm not 100% sure the estimate of error in slope is the way to go, but certainly RMSE is not suitable on its own since that always leads to the conclusion to use only two points.
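As a sketch of the slope error estimate mentioned above, the standard textbook formula for a straight-line fit with intercept is s_slope = s_yx / sqrt(Sxx), where s_yx = sqrt(SSR / (n - 2)) is the residual standard deviation inferred from the data. The Python below is illustrative only (function name and data are made up, and it is not the PR's code). With n = 2 the line passes through both points, so the RMSE is zero while n - 2 = 0 leaves the slope error undefined.

```python
import math


def slope_with_standard_error(xs, ys):
    """Least-squares slope and its standard error for y = intercept + slope * x."""
    n = len(xs)
    if n < 3:
        # Two points are fitted exactly, so no residual variance can be inferred.
        raise ValueError("need at least 3 points (n - 2 degrees of freedom)")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_yx = math.sqrt(ssr / (n - 2))      # residual standard deviation
    return slope, s_yx / math.sqrt(sxx)  # standard error of the slope


# Made-up low-pressure data, for illustration only.
pressures = [0.1, 0.2, 0.3, 0.4]
loadings = [0.26, 0.49, 0.76, 0.99]
slope, slope_err = slope_with_standard_error(pressures, loadings)
print(f"KH = {slope:.3f} +/- {slope_err:.3f}")
```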

Code looks good, except in practice you need a more automatic guess for K and M; I suspect it will rarely converge to a global min with your default guess for K and M; it will often get stuck in a local min. In pyIAST, I use the last data pt times 1.1 as the saturation loading, and use the first data points to get a Henry coefficient KH; since KH = M * K at low pressure, this gives K = KH / M. But you already thought of a better way to get a good starting point! You can keep your linearized function for Langmuir fitting. Use those as starting params for the nonlinear fitting routine! :+1: Maybe it would be beautiful to keep one function with a method=:nonlinear default option, which calls method=:linear for starting params.

SimonEnsemble commented 5 years ago

@Surluson this might be stale, could you re-make a pull request now that we have Travis working?

Also it looks like the Langmuir guess is a poor starting point; we can do better than having a default guess of M=1.0; this will probably only rarely converge. If you guess M as, say, 1.1 times the maximum loading in the DataFrame, then it will be more robust. Then you can estimate the Henry coefficient from the first point, then get the Langmuir K from that (K = KH / M) as a better estimate. (This is simpler, two lines of code, than the linear fitting I suggested above.)

SimonEnsemble commented 5 years ago

@Surluson checks failed.

looks like there are commented out lines there?
I just realized we can generalize this, since there is a lot of repeated code. How about model can be henry or langmuir, then we have one function fit_isotherm(df, pressure_col_name, n_col_name, model), and then we have _guess(df, model) that guesses the params specific to each model. The guess for the Henry coeff can be the slope based on the first data pt. For the Langmuir: the slope from the first data pt, then 1.1 times the max value of adsorbed gas in the data. So instead of using the llsq function, we use the non-linear fitting routine for generality. The objective function is just different for every isotherm. This will allow us to incorporate all sorts of isotherm models, and it is as easy as adding if model == :mymodel, then adding a guess and writing the objective function and functional form for that model.
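The generalized design proposed above could look roughly like the following Python sketch. It is an illustration under assumptions, not the merged Julia implementation: a dict of columns stands in for the DataFrame, a simple compass (pattern) search stands in for the non-linear fitting routine (the thread suggests Optim.jl), and `MODELS` and `compass_search` are hypothetical names. The Langmuir guess uses the low-pressure limit KH = M * K.

```python
def henry(p, params):
    (kh,) = params
    return kh * p


def langmuir(p, params):
    m, k = params
    return m * k * p / (1.0 + k * p)


MODELS = {"henry": henry, "langmuir": langmuir}


def _guess(pressures, loadings, model):
    """Starting parameters specific to each model."""
    # Henry coefficient estimate: slope through the first point with P > 0.
    p0, n0 = next((p, n) for p, n in zip(pressures, loadings) if p > 0)
    kh0 = n0 / p0
    if model == "henry":
        return [kh0]
    if model == "langmuir":
        m0 = 1.1 * max(loadings)   # saturation loading: 1.1 x largest loading
        return [m0, kh0 / m0]      # low-pressure limit: KH = M * K
    raise ValueError(f"unknown model: {model}")


def compass_search(objective, x0, step=0.5, tol=1e-8, max_iter=10000):
    """Derivative-free minimizer: try relative +/- steps on each parameter."""
    x, fx = list(x0), objective(x0)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] *= 1.0 + sign * step
                f_trial = objective(trial)
                if f_trial < fx:
                    x, fx, improved = trial, f_trial, True
        if not improved:
            step *= 0.5  # shrink the step once no direction helps
    return x, fx


def fit_isotherm(table, pressure_col_name, n_col_name, model):
    """Return (params, mse) for the chosen model; `table` is a dict of columns."""
    pressures, loadings = table[pressure_col_name], table[n_col_name]
    f = MODELS[model]

    def mse(params):
        return sum((n - f(p, params)) ** 2
                   for p, n in zip(pressures, loadings)) / len(pressures)

    return compass_search(mse, _guess(pressures, loadings, model))


# Synthetic, noise-free Langmuir data from M = 4, K = 0.5 (illustrative values).
P = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
table = {"P": P, "N": [4.0 * 0.5 * p / (1.0 + 0.5 * p) for p in P]}
(M, K), err = fit_isotherm(table, "P", "N", "langmuir")
print(M, K, err)
```

Adding another isotherm in this layout means adding one entry to `MODELS` and one branch to `_guess`; the objective and the minimizer stay unchanged.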

Surluson commented 5 years ago

It seems all tests failed because MultivariateStats wasn't in the REQUIRE file. And I really like that idea, I'll make the changes asap

SimonEnsemble commented 5 years ago

@Surluson to get this PR merged how about we simplify it and

return the MSE without trying some way of normalizing it
assume the user will trim the dataframe themselves if they don't want any points to be used in the fitting (i.e. do not pass n_pts as the number of points to use or, more ambitiously, choose n_pts automatically for Henry fits)

Surluson commented 5 years ago

I've made some edits to the code. As for the first point: This is a value I used after calculating the slope from the first 3 points. At the time, the code was also trying to find the perfect number of data points to use for the fitting, so this was also used to make sure exactly 3 data points were used for the fitting procedure.

Surluson commented 5 years ago

Yeah sorry, those shouldn't be there. They've been removed

SimonEnsemble commented 5 years ago

fit_isotherm --> fit_adsorption_isotherm
simplified Henry test (I thought we didn't need to load in a .csv for that and have an additional file lying around just for this simple test when we could put the data in the code)
added comments to explain guesses, plus comment to explain what _guess is doing.

please review my changes and approve, then we can merge the PR; it looks good to me!

Surluson commented 5 years ago

Everything looks good to me :+1: You'll have to approve the changes though, because I made the pull request.

SimonEnsemble / PorousMaterials.jl

Arni/extract henry coeff #100