magland closed this 4 months ago.
You can now run the following in analysis.py:

```python
import pandas as pd

df = pd.DataFrame(sp_sampling.draws, columns=sp_sampling.parameter_names)
print(df)
```
Here's that example as a gist project:
http://localhost:3000/?project=https://gist.github.com/magland/8d987a04c9db7bee8636960a69ae7c7f
There are a number of things to think about. What should the global variable (currently `sp_sampling`) be called? Right now `sp_sampling.draws` has all the chains concatenated, and you also get `sp_sampling.num_chains`; this reflects the internal structure of the draws in the data model. Should that be adjusted? What's most intuitive for the user creating analysis.py?
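For example, recovering the per-chain structure currently means splitting the concatenated array yourself. A hypothetical snippet, assuming the chains are stacked in equal-length blocks along the first axis:

```python
import numpy as np

# Split the concatenated draws back into one array per chain.
# Assumes sp_sampling.draws stacks the chains in order, each
# contributing the same number of rows.
chains = np.split(np.asarray(sp_sampling.draws), sp_sampling.num_chains)

for i, chain in enumerate(chains):
    print(f'chain {i}: {chain.shape[0]} draws x {chain.shape[1]} columns')
```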
Right now the UI lets you click over to analysis.py/analysis.R even when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this.
> Should that be adjusted? What's most intuitive for the user creating analysis.py?
The most intuitive will depend pretty strongly on the user and their use case, but I think there are 3 main 'types' of uses, and a `num_samples x num_chains x num_parameters` (or the transpose of this) numpy array will be the most useful for them. This is the behavior of `stan_variable` and is implemented in stanio. Note that the reason this is tricky is that when I say "parameter", I am including things like a parameter declared to be a matrix, not just one column of output.
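Purely as an illustration of that layout (the names and the stacking order here are assumptions, not any existing API):

```python
import numpy as np

# Hypothetical flat output: num_chains blocks of num_samples rows each,
# one column per flattened parameter.
num_chains, num_samples, num_columns = 4, 1000, 7
flat = np.random.randn(num_chains * num_samples, num_columns)

# Reshape into the num_samples x num_chains x num_parameters layout:
# first split into per-chain blocks, then move the chain axis inward.
arr = flat.reshape(num_chains, num_samples, num_columns).transpose(1, 0, 2)
print(arr.shape)  # (1000, 4, 7)
```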
> Right now the UI lets you click over to analysis.py/analysis.R when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this.

Okay, I have reorganized the tabs. I do think it's important to be able to view and even edit the analysis scripts prior to the completion of sampling, so those tabs are still present. But if you try to execute the scripts before sampling is complete, you get a (helpful?) error message.
Thinking about the interface for analysis.py to get the draws data: we want something that can be made backward compatible, so that as we make improvements to the interface, existing SP Projects will continue to function, at least to a reasonable degree. The only way I can think of doing that is to provide Python imports that implement specific interfaces, and the user can select which one they want to use. So here's an example analysis.py:
```python
import matplotlib.pyplot as plt

# Import a specific interface
from sp_util import load_draws_v1

# Internally loads the draws from a file.
# The actual implementation can change over time, but the API should stay steady.
draws = load_draws_v1()

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()
```
Here, by virtue of this specific import, the user gets a particular interface, with the expectation that it will be maintained in a backward-compatible manner.

I prefer this import method to injecting a mysterious variable into the mix, because Python isn't supposed to work that way. I think analysis.py should be a well-defined Python script that does not assume any magic variables, but does assume certain modules (e.g., sp_util) exist.

In the implementation from my latest commit, the sp_util module does not come from PyPI; rather, it is hard-coded in Pyodide, so it's internal to the stan-playground web app. The actual sampling data (draws, parameter names, etc.) are communicated behind the scenes by writing to a file that is available to the sp_util module. The load_draws_v1 function first loads the data from that file and then prepares an object that has the get_dataframes() method. It also has a draws.get_dataframe_longform() method.
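Roughly, sp_util could look something like the sketch below (the file path and JSON layout are illustrative assumptions, not the actual wire format):

```python
# sp_util.py -- hypothetical sketch, not the actual implementation
import json

import pandas as pd

_DRAWS_FILE = '/draws.json'  # written by the app before analysis.py runs


class _DrawsV1:
    def __init__(self, data):
        self._columns = data['parameter_names']
        self._chains = data['draws']  # one list of rows per chain

    def get_dataframes(self):
        # One DataFrame per chain
        return [pd.DataFrame(c, columns=self._columns) for c in self._chains]

    def get_dataframe_longform(self):
        # Single DataFrame with a `chain` column appended
        frames = []
        for i, df in enumerate(self.get_dataframes()):
            df['chain'] = i
            frames.append(df)
        return pd.concat(frames, ignore_index=True)


def load_draws_v1():
    with open(_DRAWS_FILE) as f:
        return _DrawsV1(json.load(f))
```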
> from sp_util import load_draws_v1
I think doing semantic-versioning-by-symbol-name is pretty un-Pythonic as well. The code we provide to load the draws is the one thing we completely control, so it's the thing I'm least worried about backward compatibility for, to be honest -- numpy or pandas updating and breaking something seems much more likely than us doing it. Especially if what we provide is an object, we can always just add more methods to it if we want newer functionality, and leave the old ones.

If we ever really needed to do something backwards-incompatible, I think the better way to handle that is something like a `version` parameter in the meta storage/query parameters.
> I prefer this import method compared with injecting a mysterious variable into the mix, because python isn't supposed to work that way.
I think Python has enough magic in the language that this would be fine. The bigger argument for providing a global variable is that it makes it easy to implement a nice 'download this for use locally with cmdstanpy' feature -- we wouldn't need to provide a module to be imported or anything, just some simple-ish glue code which leaves behind the variable name we choose.
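For instance, the downloaded script could be prefixed with something like this (a rough sketch; the SPDraws wrapper is hypothetical, and structured parameters would stay flattened here):

```python
# Hypothetical glue for local use: run the model with cmdstanpy and
# leave behind a `draws` variable for the analysis script to use.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file='main.stan')
fit = model.sample(data='data.json')


class SPDraws:
    def __init__(self, fit):
        self._df = fit.draws_pd()  # includes sampler columns like lp__
        self.parameter_names = list(self._df.columns)

    def get(self, pname):
        return self._df[pname].to_numpy()


draws = SPDraws(fit)
# ...the unmodified analysis.py follows...
```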
@WardBrian Are you suggesting the script would instead be the following, where "draws" is the special variable?
```python
import matplotlib.pyplot as plt

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()
```
Something like this, yes
@WardBrian @jsoules
Implemented draws object for analysis.py based on what we had discussed. Here's a gist to try
http://localhost:3000?project=https://gist.github.com/magland/d811a59035037dd0e19e73900048a8fb
Here's the analysis.py for this example, which illustrates the functionality:
```python
import numpy as np
import matplotlib.pyplot as plt

# Histograms
for pname in ['gamma', 'beta', 'phi_inv', 'lp__']:
    plt.figure()
    plt.hist(draws.get(pname))
    plt.title(pname)
    plt.show()

print('PARAMETERS')
print('==========')
for p in draws.parameter_names:
    print(p)
print('')

print('PARAMETER VALUES')
print('================')
print('gamma:', draws.get('gamma'))
print('beta:', draws.get('beta'))
print('phi_inv:', draws.get('phi_inv'))
print('')

print('OTHER')
print('=====')
print('Mean of lp__:', np.mean(draws.get('lp__')))
print('Shape of y:', draws.get('y').shape)
print('')

print('DATAFRAME')
print('=========')
df = draws.as_dataframe()
print(df)
```
Note that y is a matrix parameter, but it gets flattened to a vector. We'll want to use stanio instead, as noted in the comments in the source code. @WardBrian, do you want to work on this? Here's the relevant script that gets loaded into the worker and then into the pyodide environment.
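For reference, the stanio-based version might look roughly like this (I'm going from memory on parse_header/extract_reshape, so treat the exact calls and the dotted column naming as assumptions to verify against stanio itself):

```python
import numpy as np
import stanio

# Hypothetical flat output: 100 draws of a 2x3 matrix parameter y,
# with CmdStan-style dotted column names like 'y.1.1'.
column_names = [f'y.{i}.{j}' for i in range(1, 3) for j in range(1, 4)]
flat_draws = np.random.randn(100, len(column_names))

# parse_header maps each Stan variable to its slice of the columns;
# extract_reshape restores the declared shape.
variables = stanio.parse_header(','.join(column_names))
y = variables['y'].extract_reshape(flat_draws)
print(y.shape)  # expected: (100, 2, 3)
```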
Assuming the structure of this looks good... in terms of merging order, it would help a lot to merge a basic version of this first, to minimize conflicts with other features we are working on (e.g., data generation), and then open a new PR with the needed tweaks to this one. What do you think?
I have merged in recent changes from main and resolved some conflicts, in particular the non-relative imports.
@jsoules @magland this PR is now updated with respect to main. I've done a basic once-over to make sure everything still works as expected, but it can now be given a more thorough review!
@WardBrian Nice! I have tested it and everything seems to be working.
As I mentioned, I think we should do a separate PR for analysis.R, and we can borrow code from #94. LMK if you want me to work on that -- I can draft an initial version, but I will need help figuring out how to make the R draws object analogous to the Python one.
I agree that should be a separate PR -- maybe after this is merged?
I'm currently trying to debug an issue with (I think) the TextEditor component: if I flip back and forth between the data.r and data.py tabs, the other text editors on the screen (e.g. main.stan) flicker/re-render -- maybe due to the monaco singleton getting touched? Any ideas @magland? I'm pretty sure it's not introduced by this PR; this just makes it noticeable.

Edit: This has been fixed.
Oh, one thing that would be nice that I didn't mention because it isn't strictly germane to this PR: it would be good for us to document somewhere the expectations around the global data/memory-sharing model used to pass data among the workers, the non-TS/non-Stan interpreters, and our own app -- what the data structure looks like, and what the strategy is.
@WardBrian it looks like we had some simultaneous commits, doing some of the same things. Hopefully everything shakes out properly.
Yep, I resolved any conflicts
Ok, looks to me like we're good to merge this -- everybody good with that?
- Merged code from #94
- Created "Analysis (Py)" tab
- Next step: provide sampling draws as input to the script