Python analysis window - Githubissues

magland commented 4 months ago

Merged code from #94

Created "Analysis (Py)" tab.

Next step: provide sampling draws as input to script

magland commented 4 months ago

You can now run the following in analysis.py

import pandas as pd

df = pd.DataFrame(sp_sampling.draws, columns=sp_sampling.parameter_names)

print(df)

magland commented 4 months ago

Here's that example as gist project

http://localhost:3000/?project=https://gist.github.com/magland/8d987a04c9db7bee8636960a69ae7c7f

magland commented 4 months ago

There are a number of things to think about. What should the global variable (now sp_sampling) be called? Right now sp_sampling.draws has all the chains concatenated, and you also get sp_sampling.num_chains. It reflects the internal structure of the draws in the data model. Should that be adjusted? What's most intuitive for the user creating analysis.py?

WardBrian commented 4 months ago

Right now the UI lets you click over to analysis.py/analysis.R when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this

WardBrian commented 4 months ago

Should that be adjusted? What's most intuitive for the user creating analysis.py?

The most intuitive will depend pretty strongly on the user and their use case, but I think there are 3 main 'types' of uses:

the user cares about the markov chains as markov chains, so having a num_samples x num_chains x num_parameters (or the transpose of this) numpy array will be the most useful for them
The user is using an existing tool which relies on pandas dataframes or arviz InferenceData, so whatever we provide should be easy to convert to these
The user cares about one specific parameter in the model, in which case providing a way to extract that one item as a numpy array of the correct shape incredibly useful. This is what cmdstanpy calls stan_variable and is implemented in stanio. Note that the reason this is tricky is when I say "parameter", I am including things like a parameter declared to be a matrix, not just one column of output

magland commented 4 months ago

Right now the UI lets you click over to analysis.py/analysis.R when you have not yet run sampling. Moving those tabs down to the area we only show after sampling has completed seems like a good idea to prevent this

Okay I have reorganized the tabs. I do think it's important to be able to view and even edit the analysis scripts even prior to the completion of sampling. So those tabs are present. But if you try to execute the scripts before sampling is complete, you get a (helpful?) error message.

magland commented 4 months ago

Thinking about the interface for analysis.py to get the draws data.

We want something that can be made backward compatible so that as we make improvements to the interface, existing SP Projects will continue to function, at least to a reasonable degree.

The only way I can think of for doing that is to provide python imports that implement specific interfaces, and the user can select which they want to use.

So here's an example analysis.py

import matplotlib.pyplot as plt

# Import a specific interface
from sp_util import load_draws_v1

# Internally loads the draws from a file
# The actual implementation can change over time, but the API should be steady
draws = load_draws_v1()

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()

Here, by virtue of this specific import the user gets a particular interface, with the expectation that it will be maintained in a backward-compatible manner.

I prefer this import method compared with injecting a mysterious variable into the mix, because python isn't supposed to work that way. I think analysis.py should be a well-defined python script that does not assume any magic variables, but assumes certain modules (e.g., sp_util) exist.

In the implementation from my latest commit, the sp_util module does not come from pypi... but rather it is hard-coded in pyodide... so it's internal to the stan-playground web app. The actual sampling data (draws, parameter names, etc) are communicated behind-the-scenes by writing to a file that is available to the sp_util module. The load_draws_v1 function first loads the data from that file and then prepares an object that has the get_dataframes() method. It also has a draws.get_dataframe_longform() method.

WardBrian commented 4 months ago

from sp_util import load_draws_v1

I think doing semantic-versioning-by-symbol-name is pretty un-Pythonic as well. The code we provide to load the draws is the one thing we completely control, so it's the thing I'm least worried about backwards compatibility for, to be honest -- numpy or pandas updating and breaking something seems much more likely than us doing it. Especially if what we provide is an object, we can always just add more methods on it if we want newer functionality, and leave the old ones.

If we ever really needed to do something backwards-incompatible, I think the better way to handle that is something like a version parameter in the meta storage/query parameters.

I prefer this import method compared with injecting a mysterious variable into the mix, because python isn't supposed to work that way.

I think Python has enough magic in the language that this would be fine. I think the bigger argument as for why to do it by providing a global variable is that it makes it easier to implement a nice 'download this for use locally with cmdstanpy' easy -- we wouldn't need to provide a module to be imported or anything, just some simple-ish glue code which leaves behind the variable name we choose

magland commented 4 months ago

@WardBrian Are you suggesting the script would instead be the following, where "draws" is the special variable?

import matplotlib.pyplot as plt

# Get a list of dataframes, one for each chain
a = draws.get_dataframes()
print(a[0])

# Plot a histogram of `lp__`
plt.figure()
plt.hist(a[0]['lp__'])
plt.show()

WardBrian commented 4 months ago

Something like this, yes

magland commented 4 months ago

@WardBrian @jsoules

Implemented draws object for analysis.py based on what we had discussed. Here's a gist to try

http://localhost:3000?project=https://gist.github.com/magland/d811a59035037dd0e19e73900048a8fb

The analysis.py for this example, which illustrates the functionality.

import numpy as np
import matplotlib.pyplot as plt

# Histograms
for pname in ['gamma', 'beta', 'phi_inv', 'lp__']:
    plt.figure()
    plt.hist(draws.get(pname))
    plt.title(pname)
    plt.show()

print('PARAMETERS')
print('==========')
for p in draws.parameter_names:
    print(p)

print('')
print('PARAMETER VALUES')
print('================')
print('gamma:', draws.get('gamma'))
print('beta:', draws.get('beta'))
print('phi_inv:', draws.get('phi_inv'))

print('')
print('OTHER')
print('=====')
print('Mean of lp__:', np.mean(draws.get('lp__')))
print('Shape of y:', draws.get('y').shape)

print('')
print('DATAFRAME')
print('=========')

df = draws.as_dataframe()
print(df)

Note that y is a matrix parameter, but it gets flattened to a vector. Will want to use stanio instead, as noted in the comments in the source code. @WardBrian do you want to work on this? Here's the relevant script that gets loaded into the worker and then into the pyodide environment

https://github.com/flatironinstitute/stan-playground/pull/120/files#diff-356c7427b0952fcfd57ddd11e46493a6624870c92e611c335ee52f142272654d

magland commented 4 months ago

Assuming the structure of this looks good... In terms of merging order, it would help a lot to merge a basic version of this to minimize conflicts with other features we are working on (e.g., data generation). And then reopen a new PR with the needed tweaks to this one. What do you think?

magland commented 4 months ago

I have merged in recent changes from main and resolved some conflicts. In particular the non-relative imports.

WardBrian commented 4 months ago

@jsoules @magland this PR is now updated with respect to main. I've done a basic once-over to make sure everything still works as expected, but it can now be given a more thorough review!

magland commented 4 months ago

@WardBrian Nice! I have tested it and everything seems to be working.

As I mentioned I think we should do a separate PR for analysis.r and we can borrow code from #94 . LMK if you want me to work on that -- I can draft an initial version, but I will need help figuring out how to make the R draws object analogous to the python one.

WardBrian commented 4 months ago

I agree that should be a separate PR -- maybe after this is merged?

I'm currently trying to debug an issue with (I think) the TextEditor component - if I flip back and forth between the data.r and data.py tabs, the other text editors on the screen (e.g. main.stan) flicker/re-render -- maybe due to the monaco singleton getting touched? Any ideas @magland?

It's not introduced by this PR I'm pretty sure, just this makes it noticeable

Edit: This has been fixed

jsoules commented 4 months ago

Oh, one thing that would be nice that I didn't mention because it isn't strictly germane to this PR, but it would be good for us to document somewhere the expectations around the global data/memory sharing model that's used to pass data around among the workers/non-TS, non-Stan interpreters and our own app. Like, what's the data structure look like, what's the strategy.

magland commented 4 months ago

@WardBrian it looks like we had some simultaneous commits, doing some of the same things. Hopefully everything shakes out properly.

WardBrian commented 4 months ago

Yep, I resolved any conflicts

jsoules commented 4 months ago

Ok, looks to me like we're good to merge this--everybody good with that?

flatironinstitute / stan-playground

Python analysis window #120