StanJulia / StanIO.jl

Cmdstan related I/O operations
MIT License
1 stars 0 forks source link

Improved support for arrays of draws #2

Open WardBrian opened 8 months ago

WardBrian commented 8 months ago

I am working in a situation where my draws are a 2 or 3-D array, where one of the dimensions is # of parameters, and the others are meaningful (e.g. draws, chains)

If one of my parameters is, for example, a matrix[2,3], I'd love to be able to extract from my array a draws × chains × 2 × 3 array. So far as I can tell, the support here is all for dataframes

goedman commented 8 months ago

The "idea" is that with a (nested) DataFrame the matrix variable, e.g. m in the mixed_02_notebook.jl, one could do:

ndf.m

which gives an array of in this case 4000 matrix(4, 5) elements which can be reshaped in a 1000 x 4 x m[4, 5].

Is that what you are looking for? Is there an example in the stanio repo?

WardBrian commented 8 months ago

That seems fine for output, my issue is that what I currently have is not a DataFrame for input, but just an array.

In Python terms, my implementation works over numpy arrays rather than pandas DataFrames. It seems like the inputs here need to already be a dataframe, which if it could be relaxed would save some hassle

goedman commented 8 months ago

In the mixed_02_notebook.jl example, after:

df = StanIO.read_csvfiles(csvfiles, :dataframe) # Just copied here because df is used below.
a3d, col_names = StanIO.read_csvfiles(csvfiles, :array; return_parameters=true)

a3d is a 3-dimenional array (1000, 179, 4) and in col_names [:lp__, ..., m.1.1, m.1.2, m.1.3,..., m.4.5,...], the names as created by cmdstan.

You don't want to go the DataFrame route, so you can't use StanIO.find_nested_columns(df) => [:u, :m] or StanIO.select_nested_column(df, :m).

Are you asking to have additional methods like StanIO.find_nested_columns(a3d, col_names) => [:u, :m] and StanIO.select_nested_column(a3d, col_names, :m) => m, an array of 3 x 5 matrices?

WardBrian commented 8 months ago

Ideally for my use case would be either:

  1. I can feed a3d into something that outputs a dataframe. I don't mind using them, as long as I can stay a consumer of them rather than producing them
  2. Something like what you've described. I called it extract_reshape. It would be great to be able to write StanIO.extract_reshape(a3d, "m") and get back a 1000 × 4 × 3 × 5 array
goedman commented 8 months ago

Brian, do you want to get back a single array 1000 x 4 x 3 x 5 or a single array 1000 x 4 of 3 x 5 matrices or either.

If either, any preference for the default value of a keyword argument like nested = [false | true]?

The signature would look like extract_reshape(csvfiles, "m"; nested=true). The nested argument is ignored if "m" is a simple column, like "lp__" or "base" in this example.

WardBrian commented 8 months ago

Either one seems fine to me, in numpy there isn’t really a distinction between those

goedman commented 8 months ago

Just merged an initial version of extract_reshape(csvfiles, :m; nested=true).

For the case of nested=false I've for now taken a shortcut. I think the best way is to use a Julia package SplitApplyCombine.jl (and then probably a permutedims() step). I need to think a bit about introducing a new dependency although Andy Ferris is top notch. In general I find performance in this step (reading .csv files) not that important in my usage, but that might be different for you.

goedman commented 8 months ago

Decided that SplitApplyCombine.jl is a useful addition. Added it to release 1.1.1 (in the non-nested case of extract_reshape())

Updated the README to better position StanIO and included a usage section for its main functionality, extract_reshape().

I liked your statement above (in numpy there is little difference between nested and non-nested). What I like in the nested case is that I can say mean(df.m). I'm sure somewhere in Julia that is available for the non-nested case (or I should ask if Julia's mean() supports mean(a; dims=[3, 4])).

WardBrian commented 7 months ago

That seems like a helpful function. I think it still does a bit too much for my use-case by requiring CSV files.

I'm hoping to be able to use this package to provide some helper functions for a package which is all in-memory in Julia, so I never have filenames to pass around, just arrays and the list of variable names.

goedman commented 7 months ago

You have n_chains arrays of n_samples x length(list of variable_names)?

WardBrian commented 7 months ago

In my specific case I actually have a 3d array, with chains being the extra dimension in addition to the ones you listed

goedman commented 7 months ago

Like the a3d array StanSample.jl uses, n_samples x length(list of variables) x n_chains?

WardBrian commented 7 months ago

I have my indices in a different order, but if it’s best to transpose them I can.

In python I made the design choice to not care about any leading indices, I simply reproduce them in the output. So if you give me an array that’s 8 dimensions with the final one being length(list of variables) and ask to extract a variable which is a simple scalar in the Stan program, you’d get back a 7d array where those 7 dimensions are exactly the first 7 of the input

goedman commented 7 months ago

For now I've added an extract_reshape() method with the signature:

c = extract_reshape(a3d, col_names, var; nested=true) # where a3d has dimensions n_samples x n_vars x n_chains

I get a bit lost in your example above when you say "the final one being length(list of variables)" in case the length > 1. But if there is a simple permutation of the dimensions that can certainly be incorporated.

Over the years, still mostly supported in StanSample's read_samples(), I have tried many output_formats: :table, :tables, :mcmcchains, :keyedarray, etc. but in all cases after some time reverted back to :dataframe and :dataframes and more recently :nesteddataframe. The plural symbols do not append the chains (after diagnosing I don't think the chains are fundamental anymore).

WardBrian commented 7 months ago

I guess the advantage to the approach of not caring about the number of dimensions is some things naturally "work" nicely. For example, the output of optimization is actually just 1 dimensional, it's just length(parameter names). MCMC is frequently represented by either 2 or 3 dimensions, depending on whether or not the chains are concatenated together. But in general once you support the cases for 1,2, and 3 dimensional inputs, most likely what you wrote will work for more, even if the use cases are much rarer.

I'm not sure how well julia handles the idea of an array with an unknown (or, only known at runtime) number of dimensions