LSSTDESC / sacc

Save All Correlations and Covariances

incorporate SNe data types #42

Open beckermr opened 4 years ago

beckermr commented 4 years ago

Had a chat with @reneehlozek and she is excited about this. So is @slosar so let's get it done!

reneehlozek commented 4 years ago

So I need to familiarize myself with sacc, but my first few questions are:

I think these are fairly obvious once I RTFM but there you go

beckermr commented 4 years ago

Right. So internally, SACC is simply a list of values with tags/properties attached. Thus I think you can put in anything you want. In terms of covariance, it all goes into one giant, possibly block-structured, covariance matrix.

A related issue is #41. If we did that, we could concatenate independent datasets together. This would be nice in this case since the SNe stuff could be made and then tacked onto the end.

I would say SACC lacks a great way to attach covariance matrix elements to specific data points, which will be an issue.
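For concreteness, that internal model can already be exercised with the existing API. A minimal sketch; the 'supernova_distance_mu' type string and the z tag name here are placeholders, not an agreed convention:

import numpy as np
import sacc

s = sacc.Sacc()
s.add_tracer('misc', 'sn_sample')

# Each data point is just a value plus arbitrary tags.
for z, mu in [(0.1, 38.3), (0.3, 41.0), (0.5, 42.3)]:
    s.add_data_point('supernova_distance_mu', ('sn_sample',), mu, z=z)

# One covariance for the whole data vector (possibly block-structured).
s.add_covariance(np.diag([0.01, 0.02, 0.03]))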

reneehlozek commented 4 years ago

Ah ok, we have a few different covariance 'options' depending on the flags that the marvellous @dscolnic and I decided on (different combinations of systematics), so as long as we could keep a few covariances in the format, that would be fine. As for the covariance indices, I think that would be ok; we'd assume they were the same size and throw an error if not.

beckermr commented 4 years ago

Hmmmmm. Right now it supports one covariance per dataset. Just for my own knowledge, what is the actual data here? My simple brain has redshift and some sort of distance modulus.

reneehlozek commented 4 years ago

Ah ok... the data are z vectors and distance moduli, but the covariance in mu depends on which systematics are included (all, only calibration, only bandpass uncertainties, only SALT2 model fits, etc.). This will help with studies of systematics and with e.g. reproducing the SRD SN analysis. So I guess we'll need N copies of the sacc labelled by the assumptions?
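That "N copies labelled by the assumptions" option is cheap to script with the existing API. A sketch, assuming base_sacc holds the z/mu data points with no covariance attached yet, and covs is a dict mapping a systematics label to its matrix:

# covs: e.g. {'all': ..., 'cal_only': ..., 'salt2_only': ...}
for label, cov in covs.items():
    s = base_sacc.copy()
    s.add_covariance(cov)
    s.save_fits(f'sn_{label}.fits', overwrite=True)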

slosar commented 4 years ago

The way I think this should be done is that there are several matrices in the sacc that give sufficient information for firecrown to build whatever covariance matrix needs to be used. If there are systematics that cannot be absorbed into the covariance matrix, also put in the information that is needed to model them (and then have code in firecrown implement them). I think it is useful to have one special field which we call the covariance matrix (for plotting, etc.), but in reality things will be more complicated, so the format has to be flexible enough to accommodate some number of 2d matrices and 1d vectors labelled by tags. We definitely don't want a proliferation of 10 different saccs with various permutations of systematics.
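To make that proposal concrete: none of the following exists in sacc, and the method names are invented purely for illustration, but the interface being described would look something like this:

# Hypothetical API -- sacc provides none of these methods.
s.add_matrix('cov_calibration', cal_cov)   # tagged 2d matrices
s.add_matrix('cov_salt2', salt2_cov)
s.add_vector('sys_bandpass_shift', dband)  # tagged 1d systematics vectors
# firecrown would then assemble whatever covariance it needs:
cov = build_covariance(s, include=['calibration', 'salt2'])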

beckermr commented 4 years ago

I'm pretty shrug on all of this. A bunch of files is simpler to deal with than a bunch of stuff in code, but feel free to do whatever you all think is best. I won't be doing a lot of dev here, of course.

slosar commented 4 years ago

Sure, nobody expects you to do lots of dev here. But if people are forced to put the logic into firecrown, it will be done right. Otherwise, I can see through time: some poor bastard will write a script that takes and combines things and spits out a bunch of files, the script will go to hell, and everybody will soon be confused about which file is which.

beckermr commented 4 years ago

The logic being in firecrown (whatever that means) doesn't help. Someone still has to write the logic to get the data into the SACC file in the first place (which is not much different from the logic to combine different parts of different files). Remembering which file is which is no different from remembering which tag means what in the SACC file. There is no free lunch here.

Simpler pipelines that follow unix-like principles will be easier to debug, maintain, and understand in the long run.

joezuntz commented 4 years ago

I'm inclined to agree with @beckermr here, adding the complexity to an earlier phase would seem simpler. Otherwise you have to start thinking about general covariance models within the file format.

beckermr commented 4 years ago

Thanks @joezuntz. Let's try to get one data vector with covariance and systematics into the format to start, and then maybe we can move on from there.

slosar commented 4 years ago

Ok, I'm happy to be overridden here... One data vector and cov matrix to rule them all.

reneehlozek commented 4 years ago

I've made a branch with a first simple attempt here: https://github.com/LSSTDESC/sacc/tree/desc-sn but am having some issues:

joezuntz commented 4 years ago

@reneehlozek some notes on this.

The tracer represents the general population of objects on which measurements are made. It records information that should go forward to parameter estimation. If the only thing that is needed is the name (i.e. if there's no other metadata that's needed for theory prediction) then the Misc tracer is suitable. In your notebook you would just add one like this:

S.add_tracer('misc', 'sn_ddf_sample')

(If on the other hand there are other things we need to record that are common to this sample then we should create a new Tracer subclass).

When we come to add data points we would use this tracer in a single-element tuple (each data point can have multiple indices because cross-correlations are included). Other information that is needed for a theory prediction, most obviously the redshift value, is recorded as a tag. So your calls would look like this:

S.add_data_point(sndata_type, ('sn_ddf_sample',), mb[i], z=zcmb[i])

This works for me in your notebook, and I can save later.

You could add other tags as keywords on that call if you wanted to save them too, like stretch and color values, or zhelio, or anything else.
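Putting these notes together, a full round trip would look roughly like the following. The arrays are illustrative stand-ins for the notebook's, and the sndata_type string is a placeholder for whatever name gets agreed:

import numpy as np
import sacc

zcmb = np.array([0.12, 0.35, 0.51])    # stand-in redshifts
mb = np.array([38.9, 41.3, 42.4])      # stand-in apparent magnitudes
cov = np.diag([0.01, 0.02, 0.03])      # matching covariance
sndata_type = 'supernova_distance_mu'  # placeholder type string

s = sacc.Sacc()
s.add_tracer('misc', 'sn_ddf_sample')
for i in range(len(zcmb)):
    s.add_data_point(sndata_type, ('sn_ddf_sample',), mb[i], z=zcmb[i])
s.add_covariance(cov)
s.save_fits('sn_ddf.fits', overwrite=True)

# Tags and covariance come back with the data on reload.
s2 = sacc.Sacc.load_fits('sn_ddf.fits')
print(s2.mean, s2.get_tag('z'))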