Open vinopm opened 4 years ago
Heads up @mrow84 @bobturneruk - the "data pipeline api" label was applied to this issue.
@mrow84 In the Contact Tracing Model, there is an input distribution where we pass in a list of bins and a list of their respective weights. The model creates an enumerated integer distribution out of this. Can we add support for this type of distribution as well?
Something like this?
[population-ages]
type = "distribution"
distribution = "enumerated"
bins = [0, 15, 25, 55, 65, 90]
weights = [0.1759, 0.1171, 0.4029, 0.1222, 0.1819]
Yep, I think I would call this a categorical distribution.
Ok, so we will have support for this format:
[population-ages]
type = "distribution"
distribution = "categorical"
bins = [0, 15, 25, 55, 65, 90]
weights = [0.1759, 0.1171, 0.4029, 0.1222, 0.1819]
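For illustration, here is a minimal sketch of how an implementation might sanity-check this table once the config has been parsed into a dict. The function name `validate_categorical` is hypothetical, and the rule that N weights need N + 1 bins is an assumption based on the example above, where the bins act as range edges.

```python
import math


def validate_categorical(params):
    """Check a parsed categorical table; raise ValueError if it is inconsistent."""
    if params.get("distribution") != "categorical":
        raise ValueError("expected distribution = 'categorical'")
    bins = params["bins"]
    weights = params["weights"]
    # Assumption: bins are range edges, so N weights need N + 1 bins.
    if len(bins) != len(weights) + 1:
        raise ValueError("need exactly one more bin edge than weights")
    # Edges must be strictly increasing for the ranges to be well defined.
    if any(lo >= hi for lo, hi in zip(bins, bins[1:])):
        raise ValueError("bins must be strictly increasing")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # Allow for floating-point rounding when summing the probabilities.
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")


population_ages = {
    "type": "distribution",
    "distribution": "categorical",
    "bins": [0, 15, 25, 55, 65, 90],
    "weights": [0.1759, 0.1171, 0.4029, 0.1222, 0.1819],
}
validate_categorical(population_ages)  # passes silently for the example above
```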
The only thing I wonder is if we might want the categories to be strings rather than numbers, and then parse them when it is deemed appropriate.
@mrow84 the categories refer to ranges:
i.e. Age 0-15 -> 0.1759 probability
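To make the range interpretation concrete, a rough sketch of sampling an age from this format could look like the following. The function name `sample_age` is hypothetical. Treating each range as half-open and sampling uniformly within it are both assumptions; the Contact Tracing Model may do this differently.

```python
import random


def sample_age(bins, weights, rng=random):
    """Pick a range by its weight, then pick an integer age inside that range."""
    # Choose the index of one range, with probability given by its weight.
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    lo, hi = bins[i], bins[i + 1]
    # Assumption: the range is [lo, hi) and ages are uniform within it.
    return rng.randrange(lo, hi)


bins = [0, 15, 25, 55, 65, 90]
weights = [0.1759, 0.1171, 0.4029, 0.1222, 0.1819]
rng = random.Random(42)  # seeded so runs are repeatable
ages = [sample_age(bins, weights, rng) for _ in range(5)]
```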
In some sense that makes me feel like a string may be even more appropriate, in that you could encode the range more explicitly - they do something like that in simple network sim. I am happy for you to leave the range stuff, but I do think that if the string conversion isn't too difficult that it would be a positive, because it is useful to be able to form discrete distributions over more arbitrary categories.
@mrow84 I agree with you. It will require quite a bit of work to support the string range format, so I'll add that as a TODO, but let's keep this format for now as a first step. Would that be ok?
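As a note for that TODO, a string range label could be parsed along these lines. This is only a sketch: the `"0-15"` label syntax and the function name `parse_range_label` are assumptions, not an agreed format.

```python
def parse_range_label(label):
    """Parse a label such as "0-15" into an integer pair (0, 15)."""
    lo_text, sep, hi_text = label.partition("-")
    if not sep:
        raise ValueError(f"not a range label: {label!r}")
    # int() raises ValueError for anything that is not an integer.
    lo, hi = int(lo_text), int(hi_text)
    if lo >= hi:
        raise ValueError(f"empty or reversed range: {label!r}")
    return lo, hi


labels = ["0-15", "15-25", "25-55", "55-65", "65-90"]
ranges = [parse_range_label(label) for label in labels]
```

Labels that are not ranges would stay as plain strings, which keeps the door open for discrete distributions over arbitrary categories.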
I have been going through the distributions trying to come up with standardised parameterisations. This is what I have now, with links to Wikipedia for parameterisation references. It includes the distributions required by both the Java models and the EERA model (@kzscisoft / @peter-t-fox). I realise that in some circumstances this may require changing file contents, so please let me know if this is too much of a drag, but I think we may already have some differences anyway, so someone is going to have to change something!
k: shape, theta: scale
mu: mean, sigma: standard deviation
a: lower bound, b: upper bound
lambda: rate
lambda: rate
bins: category labels, weights: category probabilities
alpha: shape, beta: shape
Also adding:
n: trials, p: probability of success
n: trials, p: probabilities of success
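As a rough illustration of how an implementation could consume these parameterisations, here is a sketch using only Python's standard `random` module. It covers just the distributions named elsewhere in this thread (gamma, uniform, exponential, categorical). The function name `sample`, the lowercase `"gamma"` key, and the pairing of each parameter set with a distribution name are assumptions.

```python
import random


def sample(params, rng=random):
    """Draw one value from a parameter table using the names listed above."""
    kind = params["distribution"]
    if kind == "gamma":
        # random.gammavariate takes (shape, scale), matching k and theta.
        return rng.gammavariate(params["k"], params["theta"])
    if kind == "uniform":
        return rng.uniform(params["a"], params["b"])
    if kind == "exponential":
        # random.expovariate takes the rate, i.e. 1 / mean.
        return rng.expovariate(params["lambda"])
    if kind == "categorical":
        # Here bins are category labels, one per weight.
        return rng.choices(params["bins"], weights=params["weights"], k=1)[0]
    raise ValueError(f"unsupported distribution: {kind!r}")


rng = random.Random(7)  # seeded so runs are repeatable
delay = sample({"distribution": "gamma", "k": 2.0, "theta": 3.0}, rng)
```

Note that in this list `bins` holds one label per weight, whereas the range example earlier in the thread has one more bin edge than weights, so an implementation would need to settle which convention applies.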
Discussed with @mrow84 that we would need to add support for 'exponential' and 'uniform' distributions in all Standard API implementations.
We also need to update the Standard API spec with more information about which distributions are supported.
I believe the complete list of distributions we would support is:
Gamma
Exponential
Linear
[x] Update API specification document
[x] Python implementation
[ ] Java implementation
[ ] C++ implementation
[ ] Julia implementation