Open inesll opened 2 years ago
I had a little bit of a think how this might be done.
One potential route, though potentially quite costly, would be to add an additional property for each symptom registered which records the date at which the symptom last onset (more specifically I think this would be the earliest date at which the symptom was set as present by any disease module and has been set as present by any disease module continuously up to the current simulation date). This would be simple to code up and also to add a helper method which would allow checking the duration for which any symptom has been present for a given person or persons. The main problem I would forsee with this is that it is adding a lot of columns to the dataframe (which slows lots of dataframe access operations) and for a lot of the symptoms this information is possibly not needed.
To reduce the number of columns added, we could instead add a property to each Symptom
specifying whether the date of onset should be recorded, and only add a property recording the symptom onset date in this case. This would also allow the option of running with or without this information being recorded depending on whether particular modules are registered and/or flags set in a module to indicate whether to use this information. We would have to be careful about dealing with symptoms that are registered in multiple modules in this case.
Thanks for thinking about this @matt-graham.
Could the information about date of onset be held somewhere other than the sim.population.props
dataframe, in order to avoid that performance penalty?
Also, as we don't need the exact date -- just need to know whether it was a 0, 1 or 2 weeks ago -- I wondered whether the date (or some "coarsification" of the date) could be overloaded into the BitSetHandler thing --- so that what is being stored there is not the module, but the module and week number (say) of onset. e.g. HIV-02, HIV-03, etc.
I am deferring 100% to you for guidance though!
Could the information about date of onset be held somewhere other than the
sim.population.props
dataframe, in order to avoid that performance penalty?
We could potentially store the information in a dictionary or similar in SymptomManager
but ideally I think we should avoid storing simulation state outside of the population dataframe were possible, as this breaks the model that the population dataframe (plus event queue) represents the state of the simulation at a given simulation time and so can be used to for example check if two simulation runs are in consistent state or to allow saving / restoring a simulation (which would allow trying out some alternative calibration approaches such as particle filters which could fit well with running things in batch on Azure). While there are already various places we depart from this, I think it would be good to try to keep to this principle as far as possible.
Also, as we don't need the exact date -- just need to know whether it was a 0, 1 or 2 weeks ago -- I wondered whether the date (or some "coarsification" of the date) could be overloaded into the BitSetHandler thing --- so that what is being stored there is not the module, but the module and week number (say) of onset. e.g. HIV-02, HIV-03, etc.
Doing some form of coarsification like this is an interesting idea to reduce the overhead. The bitset handler uses a 64-bit integer type to store the values so we have a maximum of 64 bits to play with per bitset handler column; that means doing a one-hot encoding of the week number wouldn't work but we could potentially for example use 6 bits to represent an integer [0, 63] which would mean we could store the yearly week number for 10 symptoms per column. This would however create an ambiguity for any symptoms which last for over a year, which while possibly rare I think we'd probably still want to come up with some way of dealing with (possibly just reporting a fixed >1 year length for any symptoms which onset >= 52 weeks ago).
If the number of symptoms that we would need to record the onset date for is relatively small then I think the overhead of having a few extra columns will not be too significant (for the model configuration currently the default for the scale_run.py
script there are 367 columns in the population dataframe of which 41 are symptom columns), and I think just directly having a onset date column for each would probably be the simplest solution. If it later turned out we need the dates for a lot of symptoms we could then switch to using a coarser representation in the population dataframe. To simplify making this change we could have the method for evaluating the symptom duration only return a coarse-grained value so that we are free to use a coarser representation in the backend in the future.
To be more concrete my suggestion would be something like:
record_onset_date
argument (and corresponding class attribute) to the Symptom
class initialiser (defaulting to False
) that allows indicating whether the onset date of a a symptom should be recorded.SymptomManager.register_symptom
to deal with the case that mulitple Symptom
instances are registered which match in all properties other than the value of record_onset_date
, in which case we should register the Symptom
with record_onset_date=True
as if any module requests the date to be recorded this should take precedence, and not raise a DuplicateSymptomsWithNonIdenticalPropertiesError
.SymptomManager.pre_initialise_population
to create a Property
with name onset_sy_{symptom_name}
and type Types.DATE
for each symptom for which record_onset_date = True
.SymptomManager.change_symptom
to record the onset date when adding symptoms which have record_onset_date
set to True
. SymptomManager.get_symptom_duration_in_weeks
method which returns the duration a symptom has been present in weeks (returning some special value for above a cut off period - ) for one or more persons, raising an error if this is called for a symptom for which record_onset_date
is not set to True
.To be more concrete my suggestion would be something like:
- Add an optional boolean
record_onset_date
argument (and corresponding class attribute) to theSymptom
class initialiser (defaulting toFalse
) that allows indicating whether the onset date of a a symptom should be recorded.- Change
SymptomManager.register_symptom
to deal with the case that mulitpleSymptom
instances are registered which match in all properties other than the value ofrecord_onset_date
, in which case we should register theSymptom
withrecord_onset_date=True
as if any module requests the date to be recorded this should take precedence, and not raise aDuplicateSymptomsWithNonIdenticalPropertiesError
.- Change
SymptomManager.pre_initialise_population
to create aProperty
with nameonset_sy_{symptom_name}
and typeTypes.DATE
for each symptom for whichrecord_onset_date = True
.- Update
SymptomManager.change_symptom
to record the onset date when adding symptoms which haverecord_onset_date
set toTrue
.- Add a
SymptomManager.get_symptom_duration_in_weeks
method which returns the duration a symptom has been present in weeks (returning some special value for above a cut off period - ) for one or more persons, raising an error if this is called for a symptom for whichrecord_onset_date
is not set toTrue
.
This sounds very sensible to me. I think we only have one use case of this at the moment (when we need to know if the symptom has occurred in the last one week, the last two weeks, or longer) but it sounds like the generic approach will stand us in good stead for possible future applications.
Need a helper function in SymptomManager to check on the current duration of symptoms (start_date_of_symptom - self.sim.date (now)) = duration in days. E.g. use for iCCM check for danger signs to refer to health facility: diarrhoea>=14 days, cough>=14 days, fever >= 7days