Strain identity - Githubissues

JoRussell-IDM commented 5 years ago

Each strain that circulates and initiates an infection should be given a unique strain ID, allowing for integration of spatial genetic epi modelling (with Albert/Josh).

Each infection that an individual receives draws from a pool of infections that are contained within that node, and the individual's susceptibility retains the unique identifier for that strain.

DGoes-IDM commented 5 years ago

There's some stuff in the base Node classes that talks about strains, but it doesn't appear to function the way you're describing
There is an existing StrainIdentity class that we could use, where a strain is defined by AntigenID and GeneticID
For purposes of narrowing this down to a single ID, I could set AntigenID to 0 and vary just GeneticID
I'm thinking that I'll create some container in NodeMalaria with all of the StrainIdentities available in that node, and draw a StrainIdentity out of the node for each infection
There's probably something to be done with a node not starting with any StrainIdentities (in fact, @JoRussell-IDM: how do we want to populate a node with strains that we'll draw from?)
I also want to follow up later with another dev about whether this is the right way to do this (really, with everything, but)
When you say that susceptibility retains the unique ID, do you mean a running "memory" of all the strains seen? Do you want anything done with that for now?

JoRussell-IDM commented 5 years ago

Great stuff Dan!

> There's some stuff in the base Node classes that talks about strains, but it doesn't appear to function the way you're describing

I now appreciate this better. I'm looking at NodeMalaria.cpp line 93/98 to see how it creates instances of individual objects (humans, individuals) as a potential parallel basis for creating a strain instance. I'm guessing that demographics/config params tell us how many times to call addNewIndividual. The strain version might be a far simpler function that just populates an array of unique GeneticIDs based on a value we specify in the config (value == 100 means 100 calls to this function, and 100 strains in the array). The only other argument(s) we might want to pass is something like a distribution of relatedness? Or a distribution of relative proportions? This is something where we will want to implement the most basic behavior first and fiddle with it later.

> There is an existing StrainIdentity class that we could use, where a strain is defined by AntigenID and GeneticID

For our purposes a single GeneticID should suffice

> For purposes of narrowing this down to a single ID, I could set AntigenID to 0 and vary just GeneticID

Perfect

> I'm thinking that I'll create some container in NodeMalaria with all of the StrainIdentities available in that node, and draw a StrainIdentity out of the node for each infection

Also perfect: a pool of strain ids available and the selection of a particular GeneticID for each new infection (maybe drawn on their respective weights? that other argument we passed above?)
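A minimal Python sketch of the pool-and-draw idea discussed above, for illustration only; it is not EMOD code. The names `initialize_genetic_ids`, `NodeStrainPool`, and the `weights` argument are hypothetical, and a strain is represented as an `(AntigenID, GeneticID)` pair with AntigenID fixed at 0.

```python
import random


def initialize_genetic_ids(n_strains, start_id=1):
    """Return n_strains unique integer GeneticIDs (hypothetical helper)."""
    if n_strains < 0:
        raise ValueError("n_strains must be non-negative")
    return list(range(start_id, start_id + n_strains))


class NodeStrainPool:
    """Container of the strains available in one node (illustrative only)."""

    def __init__(self, n_strains, weights=None, rng=None):
        self.rng = rng or random.Random()
        # AntigenID is fixed at 0; only GeneticID varies.
        self.strains = [(0, gid) for gid in initialize_genetic_ids(n_strains)]
        if weights is None:
            # Assumption: uniform weights when none are given.
            weights = [1.0] * n_strains
        if len(weights) != n_strains:
            raise ValueError("need exactly one weight per strain")
        self.weights = list(weights)

    def draw(self):
        """Pick one strain for a new infection, or None if the pool is empty."""
        if not self.strains:
            return None
        return self.rng.choices(self.strains, weights=self.weights, k=1)[0]
```

With a config value of 100, `NodeStrainPool(100)` would hold 100 strains, and each new infection would call `draw()` once.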

> There's probably something to be done with a node not starting with any StrainIdentities (in fact, how do we want to populate a node with strains that we'll draw from?)

I think as above we want to draw on a value specified in the config that tells us how many times to call the initialize_GeneticID function. I should talk with Albert and Edward about this. Especially how to specify default behavior, is this really something that I want every researcher to have to specify per sim setting??

> I also want to follow up later with another dev about whether this is the right way to do this (really, with everything, but)

Let me know what they say!

> When you say that susceptibility retains the unique ID, do you mean a running "memory" of all the strains seen? Do you want anything done with that for now?

I worry a little bit that, in hedging to keep the capacity for a genetically explicit model, we will introduce overly cumbersome memory requirements on the default behavior of the model. I foresee different levels of behavior, from simple to complex:

A. The simplest behavior possible that still includes strain diversity is the idea that a node has a value (k, 'node_strain_diversity') that is populated from the config given a simple low-medium-high mapping (low: k = 1, medium: k = 10, high: k = 100), and individuals query this value for immunity calculations. This value is updated every timestep by tracking the measured prevalence in the node (what fraction of individuals are currently infected) and rescaling k for the next timestep.

B. The next most complicated behavior is the introduction of unique genetic_IDs. In this case, upon node initialization, a node creates an object like an array of unique genetic_IDs whose size is mapped to a value (k, 'node_strain_diversity') that is populated from the config given a simple low-medium-high mapping (low: k = 1, medium: k = 10, high: k = 100). Then, whenever an individual human experiences a create-infection moment, they sample from this array of unique genetic_IDs, and the sampled ID stays a property of that infection for its duration. k is then updated at every timestep to be a measurement of how many unique genetic_IDs remain in the population of infected individuals in that node at that timestep. There should be a baseline rate at which new genetic_IDs are created and added to the list (new_strain_birth_rate) to account for mutation/recombination and to balance the stochastic loss of some unique strain IDs during baseline transmission. Ideally, at baseline, a low diversity node stays a low diversity node, and likewise for medium and high diversity nodes. This new strain birth rate is then a great target for calibration in a new site where the particulars of transmission make the mapping of diversity-equilibrium strain frequencies hard to predict!

In this model, each individual accumulates a counter that increments every time they are infected with a strain they haven't seen before, based on a probabilistic draw given the frequencies assigned to each unique genetic_ID in that node: counter += prob(!seen). This counter represents an amalgamation of immune recognition forces as individuals experience strain diversity in a setting.
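Behavior B could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a proposed implementation: the class names are hypothetical, strain frequencies are taken to be uniform, the node array is assumed to be pruned to the IDs still carried by active infections, at most one new strain is born per timestep, and `prob(!seen)` is assumed to be `(1 - frequency) ** n_prior_infections`, since the thread does not define it.

```python
import random

# Low/medium/high mapping quoted in the thread.
DIVERSITY_LEVELS = {"low": 1, "medium": 10, "high": 100}


class ExplicitStrainNode:
    """Node holding an array of unique genetic_IDs (illustrative only)."""

    def __init__(self, level, new_strain_birth_rate, rng=None):
        self.rng = rng or random.Random()
        self.k = DIVERSITY_LEVELS[level]
        self.genetic_ids = list(range(self.k))
        self.next_id = self.k  # next unused genetic_ID
        self.new_strain_birth_rate = new_strain_birth_rate

    def sample_strain(self):
        """Uniform draw for a new infection; None if the array is empty."""
        if not self.genetic_ids:
            return None
        return self.rng.choice(self.genetic_ids)

    def frequency(self, genetic_id):
        """Assumption: every strain in the array is equally frequent."""
        return 1.0 / self.k if genetic_id in self.genetic_ids else 0.0

    def update(self, active_infection_ids):
        """Per-timestep update: prune lost IDs, maybe add one, recompute k."""
        surviving = sorted(set(active_infection_ids))
        # Assumption: at most one new strain per timestep, with this probability.
        if self.rng.random() < self.new_strain_birth_rate:
            surviving.append(self.next_id)
            self.next_id += 1
        self.genetic_ids = surviving
        self.k = len(surviving)


def unseen_probability(frequency, n_prior_infections):
    """Assumed form of prob(!seen): no prior infection carried this strain."""
    return (1.0 - frequency) ** n_prior_infections


class CounterIndividual:
    """Individual that keeps only a counter, not a list of strains."""

    def __init__(self):
        self.n_infections = 0
        self.novel_strain_counter = 0.0

    def receive_infection(self, strain_frequency):
        # counter += prob(!seen)
        self.novel_strain_counter += unseen_probability(
            strain_frequency, self.n_infections
        )
        self.n_infections += 1
```

Under these assumptions a first infection always adds 1 to the counter, and in a k = 1 node every later infection adds 0, which matches the intuition that a single-strain node offers nothing new to recognize.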

C. The third and most complex model is the most genetically explicit one, where individuals retain a list of genetic_IDs that they have seen, each with the capacity for holding some information about resistance markers and with some capacity to transfer resistance from one strain to another given recombination. Genetic_IDs are sampled from a node-specific list as above in B, with strain frequencies and with new strain birth rates. The counter for immune calculations per individual would in this case need to be directly compared upon infection to the list of genetic_IDs previously seen by that individual.

These three different behaviors could be built as configurable logic like:
A. Using a default config value of False for a flag like "Include_explicit_genetics".
B. Using a config value of True for a flag like "Include_explicit_genetics".
C. Using a config value of True for a flag like "Include_explicit_genetics" and a value of True for "Include_resistance_genetics".
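The flag logic above can be sketched as a small Python function. The function name `select_strain_model` is hypothetical, and treating "Include_resistance_genetics" without "Include_explicit_genetics" as an error is an assumption; the thread does not say what should happen in that case.

```python
def select_strain_model(config):
    """Map the two proposed config flags onto behavior 'A', 'B', or 'C'."""
    explicit = bool(config.get("Include_explicit_genetics", False))
    resistance = bool(config.get("Include_resistance_genetics", False))
    if resistance and not explicit:
        # Assumption: resistance genetics only makes sense on top of
        # explicit genetics, so this combination is rejected.
        raise ValueError(
            "Include_resistance_genetics requires Include_explicit_genetics"
        )
    if not explicit:
        return "A"
    return "C" if resistance else "B"
```

An empty config falls through to behavior A, so researchers who do not care about genetics would not have to specify anything.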

JoRussell-IDM commented 5 years ago

Talked with Monique and already trying to clarify this down to simpler language.

Let's stick with two classes of behavior:

A. Implied genetics
B. Explicit genetics

A. In implied genetics, the strain diversity in a node is approximated by evaluating a function that predicts the nonlinear relation between strain diversity and prevalence (FIGURE NEEDED). This removes the onus on the user to specify an expected diversity; the model instead relies on starting prevalence (fraction of infected individuals in the node). In the simplest case, this strain diversity maps onto a pre-determined mapping of strain diversity to immune modifier values. This means the diversity can be updated at every node update step by just querying population-level prevalence! (Drug resistance in this case should be a node property, where the spread of resistance happens at the level of adjacent nodes and resistance as a property is distributed among a node's constituent infections, where it can interact with drugs.) This sounds complicated and worthy of its own issue ticket.
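As a rough Python illustration of implied genetics: the actual diversity-prevalence function is not specified in this thread (the figure is still needed), so the saturating curve below is purely a placeholder. The function names and the `k_min`, `k_max`, and `shape` parameters are hypothetical.

```python
import math


def node_prevalence(infected_flags):
    """Fraction of individuals in the node that are currently infected."""
    flags = list(infected_flags)
    return sum(flags) / len(flags) if flags else 0.0


def implied_strain_diversity(prevalence, k_min=1.0, k_max=100.0, shape=3.0):
    """Placeholder nonlinear map from prevalence to strain diversity.

    Assumption: a saturating exponential running from k_min at prevalence 0
    to k_max at prevalence 1. The real functional form is undecided.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be in [0, 1]")
    saturation = (1.0 - math.exp(-shape * prevalence)) / (1.0 - math.exp(-shape))
    return k_min + (k_max - k_min) * saturation
```

At each node update step the node would recompute `implied_strain_diversity(node_prevalence(...))` and hand the result to whatever mapping converts diversity into an immune modifier.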

B. In explicit genetics, a node is created with a property from the config that specifies either a quantitative or relative strain diversity (could be estimated from prevalence as above) that should dictate the length of the array of unique genetic_IDs to be created and distributed in this node. Then, upon create_infection calls, individuals sample from this array with weights assigned to this array of genetic_IDs. This frequency array should be stable under baseline transmission, but validating this will require extensive model exploration. Individuals retain a list of genetic_IDs that they have seen, and upon infection a new genetic_ID is appended to this list. Then, when individuals are delivered an infection, they can check whether the unique genetic_ID of that strain is one they have seen already, and the fraction of strains seen over strains available can be used to parameterize the strain diversity component of their immune modifier calculation for every new infection. This can also be the framework to incorporate importation events (setting the counter for this new strain back to zero, likelihood of having seen this strain being 0) and drug resistance (incorporating information about resistance metrics into the genetic_ID associated with each infection).
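The per-individual bookkeeping described in B could look roughly like this in Python. `StrainMemory` and `record_infection` are hypothetical names, and computing the fraction before the new genetic_ID is appended is an assumption about ordering.

```python
class StrainMemory:
    """Per-individual list of genetic_IDs seen so far (illustrative only)."""

    def __init__(self):
        self.seen = []  # genetic_IDs in order of infection, repeats kept

    def record_infection(self, genetic_id, available_ids):
        """Check a delivered strain against memory, then remember it.

        Returns (was_seen, fraction_seen), where fraction_seen is the share
        of the node's available strains this individual had already seen
        before this infection.
        """
        available = set(available_ids)
        was_seen = genetic_id in self.seen
        if available:
            fraction_seen = len(available & set(self.seen)) / len(available)
        else:
            fraction_seen = 0.0
        self.seen.append(genetic_id)
        return was_seen, fraction_seen
```

An imported strain is simply a genetic_ID that appears in `available_ids` for the first time, so no individual has seen it and `was_seen` is False for everyone.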

DGoes-IDM commented 5 years ago

A) Is this description ready for trying to turn into code? I think it makes rough sense without the functions fully specified yet, but I can wait if you want to firm it up.

B) I think this also makes sense, though now that I'm looking, A and B look competing. Should they exist side-by-side, and if not, which would you like to start with?

JoRussell-IDM commented 5 years ago

CURRENT STATUS:

Default behavior: infections have no explicit strain identity. easy peasy.

Future behavior: (this is not a complete description, and will require extensive testing and development to get right!) Strain identity is only important in some subset of simulation activities where tracking explicit genetics is important. When specified by a config param, each node is assigned an array that represents the reservoir of genetic diversity within that node, populated with a unique genetic_ID for each strain in the node. The length of this array should be specified by the user in the demographics file or, if not, set by default to a uniform diversity across all nodes. Each ID in the array should be associated with a relative frequency representing its frequency in the population. The default value for these frequencies should be uniform unless specified by the user (i.e. passing a list of frequencies to each node). Individuals draw from this array when given infections. Individuals track which strains they have seen (a compiled list of previous infection IDs), and the immune modifier is calculated somehow given this info. Interactions of co-occurring strains in an individual have the capacity to generate new strain ids (recombination?) and swap resistance markers(?).
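One way the defaulting rules above might be resolved, as a Python sketch. The demographics keys `Strain_Count` and `Strain_Frequencies`, the function name, and the default of 10 strains are all hypothetical; none of them exist in the demographics format today.

```python
def resolve_node_strains(node_demographics, default_strain_count=10):
    """Return {genetic_id: relative_frequency} for one node.

    Assumptions: the array length comes from a hypothetical "Strain_Count"
    key, falling back to the same default for every node, and frequencies
    come from a hypothetical "Strain_Frequencies" key, falling back to
    uniform. Frequencies are normalized to sum to 1.
    """
    n = int(node_demographics.get("Strain_Count", default_strain_count))
    if n < 0:
        raise ValueError("Strain_Count must be non-negative")
    freqs = node_demographics.get("Strain_Frequencies")
    if freqs is None:
        freqs = [1.0] * n  # uniform by default
    if len(freqs) != n:
        raise ValueError("need exactly one frequency per strain")
    total = float(sum(freqs))
    if n > 0 and total <= 0.0:
        raise ValueError("frequencies must sum to a positive value")
    return {gid: f / total for gid, f in enumerate(freqs)}
```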

JoRussell-IDM commented 5 years ago

From conversation with Albert:

Provide a graph of human to human connections where edges are infection ids.
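A small Python sketch of such a graph, assuming each transmission event is available as a `(time, infection_id, source_host, recipient_host)` tuple; that record layout and the function name are hypothetical.

```python
from collections import defaultdict


def build_transmission_graph(events):
    """Build a directed host-to-host graph whose edges carry infection ids.

    events: iterable of (time, infection_id, source_host, recipient_host).
    Returns {source_host: [(recipient_host, infection_id, time), ...]},
    with each host's outgoing edges in time order.
    """
    graph = defaultdict(list)
    for time, infection_id, source, recipient in sorted(events, key=lambda e: e[0]):
        graph[source].append((recipient, infection_id, time))
    return dict(graph)
```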

Each node needs an array of genetic_IDs from which individuals draw upon infection initialization.

Questions for Albert: what is the size of that array in the simplest 1-node case? Does it need to update? What is the rate of introduction of mutation/novelty to that array? Could these be configurable params for a simple 1 node calibration activity to discover stable equilibria of genetic diversity given our transmission model?

JoRussell-IDM commented 5 years ago

IndividualMalaria.cpp, line 409: StrainIdentity strain;

Follow this breadcrumb to understand strain identity/suid in 1.0 better.

JoRussell-IDM commented 5 years ago

From recent conversations with Josh and Albert, a priority output from Malaria2 sims would be a tdf of timestamped transmission events:

columns(time, host1, host2)
row(1, a, b)
row(2, b, c)

A primary feature of interest would be for every infection object to hold in its memory a structure like:

"infection_id" + "current_host_id" + "parent_host_id"

Eventually this string should be able to accommodate a list of markers (as in single-locus drug resistance markers like kelch13, or a list of barcodes for capturing nuances in transmission change and co-infection driven recombination).

One potential barrier to this is that explicit human-vector-human events are probably obscured when using the vector cohort model. It would be useful to explore:

When are strain ids initialized before transmission starts (sim/node initialization?)
How can strain ids be passed on when new infection events happen (infection initialization?)
Are individual suids available during infection initialization?
Maybe after infections are initialized within hosts, all subsequent transmission events just result in a shift of the current host id into the parent host ID slot, with the new host id taking the place of the current host id. These events are logged in some report where the timestamp and full strain id string are saved, so that Albert can back out the host-host connectivity and use it to calibrate his genetic model.
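The host-id shift described above could be sketched like this in Python. `InfectionLineage`, the `"|"` separator, and the `"none"` placeholder for a seed infection's parent are hypothetical choices; the thread only fixes the three fields and their order.

```python
from dataclasses import dataclass


@dataclass
class InfectionLineage:
    """Strain id carried by one infection (illustrative only)."""

    infection_id: str
    current_host_id: str
    parent_host_id: str = "none"  # assumption: placeholder for seed infections

    def strain_string(self):
        # Assumption: "|" as the separator between the three fields.
        return f"{self.infection_id}|{self.current_host_id}|{self.parent_host_id}"

    def transmit(self, new_host_id, time, log):
        """Shift current host into the parent slot and log the event."""
        child = InfectionLineage(
            self.infection_id, new_host_id, self.current_host_id
        )
        # One report row: timestamp, host1, host2, full strain id string.
        log.append((time, self.current_host_id, new_host_id, child.strain_string()))
        return child
```

Seeding `InfectionLineage("inf1", "a")` and transmitting to `b` at time 1 and then to `c` at time 2 reproduces the two example rows (1, a, b) and (2, b, c) in the log.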

JoRussell-IDM commented 5 years ago

10/21/2019 1-on-1: Architectural description of how transmission works (Node, Contagion, Individual)

StrainAwareTransmissionGroups.cpp Ln 48

DecayRates by Route

Ln 110: we strip ID information from the strain for each contagion bin into a matrix
Every time we deposit contagion
Every time we have an ind with positive gametocytemia
Indoor/Outdoor: is there a basis for distinguishing these from a parasite ID perspective?

Matrix object that has antigen route and group

Parasite Transmission Requirements Human to Vector

Must occur only after a successful human feed
Depends on gametocyte positivity of the individual
Infectivity to mosquitoes is proportional to gametocyte density
Infectivity to mosquitoes is subject to negative regulation by a scale factor dependent on the immune_modifier value from Susceptibility
Strains should be treated roughly independently from one another, with separate probabilities of success to transmit to vector
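The human-to-vector requirements above can be sketched in Python as follows. The function name, the `infectivity_per_gametocyte` constant, and the `(1 - immune_modifier)` scale factor are assumptions made for illustration; the notes only say that infectivity is proportional to gametocyte density and is scaled down by something that depends on immune_modifier.

```python
def strain_transmission_probabilities(
    gametocyte_density_by_strain,
    immune_modifier,
    infectivity_per_gametocyte=1e-3,
    successful_feed=True,
):
    """Per-strain probability of infecting the vector on one feed.

    Assumptions: immune_modifier lies in [0, 1] and reduces infectivity by
    the factor (1 - immune_modifier); infectivity_per_gametocyte is a
    made-up proportionality constant. Each strain gets its own independent
    probability, capped at 1.
    """
    if not 0.0 <= immune_modifier <= 1.0:
        raise ValueError("immune_modifier must be in [0, 1]")
    if not successful_feed:
        return {}  # nothing can transmit without a successful human feed
    scale = 1.0 - immune_modifier
    probabilities = {}
    for strain_id, density in gametocyte_density_by_strain.items():
        if density <= 0:
            continue  # gametocyte-negative strains cannot transmit
        probabilities[strain_id] = min(
            1.0, infectivity_per_gametocyte * density * scale
        )
    return probabilities
```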

Vector to Human

JoRussell-IDM / updated_infection_and_immunity

Strain identity #11