LABSS / PROTON-T

Simulation of recruitment to terrorism
MIT License

Flesh out the "propensity" and "risk of radicalization" equations #2

Closed nicolaspayette closed 5 years ago

nicolaspayette commented 6 years ago

The numbers we have to measure the strength of the various factors that enter into the propensity and the risk equations are odds ratios. What we need for the model, however, are weights that, when normalized, add up to 1.0.

While it would be trivial to just normalize the odds ratios, I'm not sure this is the correct approach. My intuitive understanding of odds ratios is limited, but I believe that an odds ratio of 1.0 means that something has no effect, so it should translate to a weight of 0.

Assuming all the odds ratios we use are greater than or equal to one, a naive approach would be to just subtract one from each odds ratio and then normalize. That would mean that an odds ratio of 1.2 carries twice as much weight as an odds ratio of 1.1. Does that make sense?

This is not an urgent issue, as we won't need realistic weights until we start running analysis, but we will need them at some point.

@aronszekely, I would appreciate your insights on this if you have any. I'm happy to expand on the requirements if they're not clear.

aronszekely commented 6 years ago

I’m keeping this in mind and will let you know if I have anything useful to add.


mariopaolucci commented 6 years ago

I was starting to look into this. Starting means I feel a need to restate the obvious, partly because nothing seems obvious to me right now. I'll follow a different approach wrt Nicolas' one.

Aron, can you read quickly and just give me your feeling if this is going in the right direction or not?

I will use the Wikipedia entry on odds ratios as the reference for standard naming.

I assume we're talking about the table on p. 73-74 of deliverable 2.1 - the one starting like this:

Risk Factor Overall Beliefs Actions
Age 1.270 1.158 1.511
SES 1.001 Overall=1.381  

Assuming that these really are ORs (and the range of the numbers suggests they are), we need one more piece of information to convert them into probabilities: the marginal probabilities.

Let's take age as an example. ORs are calculated from the four cell probabilities (the formula, $\mathrm{OR} = \frac{p_{11}\,p_{00}}{p_{01}\,p_{10}}$, is not important now).

| | action | no action |
|---|---|---|
| aged | p11 | p10 |
| not aged | p01 | p00 |

To invert that formula we need three figures (three rather than four, because the probabilities sum to one): one is the OR, and the other two are usually the marginals, that is, P(aged) = p11 + p10 and P(action) = p11 + p01. We need these for each of the factors, as the factors are collected from different sets of papers (the formula is on Wikipedia, not important now).

This would return a list of the 4 probability values for every significant factor. From them, we can build synthetic populations with that OR. The next problem is how to build just ONE population, not one for each factor. I'm afraid the factors can't be considered independent, either.
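A minimal sketch of the inversion described above, assuming the OR and both marginals are known. Substituting p10 = P(aged) - p11, p01 = P(action) - p11 and p00 = 1 - p11 - p10 - p01 into the OR definition gives a quadratic in p11. The function name and the marginal values used in the example (0.3 and 0.05) are hypothetical and are not taken from D2.1.

```python
import math


def two_by_two_from_or(odds_ratio, p_exposed, p_outcome):
    """Return (p11, p10, p01, p00) given an OR and the two marginals.

    p_exposed = p11 + p10 (e.g. P(aged)), p_outcome = p11 + p01 (e.g. P(action)).
    """
    if odds_ratio == 1.0:
        # No association: the joint probability is the product of the marginals.
        p11 = p_exposed * p_outcome
    else:
        k = odds_ratio - 1.0
        s = 1.0 + (p_exposed + p_outcome) * k
        discriminant = s * s - 4.0 * odds_ratio * k * p_exposed * p_outcome
        # The smaller root of the quadratic is the one that yields valid probabilities.
        p11 = (s - math.sqrt(discriminant)) / (2.0 * k)
    p10 = p_exposed - p11
    p01 = p_outcome - p11
    p00 = 1.0 - p11 - p10 - p01
    return p11, p10, p01, p00


# Hypothetical marginals, combined with the "Overall" OR for age (1.270).
p11, p10, p01, p00 = two_by_two_from_or(1.270, p_exposed=0.3, p_outcome=0.05)
recovered_or = (p11 * p00) / (p01 * p10)
print(p11, p10, p01, p00, recovered_or)
```

This only handles one factor at a time, so it does not address the problem of building a single population in which the factors are correlated.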

naming things

I think that we preliminarily agreed that we can use beliefs for propensity and actions for "risk" (using risk as a name for that equation is really irksome). They also use the name "risk" as a title for the table that contains the odds ratios...

There seems to be a labelling confusion here, as most titles refer to risk while the tables are marked OR for odds ratio. They are not the same, even if:

OR and RR are usually comparable in magnitude when the disease studied is rare (eg, most cancers). However, an OR can overestimate and magnify risk, especially when the disease is more common (eg, hypertension) and should be avoided in such cases if RR can be used. from
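
A small sketch of the point made in the quote, using made-up probabilities (not values from D2.1): the same relative risk of 2 corresponds to an OR close to 2 when the outcome is rare, and to a noticeably larger OR when the outcome is common.

```python
def risk_ratio(p_exposed, p_unexposed):
    """Relative risk: ratio of the outcome probabilities in the two groups."""
    return p_exposed / p_unexposed


def odds_ratio(p_exposed, p_unexposed):
    """Odds ratio: ratio of the outcome odds in the two groups."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical outcome probabilities (exposed, unexposed); RR is 2 in both cases.
for label, p1, p0 in [("rare", 0.002, 0.001), ("common", 0.4, 0.2)]:
    print(label, "RR =", round(risk_ratio(p1, p0), 3), "OR =", round(odds_ratio(p1, p0), 3))
```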

aronszekely commented 6 years ago

There seem to me to be two difficulties in using probabilities instead of ORs (I assume that the ORs included in D2.1 come from papers that use multivariate logistic regressions to estimate them). If so:

1) The ORs are estimated conditional upon other variables. While ORs remain fixed, irrespective of the values of the other variables, the predicted probabilities are conditional upon the values that those other variables take.

For instance, assume that there are two variables predicting recruitment: age and education. A one-unit increase in age, from 20 to 21, may increase the probability of recruitment from 20% to 25% when education is low, while at high education it may increase the probability of recruitment from 20% to 30%. So the effect on probability that age has is conditional on the values of the other variables.

2) The other issue is that the effect in terms of probability also depends upon the level of the variable itself. Holding education constant (e.g. setting it at low), a one-unit increase in age from 20 to 21 will have a different effect on the probability of recruitment than a one-unit increase in age from 60 to 61.

ORs are however constant irrespective of the value of the variable.
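
A minimal sketch of why a constant OR does not translate into a constant change in probability: the OR multiplies the odds, so the resulting shift in probability depends on the baseline. The baseline probabilities below are hypothetical; 1.27 is the "Overall" OR for age from the table.

```python
def apply_odds_ratio(baseline_probability, odds_ratio):
    """Multiply the baseline odds by the OR and convert back to a probability."""
    odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Same OR, different (hypothetical) baselines: the change in probability differs.
for baseline in (0.01, 0.20, 0.50):
    shifted = apply_odds_ratio(baseline, 1.27)
    print(f"baseline {baseline:.2f} -> {shifted:.4f} (change {shifted - baseline:+.4f})")
```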

This may be helpful:

https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/

As well as the following notes:

Handout_7_2011.pdf Handout_8_2011.pdf

My suggestion would be to use a ratio of ORs. Specifically, I have in mind the following equation for the weight in the model:

weight = (OR-1)/max(OR-1)

This does two things: it weights the OR of a variable relative to the largest observed effect, and it adjusts the ORs so that an OR of 1 receives a weight of 0.

Consider some of the factors reported in D2.1 pp. 73-74:

| Risk Factor | Overall OR | OR-1 | (OR-1)/max(OR-1) |
|---|---|---|---|
| Criminal history | 6.002 | 5.002 | 1 |
| Education | 2.092 | 1.092 | 0.218312675 |
| Immigrant | 1.961 | 0.961 | 0.192123151 |
| Employment | 1.674 | 0.674 | 0.134746102 |
| Marital status | 1.417 | 0.417 | 0.083366653 |
| Age | 1.27 | 0.27 | 0.053978409 |
| SES | 1.001 | 0.001 | 0.00019992 |
| Made up variable | 1 | 0 | 0 |

nicolaspayette commented 6 years ago

> weight = (OR-1)/max(OR-1)
>
> This does two things: it weights the OR of a variable relative to the largest observed effect, and it adjusts the ORs so that an OR of 1 receives a weight of 0.

This is very similar to what I initially had in mind, except that I envisioned the normalisation to be over the sum instead of being over the max. Adding a column to your table, we get:

| Risk Factor | Overall OR | OR-1 | (OR-1)/max(OR-1) | (OR-1)/sum(OR-1) |
|---|---|---|---|---|
| Criminal history | 6.002 | 5.002 | 1 | 0.5942734941 |
| Education | 2.092 | 1.092 | 0.218312675 | 0.1297374361 |
| Immigrant | 1.961 | 0.961 | 0.192123151 | 0.1141736961 |
| Employment | 1.674 | 0.674 | 0.134746102 | 0.0800760366 |
| Marital status | 1.417 | 0.417 | 0.083366653 | 0.0495425924 |
| Age | 1.27 | 0.27 | 0.053978409 | 0.0320779375 |
| SES | 1.001 | 0.001 | 0.00019992 | 0.0001188072 |
| Made up variable | 1 | 0 | 0 | 0 |

Is there any reason to prefer one over the other? Again, my intuitions about that kind of stuff are limited...
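
For comparison, a short sketch that computes both normalizations from the Overall ORs in the table. The variable names are illustrative only.

```python
# Overall odds ratios from D2.1 pp. 73-74, as listed in the table.
odds_ratios = {
    "Criminal history": 6.002,
    "Education": 2.092,
    "Immigrant": 1.961,
    "Employment": 1.674,
    "Marital status": 1.417,
    "Age": 1.27,
    "SES": 1.001,
    "Made up variable": 1.0,
}

# Shift so that an OR of 1 (no effect) maps to 0.
excess = {name: value - 1.0 for name, value in odds_ratios.items()}

largest = max(excess.values())
total = sum(excess.values())

weights_by_max = {name: value / largest for name, value in excess.items()}
weights_by_sum = {name: value / total for name, value in excess.items()}

for name in odds_ratios:
    print(f"{name:18s} {weights_by_max[name]:.6f} {weights_by_sum[name]:.6f}")
```

The two columns differ only by a constant factor (`largest / total`), so the ranking of the factors and the ratios between any two weights are identical. Only the sum-normalized weights add up to 1.0.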

nicolaspayette commented 5 years ago

Following discussion with David, Michael and Badi this morning, it seems like we are going to get Cohen's d effect sizes.

The general strategy would then be to normalize all the effect sizes so that they sum to one, and then use them as multipliers for the corresponding variables (with each of those variables also being normalized to the range 0 to 1, possibly by transforming them into categorical variables, e.g. "high" = 1 / "low" = 0).

It occurs to me, though, that we don't have an "effect size" for the role of "propensity" vs. the effect of the various topics in the "risk of radicalisation" part of the equation. Unless we get a chance to talk to David about that during dinner, you'll have to bring this up with them by email or in a subsequent meeting.

nicolaspayette commented 5 years ago

> It occurs to me, though, that we don't have an "effect size" for the role of "propensity" vs. the effect of the various topics in the "risk of radicalisation" part of the equation.

Oh, I now realise that we already have what we need.

Suppose we have the following propensity factors (F1, F2) and topics (T1, T2), with the following effect sizes:

| | Effect size |
|---|---|
| F1 | 0.2 |
| F2 | 0.6 |
| T1 | 0.4 |
| T2 | 0.8 |

The effect sizes sum to 2.0, so the weight for the propensity component is (0.2 + 0.6) / 2.0 = 0.4 and the weight for the dynamic component is (0.4 + 0.8) / 2.0 = 0.6.

The key is to remember that, from the model's point of view, separating propensity from the dynamic part is just a way to avoid recomputing it at every tick.
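
A minimal sketch of that computation, using the example effect sizes for F1, F2, T1 and T2. The grouping into `propensity_factors` and `topics` is just an illustrative way to organize the data.

```python
effect_sizes = {"F1": 0.2, "F2": 0.6, "T1": 0.4, "T2": 0.8}
propensity_factors = ("F1", "F2")
topics = ("T1", "T2")

# Normalize all effect sizes together so that the weights sum to one.
total = sum(effect_sizes.values())
weights = {name: size / total for name, size in effect_sizes.items()}

# The weight of each component is the sum of the weights of its members.
propensity_weight = sum(weights[name] for name in propensity_factors)
dynamic_weight = sum(weights[name] for name in topics)

print(weights)
print(propensity_weight, dynamic_weight)
```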


Here are the latest effect sizes provided by Michael (https://github.com/LABSS/PROTON-T/pull/30):

| Factor | Effect size |
|---|---|
| Experienced discrimination | 0.154 |
| Unemployment | 0.116 |
| Negative Police contact | 0.721 |
| Criminal history | 0.678 |
| Male | 0.203 |
| Low education | 0.313 |
| High education | -0.153 |
| Immigrant | 0.084 |
| Collective relative deprivation | 0.332 |
| Low legitimacy | 0.554 |
| High trust | -0.144 |
| Age (ordinal) | -0.074 |
| Differential associations | 0.416 |
| Online diff. associations | 0.219 |

A few things to note:

mariopaolucci commented 5 years ago

(from HUJI)

For the Islamic radicalization model:

All effect sizes/weights are presented as Cohen's d.

Where possible, effect sizes/weights were derived from moderator analysis limited to Islamic ideology and/or EU region.

*Age is measured continuously

Propensity Table:

| Propensity = | Gender + | Age + | Unempl. + | Criminal history + | Immigrant (1st gen.) + | Authoritarian/fundamentalist |
|---|---|---|---|---|---|---|
| | .113 | -.099 | .208 | .678 | .081 | .90 |

Risk Table:

| Risk = | Propensity + | Subjective deprivation + | Integration + | Legitimacy |
|---|---|---|---|---|
| | X | .232 | .376 (low) | .554 (low) |
| | | | -.355 (high) | -.306 (high) |

mariopaolucci commented 5 years ago

(from HUJI)

The first major change is that we are not going to measure age as a continuous factor for the propensity equation. Rather, any subject who is younger than 25 when the model is initialized will automatically have 0.1 added to their propensity score.

In the meeting, we also decided that only people aged 16+ exist in the model, so the age term is +0.1 for ages 16-25.

As such, the new propensity formula looks as follows:

| Propensity = | Gender + | Unemploy. + | Criminal history + | Immigrant (1st gen.) + | Authoritarian/fundamentalist + | Age (16-25) |
|---|---|---|---|---|---|---|
| | .113 | .208 | .678 | .081 | .90 | .1 |
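
A minimal sketch of this propensity formula, assuming every factor is coded as present/absent. The dictionary keys and the `propensity` function are hypothetical names, and treating the gender effect as applying to males is an assumption based on the "Male" row in the earlier effect-size table.

```python
# Effect sizes (Cohen's d) from the revised propensity formula.
PROPENSITY_EFFECTS = {
    "male": 0.113,  # assumption: the "Gender" effect applies to males
    "unemployed": 0.208,
    "criminal_history": 0.678,
    "immigrant_first_gen": 0.081,
    "authoritarian_fundamentalist": 0.90,
    "aged_16_25": 0.1,
}


def propensity(agent):
    """Sum the effect sizes of the dichotomous factors that are true for the agent."""
    return sum(
        effect
        for factor, effect in PROPENSITY_EFFECTS.items()
        if agent.get(factor, False)
    )


# Hypothetical agent: a young unemployed male with no other risk factors.
print(propensity({"male": True, "unemployed": True, "aged_16_25": True}))
```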

Secondly, with regards to two of the three opinion topics, integration and legitimacy, we will work on a scale of -2 to +2, that is -2, -1, 0, +1, +2. This will enable us to take advantage of the fact that we have separate effects for protective and risk factors. For the third factor, subjective deprivation, given that we do not have a protective factor effect, we will simply work off of a 0 to +2 scale, where +1 is equal to half of the risk effect.

Here, we discussed how to scale the [-1, 1] opinion dynamics to the [-2, 2] scale. The decision is that we just multiply by 2 and interpolate linearly.

| Risk = | Propensity + | Subjective deprivation + | Integration + | Legitimacy |
|---|---|---|---|---|
| +2 | | .232 | .376 | .554 |
| +1 | | .116 | .188 | .277 |
| 0 | | 0 | 0 | 0 |
| -1 | | | -.178 | -.153 |
| -2 | | | -.355 | -.306 |

This approach is in line with the interpretation of the effect size statistic, Cohen's d, as representing the mean change in standard deviation units. It also takes advantage of the fact that the variables in the propensity formula are dichotomous, whereas the variables in the risk formula are all ordinal. Of course, you can scale this to changes at the .1 level between 0 and 1, as well as between 1 and 2.
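
A minimal sketch of the scaling and interpolation, assuming that an opinion in [-1, 1] is multiplied by 2 and that positive scale values correspond to the risk direction, as in the table. Giving subjective deprivation a contribution of 0 for negative values is an assumption, since only a 0 to +2 scale is defined for it. All function and variable names are hypothetical.

```python
SCALE_POINTS = [-2, -1, 0, 1, 2]

# Contribution of each topic at each scale point, taken from the table.
TOPIC_EFFECTS = {
    # Assumption: no protective effect, so negative values contribute nothing.
    "subjective_deprivation": [0.0, 0.0, 0.0, 0.116, 0.232],
    "integration": [-0.355, -0.178, 0.0, 0.188, 0.376],
    "legitimacy": [-0.306, -0.153, 0.0, 0.277, 0.554],
}


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation of ys over xs, clamped to the ends of xs."""
    x = min(max(x, xs[0]), xs[-1])
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + fraction * (ys[i + 1] - ys[i])


def topic_contribution(topic, opinion):
    """Map an opinion in [-1, 1] to the [-2, 2] scale and look up its effect."""
    return interpolate(2.0 * opinion, SCALE_POINTS, TOPIC_EFFECTS[topic])


def risk(propensity, opinions):
    """Risk = propensity + sum of the topic contributions."""
    return propensity + sum(
        topic_contribution(topic, opinion) for topic, opinion in opinions.items()
    )


# Hypothetical agent: propensity 0.213, fully negative on integration (+1),
# fully positive on legitimacy (-1), neutral on subjective deprivation.
print(risk(0.213, {"integration": 1.0, "legitimacy": -1.0, "subjective_deprivation": 0.0}))
```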

Additionally, by taking this approach there is essentially equal weight given to the propensity and risk formulas, at least in terms of the maximum scores that can be generated by each (2 for each).

It actually is around 2 for propensity, around 1 for risk, max of three.
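
A quick check of those maxima, summing the effect sizes from the revised propensity formula and the +2 row of the risk table (variable names are illustrative):

```python
# All propensity factors present.
propensity_effects = [0.113, 0.208, 0.678, 0.081, 0.90, 0.1]
# All three opinion topics at +2.
topic_effects_at_plus_two = [0.232, 0.376, 0.554]

max_propensity = sum(propensity_effects)
max_topics = sum(topic_effects_at_plus_two)

print(round(max_propensity, 3), round(max_topics, 3), round(max_propensity + max_topics, 3))
```

That gives roughly 2.08 for propensity and 1.16 for the topics, so a combined maximum of about 3.24.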

We are still working on putting together the empirical distributions of the opinion topics but we should have this ready quite soon.

Michael, on behalf of David and Badi