desmarais-lab / donation_diffusion

Code for the project: Diffusion of Candidates in Campaign Finance Networks

Results are in #3

Open flinder opened 6 years ago

flinder commented 6 years ago

@bdesmarais I just pushed all the networks we have so far to Box/Strategic Donors/Data/results/. File names should be self-explanatory. If you need the edgelist for your analysis, each network is saved as a .RData file. The network I describe below is netinf_threshold_6_iter_1.RData.

I got some results for the latest netinf run. The largest population we have results for is threshold 8 (or ~9000 donors). Here's the distribution for donors in the network (TRUE) and isolates (FALSE):

  matched n_donors min_donations max_doations mean_donations median_donations mean_amount median_amount
    <lgl>    <int>         <dbl>        <dbl>          <dbl>            <int>       <dbl>         <int>
1   FALSE     7387             7           26       9.821578                9    9928.585          6200
2    TRUE     2085            19          549      64.155396               37  120562.457         63317

Here's a plot of the distributions

Seems to me like with 9000 donors we are already including way too many donors. If we remove all donors/candidates with fewer than 19 recipients/donations (19 being the lowest number of donations in the non-isolate group), we get 1958 donors in the dataset used for netinf.

What do you think?

Results can be reproduced with the script (/analysis/interpret_netinf_output.R). You can also use this if you want to extract a network inferred from a specific threshold and join it with the donor data (just change the threshold value in line 12).

bdesmarais commented 6 years ago

Hi Frido,

It's fantastic that this is running so fast!

So if I am reading this correctly, is it accurate to say that the nodes added at a threshold of 8 were all isolates in the network identified by netinf? If so, let's stop there. I will try to get the re-analysis done tomorrow morning.

-Bruce


Bruce A. Desmarais Associate Professor, Department of Political Science Director, Graduate Programs in Social Data Analytics Pennsylvania State University brucedesmarais.com

In reply to Fridolin Linder's comment of Thu, Mar 1, 2018 at 5:17 PM.

flinder commented 6 years ago

@bdesmarais I must have been tired yesterday. I just re-ran the code and I find that all donors are actually in the network for both thresholds 6 and 5. I must have had an old, smaller network in memory or something. I'll look more into it, just wanted to let you know quickly.

flinder commented 6 years ago

I did some more digging and it turns out that, in every network, every donor has at least one incoming edge, but only a smaller fraction of donors are sending edges. Furthermore, the proportion of donors in the data that are origins of at least one edge decreases as more donors are added (i.e., as the threshold is lowered). See here:

# A tibble: 8 x 6
  threshold n_donors_in_network_origin n_donors_in_network_destination perc_in_data_origin n_donors_in_data n_edges
      <dbl>                      <int>                           <int>               <dbl>            <int>   <int>
1         5                       2518                           12404           0.2029990            12404   23962
2         6                       2174                            9472           0.2295186             9472   21839
3         8                       1626                            6134           0.2650799             6134   17427
4        10                       1302                            4340           0.3000000             4340   14948
5        12                       1044                            3373           0.3095168             3373   12778
6        14                        956                            2753           0.3472575             2753   12194
7        16                        842                            2380           0.3537815             2380   11367
8        18                        791                            2085           0.3793765             2085   10665
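
For reference, a minimal sketch of how the columns in this table could be derived from an inferred edge list. This is not the code in /analysis/interpret_netinf_output.R; the function name, the dictionary keys, and the toy edge list are hypothetical, and it assumes every node in the edge list is also a donor in the data.

```python
def origin_destination_summary(edges, n_donors_in_data):
    """Summarise an inferred network given as (origin, destination) pairs.

    Assumes (hypothetically) that every node appearing in `edges` is one of
    the `n_donors_in_data` donors used as input to netinf.
    """
    edges = list(edges)
    origins = {origin for origin, _ in edges}
    destinations = {destination for _, destination in edges}
    return {
        "n_donors_in_network_origin": len(origins),
        "n_donors_in_network_destination": len(destinations),
        # Share of all donors in the data that send at least one edge.
        "perc_in_data_origin": len(origins) / n_donors_in_data,
        # Donors that neither send nor receive an edge.
        "n_isolates": n_donors_in_data - len(origins | destinations),
        "n_edges": len(edges),
    }


# Toy example: 5 donors in the data, 4 inferred edges, donor "e" is an isolate.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
summary = origin_destination_summary(toy_edges, n_donors_in_data=5)
print(summary)
```

With the threshold 5 row above, the same ratio is 2518 / 12404, which matches the reported perc_in_data_origin.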

flinder commented 6 years ago

I wonder how low the p-value cutoff (or number of edges) has to be to see true isolate nodes. We could check that with the data we have.

flinder commented 6 years ago

Here's the same data with a 0.05 p-value cutoff:

# A tibble: 8 x 6
  threshold n_donors_in_network_origin n_donors_in_network_destination perc_in_data_origin n_donors_in_data n_edges
      <dbl>                      <int>                           <int>               <dbl>            <int>   <int>
1         5                       1747                           12390           0.1408417            12404   20399
2         6                       1396                            9472           0.1473818             9472   17641
3         8                        966                            6134           0.1574829             6134   13285
4        10                        827                            4340           0.1905530             4340   11439
5        12                        727                            3373           0.2155351             3373   10286
6        14                        662                            2753           0.2404649             2753    9537
7        16                        603                            2380           0.2533613             2380    8921
8        18                        568                            2085           0.2724221             2085    8409

bdesmarais commented 6 years ago

So it looks like if we use a threshold of 5 we are at least adding some nodes to the network that are isolates. How long do you think it would take to run this with p=0.05 and a threshold of 4?


In reply to Fridolin Linder's comment of Fri, Mar 2, 2018 at 10:40 AM.

flinder commented 6 years ago

The threshold 4 model is currently at 16K edges with a p-value of ~0.01. I started it on Tuesday, I think, so it will probably take another 4 days or so. That's 17300 donors btw.

flinder commented 6 years ago

Also note that in the threshold 5 model all donors are still included in the network. The 14 nodes that are not destinations all seem to be origins.

bdesmarais commented 6 years ago

How many possible edges are there in the threshold 4 network?


In reply to Fridolin Linder's comment of Fri, Mar 2, 2018 at 12:17 PM.

flinder commented 6 years ago

78,007,588

 cascades: 925
 nodes: 17372
 nodes in cascades: 17372
 possible edges: 78007588

Summary statistics for cascade length and number of ties:
           length      ties
Min.       5.0000    0.0000
1st Qu.   39.0000    5.0000
Median   168.0000   36.0000
Mean     269.2649  131.3395
3rd Qu.  300.0000   95.0000
Max.    5445.0000 4891.0000

bdesmarais commented 6 years ago

OK, so the network inferred at p=.01 is still very sparse.
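
As a rough illustration of how sparse that is, here is a small sketch using the two threshold 4 figures quoted in this thread (about 16K inferred edges and 78,007,588 possible edges). The function name is hypothetical, and the 16,000 edge count is the rounded figure from the comment above, not a final result.

```python
def sparsity(n_edges, n_possible_edges):
    """Share of possible edges that are absent from the inferred network."""
    return 1.0 - n_edges / n_possible_edges


# Threshold 4 run: ~16K edges so far, out of 78,007,588 possible edges.
threshold_4_sparsity = sparsity(16_000, 78_007_588)
print(f"density:  {1.0 - threshold_4_sparsity:.6f}")
print(f"sparsity: {threshold_4_sparsity:.6f}")
```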


In reply to Fridolin Linder's comment of Fri, Mar 2, 2018 at 1:52 PM.

flinder commented 6 years ago

At 0.01 we get 13,200 edges in the threshold 5 network

flinder commented 6 years ago

which is still 99.97% sparse (a sparsity of 0.9997 as a proportion)

bdesmarais commented 6 years ago

It may be the case that the p-value should be much lower than 0.05 with a large number of nodes. Would you try out the following experiment?

  1. Take one of the networks inferred at a high threshold and low p-value (e.g., threshold = 10, p-val = 0.01).

  2. Simulate the same number of cascades we have in the data from this network.

  3. Run netinf on the data simulated in Step (2) to see the p-value at which we would infer the correct number of edges.

  4. Repeat steps 2 and 3 ten times to see if there is a p-value that seems to work well.
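
The four steps above can be sketched as a small loop. This is only an outline of the experiment's structure: `simulate` and `count_inferred_edges` are hypothetical stand-ins for the real cascade simulation and netinf calls, and the toy versions below exist only so the sketch runs end to end.

```python
def calibrate_p_value(n_true_edges, n_reps, p_values, simulate, count_inferred_edges):
    """For each replication, find the cutoff whose inferred edge count is
    closest to the number of edges in the network we simulated from.

    `simulate(rep)` and `count_inferred_edges(cascades, p)` are hypothetical
    callables supplied by the caller (steps 2 and 3 of the experiment).
    """
    best_per_rep = []
    for rep in range(n_reps):
        cascades = simulate(rep)  # step 2: simulate cascades from the network
        errors = {
            p: abs(count_inferred_edges(cascades, p) - n_true_edges)
            for p in p_values  # step 3: run inference at each cutoff
        }
        best_per_rep.append(min(errors, key=errors.get))
    return best_per_rep  # step 4: inspect the cutoffs across replications


# Toy stand-ins so the loop can be executed; they do not model netinf.
def toy_simulate(rep):
    return {"replication": rep}


def toy_count_inferred_edges(cascades, p):
    # Pretend a looser cutoff admits proportionally more edges.
    return round(100_000 * p)


best = calibrate_p_value(
    n_true_edges=1_000,
    n_reps=10,
    p_values=[0.001, 0.01, 0.05],
    simulate=toy_simulate,
    count_inferred_edges=toy_count_inferred_edges,
)
print(best)
```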

-Bruce


In reply to Fridolin Linder's comment of Fri, Mar 2, 2018 at 1:59 PM.

flinder commented 6 years ago

@bdesmarais, I just did steps 1 and 2 and realized the simulated cascades are quite different from the original ones. Do you think this should concern us?

Original cascades summary:

# cascades: 681
# nodes: 2085
# nodes in cascades: 2085
# possible edges: 3076653

Summary statistics for cascade length and number of ties:
          length     ties
Min.     19.0000   0.0000
1st Qu. 103.0000  13.0000
Median  162.0000  32.0000
Mean    194.6006  54.1953
3rd Qu. 257.0000  69.0000
Max.    835.0000 488.0000

Simulated:

# cascades: 681
# nodes: 2085
# nodes in cascades: 2085
# possible edges: 4345132

Summary statistics for cascade length and number of ties:
           length ties
Min.       1.0000    0
1st Qu.    1.0000    0
Median     1.0000    0
Mean     506.3568    0
3rd Qu.   10.0000    0
Max.    2085.0000    0

bdesmarais commented 6 years ago

That definitely looks like a poor fit. Does this improve if we use the log-normal distribution?


In reply to Fridolin Linder's comment of Tue, Mar 6, 2018 at 1:47 PM.

flinder commented 6 years ago

I checked it with a smaller donor set. Fit is slightly better but still not good:

# cascades: 558
# nodes: 717
# nodes in cascades: 717
# possible edges: 505513

Summary statistics for cascade length and number of ties:
          length      ties
Min.     51.0000   0.00000
1st Qu.  93.0000  10.00000
Median  138.0000  24.00000
Mean    158.5287  35.49642
3rd Qu. 199.7500  46.00000
Max.    504.0000 235.00000

Simulated:

# cascades: 558
# nodes: 717
# nodes in cascades: 717
# possible edges: 513352

Summary statistics for cascade length and number of ties:
           length ties
Min.      1.00000    0
1st Qu.   1.00000    0
Median    1.00000    0
Mean     49.17742    0
3rd Qu.   1.00000    0
Max.    717.00000    0

bdesmarais commented 6 years ago

It is unclear to me how, under the simulation algorithm, it is possible for a cascade to stop at one node since we simulate finite diffusion times from each node to every other node. Do you know what is happening when a cascade stops at a single node?

flinder commented 6 years ago

The only possibilities that I see (besides a bug) are that the diffusion times are all over the censoring time (which I set to 2 years, so that should be impossible under the distribution) or that we are somehow starting cascades at isolate nodes. I'll look into it.

bdesmarais commented 6 years ago

That raises one thought. In netinf the initial node in the cascade is taken as given. We should simulate the cascades starting with the same nodes we observe in the data.

In reply to Fridolin Linder's comment of Thu, Mar 8, 2018 at 8:01 AM.

flinder commented 6 years ago

Ok, that makes sense. The cascades I posted here are simulated using the empirical start probabilities (the proportion of cascades in which each node is the first adopter). So I fear it won't change the fit much, but I will try starting deterministically with the same start nodes.

flinder commented 6 years ago

I figured out where the length-one cascades in the simulation come from. They start at nodes that are in the network but have only incoming edges, so no other node can be reached from them. We start from them anyway because they are the first adopters in some cascades. In the example data here, of a total of 717 donors, there are 556 donors that are either not in the network or appear only as destinations. This is the distribution of how often these 556 are first adopters:

  0   1   2   3   4 
356 153  37   6   4 

I.e., 356 of the 556 are never first adopters, 153 are first adopters once, and so on.

So the problem might arise from not inferring enough edges in the first place, meaning this would be an indication of poor fit of the model (with the number of edges being an additional parameter).
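
To make the mechanism concrete, here is a minimal sketch of a cascade simulation on a directed network. It is not the project's simulation code: the function name is hypothetical, diffusion times are assumed to be exponential with a single made-up rate, and the censoring time of 730 assumes the 2-year cutoff is measured in days.

```python
import heapq
import random


def simulate_cascade(edges, start, rate=1.0, max_time=730.0, rng=None):
    """Simulate one cascade and return {node: adoption_time}.

    Assumptions (not taken from the project code): exponential diffusion
    times with a common `rate`, and censoring at `max_time` (2 years in days).
    """
    rng = rng or random.Random()
    out_neighbours = {}
    for origin, destination in edges:
        out_neighbours.setdefault(origin, []).append(destination)

    adoption_times = {}
    queue = [(0.0, start)]  # (candidate adoption time, node)
    while queue:
        time, node = heapq.heappop(queue)
        if node in adoption_times:
            continue  # already adopted via an earlier arrival
        adoption_times[node] = time
        for neighbour in out_neighbours.get(node, []):
            if neighbour not in adoption_times:
                arrival = time + rng.expovariate(rate)
                if arrival <= max_time:  # drop censored diffusion times
                    heapq.heappush(queue, (arrival, neighbour))
    return adoption_times


toy_edges = [("a", "b"), ("b", "c"), ("a", "c")]
rng = random.Random(1)
from_origin = simulate_cascade(toy_edges, start="a", rng=rng)
from_sink = simulate_cascade(toy_edges, start="c", rng=rng)
print(len(from_origin), len(from_sink))
```

Node "c" has only incoming edges, so a cascade started there can never spread, which is the length-one pattern seen in the simulated summaries.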

It seems to me like this would also be an interesting way of checking the model fit, in a posterior predictive check sense.

I'm currently checking if the fit improves if we infer more edges.

flinder commented 6 years ago

With a larger network, the cascades simply spread to all nodes within the max diffusion time:

# cascades: 558
# nodes: 717
# nodes in cascades: 717
# possible edges: 513372

Summary statistics for cascade length and number of ties:
          length ties
Min.      1.0000    0
1st Qu. 717.0000    0
Median  717.0000    0
Mean    713.1505    0
3rd Qu. 717.0000    0
Max.    717.0000    0

Looking at a sample of 10 cascades in original and simulated data also shows that the diffusion happens much faster in the simulation.

I normalized the original data by subtracting the first diffusion time from all times in each cascade. That made me realize we should probably also have the simulation function generate data on the same time scale as the original data. I'm not sure if this is happening in our code, but it also seems like that could shorten the cascades (since depending on the start time, not every cascade has the same amount of time to 'play out')
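
The normalization described above amounts to the following sketch. The function names and the toy cascade are hypothetical; `window_end` stands for whatever date the observation period ends on, expressed in the same time units as the cascade.

```python
def normalize_cascade(adoption_times):
    """Shift a cascade so that its first adoption happens at time 0."""
    first_time = min(adoption_times.values())
    return {node: time - first_time for node, time in adoption_times.items()}


def time_left_to_play_out(adoption_times, window_end):
    """Time between the first adoption and the end of the observation window."""
    return window_end - min(adoption_times.values())


# Toy cascade on the original time scale (for example, days since some origin date).
toy_cascade = {"donor_1": 120.0, "donor_2": 150.0, "donor_3": 400.0}
normalized = normalize_cascade(toy_cascade)
remaining = time_left_to_play_out(toy_cascade, window_end=730.0)
print(normalized, remaining)
```

A cascade that starts late has a small `remaining` value, which is one way the original cascades could end up shorter than simulated ones that all get the full window.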

bdesmarais commented 6 years ago

I see... this may indicate a more fundamental issue regarding how the diffusion time likelihoods are approximated in netinf. Only the observed exact diffusion times are taken into account. However, for each cascade incident, we know both the exact diffusion time between the sender and recipient, and that the diffusion times from the other potential senders exceeded that of the sender-recipient pair.

Suppose A, B, and C adopted before D, and we infer that B was D's source. The likelihood of that event is not only $f_{D|B}(t_D - t_B)$, but $f_{D|B}(t_D - t_B)\,(1 - F_{D|C}(t_D - t_C))\,(1 - F_{D|A}(t_D - t_A))$. This second expression reflects the fact that the B->D diffusion time was the first past the post among three potential sources. $F$ is a cumulative distribution function that depends on whether the potential source sends a diffusion tie to the recipient.

We may need to change how the diffusion time likelihood factors into netinf. For each adoption, there would be a term added to the log likelihood corresponding to each node that adopted previously. And in the duration model estimation step, we should incorporate a term for each non-adopting node that reflects the probability of the node not adopting within the observation window. This will definitely add some complexity, but netinf currently ignores censored diffusion times. It makes sense that we would infer distributions with overly fast diffusion times, since we are currently omitting information about lots of long diffusion times.
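
A minimal sketch of the two kinds of terms described above, assuming exponential diffusion times. This is not how netinf is implemented: the function names are hypothetical, and the two rates (one for potential sources that send a tie to the recipient, one for those that do not) are made-up parameters standing in for the dependence of F on tie presence.

```python
import math


def exp_log_density(duration, rate):
    """log f(duration) for an exponential distribution."""
    return math.log(rate) - rate * duration


def exp_log_survival(duration, rate):
    """log(1 - F(duration)) for an exponential distribution."""
    return -rate * duration


def adoption_log_likelihood(t_adopt, source, earlier, ties, rate_tie, rate_no_tie):
    """Log likelihood of one adoption with `source` as the inferred sender.

    `earlier` maps every previously adopting node (including `source`) to its
    adoption time; `ties` is the set of nodes that send a tie to the adopter.
    """
    total = 0.0
    for node, time in earlier.items():
        rate = rate_tie if node in ties else rate_no_tie
        duration = t_adopt - time
        if node == source:
            total += exp_log_density(duration, rate)   # first past the post
        else:
            total += exp_log_survival(duration, rate)  # had not yet "arrived"
    return total


def non_adopter_log_likelihood(t_end, adopters, ties, rate_tie, rate_no_tie):
    """Log probability that a node did not adopt by the end of the window."""
    return sum(
        exp_log_survival(t_end - time, rate_tie if node in ties else rate_no_tie)
        for node, time in adopters.items()
    )


# The A, B, C -> D example: B is the inferred source and the only tie into D.
earlier = {"A": 0.0, "B": 1.0, "C": 2.0}
ll_d = adoption_log_likelihood(
    t_adopt=3.0, source="B", earlier=earlier, ties={"B"},
    rate_tie=0.5, rate_no_tie=0.1,
)
ll_censored = non_adopter_log_likelihood(
    t_end=10.0, adopters=earlier, ties=set(), rate_tie=0.5, rate_no_tie=0.1,
)
print(ll_d, ll_censored)
```

Dropping the survival terms, as described above for the current approach, leaves only the density term for the B to D time, so long waiting times from A and C never enter the estimate.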


In reply to Fridolin Linder's comment of Thu, Mar 8, 2018 at 1:43 PM. The cascade samples referenced there are at https://www.dropbox.com/s/8dzs9jon6atsvsw/original_cascades_sample.png?dl=0 (original) and https://www.dropbox.com/s/ynq86cj0egk1vay/simulated_cascades_sample.png?dl=0 (simulated).