eqasim-org / ile-de-france

An open synthetic population of Île-de-France for agent-based transport simulation
GNU General Public License v2.0
51 stars 71 forks source link

Statistical bug in primary locations attribution #237

Closed jcombaz closed 5 months ago

jcombaz commented 6 months ago

In the file synthesis/population/spatial/primary, the following code introduces significant spatial correlations between home and primary locations, since duplicated values are consecutive in the data generated by numpy.repeat (which is used to sample locations among candidates). In particular, the probability of two commutes having the exact same origins and destinations is significantly larger than what it should be.

   location_ids = np.repeat(location_ids, location_counts)

    # Construct a data set for all commutes to this zone
    origin_id = np.repeat(df_flow["origin_id"].values, df_flow["count"].values)

    df_result = pd.DataFrame.from_records(dict(
        origin_id = origin_id,
        location_id = location_ids
    ))

To fix this the attribution of origins to destinations should be made independent of the order in the data sets, e.g.:

   location_ids = np.repeat(location_ids, location_counts)

    # Construct a data set for all commutes to this zone
    origin_id = np.repeat(df_flow["origin_id"].values, df_flow["count"].values)

    np.random.shuffle(origin_id)
    np.random.shuffle(location_ids)

    df_result = pd.DataFrame.from_records(dict(
        origin_id = origin_id,
        location_id = location_ids
    ))
sebhoerl commented 6 months ago

Hi Jacques, thanks for bringing this up. I just thought it through, and I think I agree. Though shuffling one is probably sufficient. I have the feeling that maybe the multinomial sampler was implemented differently before at some point and then this error got introduced.

Just out of interest, do you have any statistical analysis that shows the before and after?

jcombaz commented 6 months ago

Hi!

Yes, I think one shuffing is enough but I haven't tested it.

I discovered this "statistical bug" when developing alternative approaches to Aina's analysis of carpooling. Here is an example of an analysis of commute trips based on the synthetic population, in which two trips can be carpooled iff their origins, destinations and departure times are close enough. The following graph shows the total number of "carpoolable" kilometers depending on the spatial constraint on origins and destinations (in meters). The green curve corresponds to the case where origins and destinations are not shuffled, and the purple curve when they are shuffled. The relative difference between these two curves can be significant for smaller values of the spatial parameter.

Regards,

Jacques.

-- Verimag - Bâtiment IMAG - Université Grenoble Alpes 150 place du torrent 38401 Saint Martin d’Hères @.*** www-verimag.imag.fr/~jcombaz phone: +33 (0)4 57 42 22 10 fax: +33 (0)4 57 42 22 22

De: "Sebastian Hörl" @.> À: "eqasim-org/ile-de-france" @.> Cc: "Jacques Combaz" @.>, "Author" @.> Envoyé: Lundi 27 Mai 2024 18:26:43 Objet: Re: [eqasim-org/ile-de-france] Statistical bug in primary locations attribution (Issue #237)

Hi Jacques, thanks for bringing this up. I just thought it through, and I think I agree. Though shuffling one is probably sufficient. I have the feeling that maybe the multinomial sampler was implemented differently before at some point and then this error got introduced.

Just out of interest, do you have any statistical analysis that shows the before and after?

— Reply to this email directly, [ https://github.com/eqasim-org/ile-de-france/issues/237#issuecomment-2133800761 | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/AGNBKMA67ALEBJUJEQWX4JDZENNEHAVCNFSM6AAAAABILEPA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZTHAYDANZWGE | unsubscribe ] . You are receiving this because you authored the thread. Message ID: @.***>

sebhoerl commented 6 months ago

Ok, makes sense. Do you want to create a PR with the fix? Otherwise I can look into it end of the week.

jcombaz commented 6 months ago

Yes, I can do that. Should I do my PR directly on the develop branch?

Jacques.

-- Verimag - Bâtiment IMAG - Université Grenoble Alpes 150 place du torrent 38401 Saint Martin d’Hères @.*** www-verimag.imag.fr/~jcombaz phone: +33 (0)4 57 42 22 10 fax: +33 (0)4 57 42 22 22

De: "Sebastian Hörl" @.> À: "eqasim-org" @.> Cc: "Jacques Combaz" @.>, "Author" @.> Envoyé: Mardi 28 Mai 2024 16:14:56 Objet: Re: [eqasim-org/ile-de-france] Statistical bug in primary locations attribution (Issue #237)

Ok, makes sense. Do you want to create a PR with the fix? Otherwise I can look into it end of the week.

— Reply to this email directly, [ https://github.com/eqasim-org/ile-de-france/issues/237#issuecomment-2135326405 | view it on GitHub ] , or [ https://github.com/notifications/unsubscribe-auth/AGNBKMEDCHKP7QQWJOVZKHLZESGOBAVCNFSM6AAAAABILEPA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZVGMZDMNBQGU | unsubscribe ] . You are receiving this because you authored the thread. Message ID: @.***>

sebhoerl commented 6 months ago

Yes, thanks!


From: jcombaz @.> Sent: 28 May 2024 16:36 To: eqasim-org/ile-de-france @.> Cc: Sebastian HORL @.>; Comment @.> Subject: Re: [eqasim-org/ile-de-france] Statistical bug in primary locations attribution (Issue #237)

Yes, I can do that. Should I do my PR directly on the develop branch?

Jacques.

-- Verimag - Bâtiment IMAG - Université Grenoble Alpes 150 place du torrent 38401 Saint Martin d’Hères @.*** www-verimag.imag.fr/~jcombaz phone: +33 (0)4 57 42 22 10 fax: +33 (0)4 57 42 22 22

De: "Sebastian Hörl" @.> À: "eqasim-org" @.> Cc: "Jacques Combaz" @.>, "Author" @.> Envoyé: Mardi 28 Mai 2024 16:14:56 Objet: Re: [eqasim-org/ile-de-france] Statistical bug in primary locations attribution (Issue #237)

Ok, makes sense. Do you want to create a PR with the fix? Otherwise I can look into it end of the week.

— Reply to this email directly, [ https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2fgithub.com%2feqasim%2dorg%2file%2dde%2dfrance%2fissues%2f237%23issuecomment%2d2135326405&umid=0f209477-7367-47e0-9804-7490e3a439b2&auth=b6005005a7b50bc3a68b2003f1e38d069f93f262-7ecdbc96cc0071acbaa3bec49c805354fc04615a | view it on GitHub ] , or [ https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2fgithub.com%2fnotifications%2funsubscribe%2dauth%2fAGNBKMEDCHKP7QQWJOVZKHLZESGOBAVCNFSM6AAAAABILEPA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZVGMZDMNBQGU&umid=0f209477-7367-47e0-9804-7490e3a439b2&auth=b6005005a7b50bc3a68b2003f1e38d069f93f262-8bf2fcdb6870404c38e083e9274e12861dbf8609 | unsubscribe ] . You are receiving this because you authored the thread. Message ID: @.***>

— Reply to this email directly, view it on GitHubhttps://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2fgithub.com%2feqasim%2dorg%2file%2dde%2dfrance%2fissues%2f237%23issuecomment%2d2135396002&umid=0f209477-7367-47e0-9804-7490e3a439b2&auth=b6005005a7b50bc3a68b2003f1e38d069f93f262-ee06a7da9fdb87251a482b2a59151f986101f9fb, or unsubscribehttps://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2fgithub.com%2fnotifications%2funsubscribe%2dauth%2fAAE6CTHTNNQSIMNRRYDPVGTZESJABAVCNFSM6AAAAABILEPA62VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZVGM4TMMBQGI&umid=0f209477-7367-47e0-9804-7490e3a439b2&auth=b6005005a7b50bc3a68b2003f1e38d069f93f262-7dc3e4811ca49903bbe631281f7494076a66ce75. You are receiving this because you commented.Message ID: @.***>