XAI-ANITI / ethik

:mag_right: A toolbox for fair and explainable machine learning
https://xai-aniti.github.io/ethik/
GNU General Public License v3.0

Query building in CacheExplainer() doesn't avoid data holes #116

Open Vayel opened 4 years ago

Vayel commented 4 years ago

Basically, we are doing this:

low, high = X_test.quantile(q=[alpha, 1.0 - alpha])
taus = np.linspace(low, high, num=n_taus)

It gives us:

Top: density of data samples. Bottom: influence on fake `y_pred` data.

Instead, we should be doing this:

q = np.linspace(alpha, 1.0 - alpha, num=n_taus)
taus = X_test.quantile(q=q)

The problem is that our current convention is that tau == 0 corresponds to the mean, but the mean probably doesn't correspond to any quantile in q.

@MaxHalford I would suggest getting rid of taus and just talking about quantiles (with a special value to identify the original mean).
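
To make the difference between the two grids concrete, here is a minimal sketch on synthetic data. The bimodal feature `x`, the hole boundaries and the helper `n_in_hole` are assumptions made up for illustration; they are not part of ethik. It uses `np.quantile` on a plain array rather than `X_test.quantile`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bimodal feature: two clusters with an empty region around 0.
x = np.concatenate([rng.normal(-5, 0.5, 500), rng.normal(5, 0.5, 500)])

alpha, n_taus = 0.05, 10

# Current behaviour: evenly spaced values between two extreme quantiles.
low, high = np.quantile(x, [alpha, 1.0 - alpha])
linear_grid = np.linspace(low, high, num=n_taus)

# Proposed behaviour: quantiles taken at evenly spaced probability levels.
q = np.linspace(alpha, 1.0 - alpha, num=n_taus)
quantile_grid = np.quantile(x, q)


def n_in_hole(grid, lo=-2.0, hi=2.0):
    """Count grid values that fall in the empty region between the clusters."""
    return int(np.sum((grid > lo) & (grid < hi)))


print("linear grid values in the hole:  ", n_in_hole(linear_grid))
print("quantile grid values in the hole:", n_in_hole(quantile_grid))
# The original mean (close to 0 here) is not one of the quantile grid values.
print("mean is on the quantile grid:", bool(np.any(np.isclose(quantile_grid, x.mean()))))
```

Note that `np.quantile` interpolates linearly by default, so a probability level that falls exactly between the two clusters (for example 0.5 with an odd `n_taus`) can still land in the hole.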

MaxHalford commented 4 years ago

Sure that makes more sense! I remember using quantiles very early on in the project and it worked just as well.

lrisser commented 4 years ago

Good idea to me too.

Vayel commented 4 years ago

Looks better:

image

Vayel commented 4 years ago

But not always:

image

image

It's not that easy!

Vayel commented 4 years ago

Actually, it doesn't really make sense to focus on quantiles only, as we are talking about the mean. There's no reason why it should be equal to a quantile, especially for binary features (for which about 50% of the samples are equal to 0 and the rest are equal to 1).

Instead, I suggest keeping the current behaviour but telling the user when a target mean is unrealistic (like 25 on the plot below).

image

Perhaps we could enable the user to define a threshold on a criterion (like the proportion of individuals who capture 50% of the weight) and use it to filter the target means?
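
As a rough sketch of what such a criterion could look like, the code below reweights the samples so that their weighted mean hits a target, then measures the smallest share of samples that holds 50% of the total weight. The exponential weighting `w_i ∝ exp(xi * x_i)` is an assumption made for illustration and is not taken from ethik's code; `tilted_weights`, `share_holding_mass` and the synthetic `age` feature are hypothetical names.

```python
import numpy as np


def tilted_weights(x, target_mean, tol=1e-10, max_iter=200):
    """Weights proportional to exp(xi * x) whose weighted mean is target_mean.

    The weighting scheme is an assumption; xi is found by bisection.
    """
    x = np.asarray(x, dtype=float)
    if not x.min() < target_mean < x.max():
        raise ValueError("target_mean must lie strictly inside the data range")
    mu, sigma = x.mean(), x.std()
    z = (x - mu) / sigma               # standardise for numerical stability
    target = (target_mean - mu) / sigma

    def weights(xi):
        logits = xi * z
        w = np.exp(logits - logits.max())  # subtract the max to avoid overflow
        return w / w.sum()

    # The weighted mean increases with xi, so widen the bracket, then bisect.
    lo, hi = -1.0, 1.0
    while weights(lo) @ z > target:
        lo *= 2
    while weights(hi) @ z < target:
        hi *= 2
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if weights(mid) @ z < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return weights(0.5 * (lo + hi))


def share_holding_mass(w, mass=0.5):
    """Smallest fraction of samples whose weights sum to at least `mass`."""
    cumulative = np.cumsum(np.sort(w)[::-1])   # heaviest samples first
    k = min(int(np.searchsorted(cumulative, mass)) + 1, len(w))
    return k / len(w)


rng = np.random.default_rng(0)
age = rng.normal(38, 13, 5000)  # hypothetical feature, not a real dataset

for target in (age.mean(), 30.0, 25.0):
    w = tilted_weights(age, target)
    print(f"target mean {target:5.1f}: "
          f"{share_holding_mass(w):.1%} of the samples hold 50% of the weight")
```

A user-defined threshold could then be applied to the returned share to drop target means that rely on too few individuals; where to set that threshold is left open here.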

MaxHalford commented 4 years ago

Maybe a stupid question: shouldn't the confidence interval be large around unrealistic values?

We could also add the KDE of the values at the bottom of the plot, a bit like what is done here. This would give a visual cue of unreliable regions.

Vayel commented 4 years ago

I'll check for the confidence interval.

I'd say the KDE is not sufficient. On the plot above, the original density (the black curve) has a similar value at age = 25 as at age = 38 (the original mean). Yet, shifting the mean to 25 gives a distribution that differs quite a lot from the original. The density contains all the information but is not easy enough to read, I guess.

lrisser commented 4 years ago

I totally agree with this idea!

regards, Laurent

lrisser commented 4 years ago

To me, the confidence interval should indeed be wide when a small number of observations carry most of the weight. Using these intervals in standard graphs is a way to check how confident we are in the curves. Warning the user when too few observations account for, say, 50% of the weight is also a good alternative.

Vayel commented 4 years ago

Unfortunately, the confidence interval doesn't "work":

image

We should have a large interval for extreme ages. It's more or less the case for old people but not for young ones.

lrisser commented 4 years ago

Did the algorithm crash, or do you believe this is for another reason? We can talk about it...

Vayel commented 4 years ago

No, it didn't crash. Are you here tomorrow?

Here is a plot we could do:

image

lrisser commented 4 years ago

Yes, in the afternoon... let's talk after lunch then.

Vayel commented 4 years ago

Some plots about KDE. Chart above is the 2D explanation with ethik. Chart below is the dataset density with points sampled from it.

We can see that we can reach target means where there are no points, so the density doesn't seem to be a good criterion to find the valid target means.

image

image

image

image
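
A one-dimensional analogue of this observation can be sketched as follows. Everything here is an assumption for illustration: the bimodal sample, the hand-written `gaussian_kde_1d` helper and the bandwidth. The plots above are 2D and were produced with ethik, which this sketch does not use. The original mean of the sample is trivially reachable (it needs no reweighting at all), yet the estimated density there is essentially zero.

```python
import numpy as np


def gaussian_kde_1d(samples, points, bandwidth):
    """Evaluate a Gaussian kernel density estimate at `points`."""
    samples = np.asarray(samples, dtype=float)
    points = np.atleast_1d(np.asarray(points, dtype=float))
    z = (points[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth


rng = np.random.default_rng(0)
# Hypothetical bimodal feature: no samples around 0, which is also the mean.
x = np.concatenate([rng.normal(-5, 0.5, 500), rng.normal(5, 0.5, 500)])

bandwidth = 0.3  # arbitrary choice for this sketch
density_at_mean, density_at_mode = gaussian_kde_1d(x, [x.mean(), 5.0], bandwidth)

print(f"original mean:       {x.mean():.3f}")
print(f"density at the mean: {density_at_mean:.3e}")
print(f"density at a mode:   {density_at_mode:.3e}")
```

Filtering target means on a density threshold would discard the original mean in this example, which is consistent with the conclusion that density alone is not a good validity criterion.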