drivendataorg / deon

A command line tool to easily add an ethics checklist to your data science projects.
https://deon.drivendata.org/
MIT License

Add human review #140

Closed · glipstein closed this 2 years ago

glipstein commented 2 years ago

Addresses issue #130

Finally getting this PR in - thanks @ejm714 for the pointer to the instructions as they're getting updated

This includes one example already. If others come to mind, feel free to add them as well!

codecov[bot] commented 2 years ago

Codecov Report

Merging #140 (13ea3b5) into main (af69d34) will not change coverage. The diff coverage is n/a.

:exclamation: Current head 13ea3b5 differs from pull request most recent head f3f1424. Consider uploading reports for the commit f3f1424 to get more accurate results.

@@           Coverage Diff           @@
##             main     #140   +/-   ##
=======================================
  Coverage   96.80%   96.80%           
=======================================
  Files           6        6           
  Lines         188      188           
=======================================
  Hits          182      182           
  Misses          6        6           

Continue to review full report at Codecov.


ejm714 commented 2 years ago

Definitely appreciate this item on the checklist, as it feels like the precursor to redress. Having a plan for responding if users are harmed is only useful if we also have a plan for monitoring whether users are being harmed.

I could see this being integrated into the redress item (e.g., "what is our plan to monitor and respond..."), but I have no strong preference either way.

~~Per the WIP contribution instructions, after you modify those two yml files (as is correctly done), you need to run `make reqs` and `make build` and commit all the changed files: https://github.com/drivendataorg/deon/pull/139/files#diff-eca12c0a30e25b4b46522ebf89465a03ba72a03f540796c979137931d8f92055R51-R60~~

~~This will update the example checklist files that live here: https://github.com/drivendataorg/deon/tree/main/examples~~

Edit: I removed the output from `make build` given that https://github.com/drivendataorg/deon/pull/141 was merged.

glipstein commented 2 years ago

Thanks @ejm714!

Also just jotting down from our chat: agreed this could work separately or integrated with redress. The primary thing to consider is probably the extent to which the action items and conversations around setting up human monitoring of impacts and planning redress are happening together.

I could see some separation there (e.g., what's the redress process if there is a complaint vs. how to review model outputs on an ongoing basis), but I also don't have a strong preference. I do think the examples will likely overlap. It may also be that E.2 could be shortened a bit in light of E.1.

pjbull commented 2 years ago

For me, this is a specific case of a more general practice that "concept drift" also falls into. I'd suggest that we call this "Monitoring and evaluation" and remove the separate "concept drift" item.

Text would be something like:

How are we planning to monitor the algorithm's performance after it is deployed? Are high-stakes decisions reviewed by humans? Is there a regular audit of a random set of predictions? How are we monitoring for concept drift to ensure the model remains fair over time?
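For concreteness, here is a minimal sketch of how a merged item along these lines might look in the checklist definition yml mentioned above. The field names (`line_id`, `line_summary`, `line`), the `E.1` ID, and the placement under a Deployment section are assumptions for illustration, not the final structure or wording:

```yaml
# Hypothetical entry in the checklist yml; field names, IDs, and placement are assumed.
sections:
  - title: Deployment
    section_id: E
    lines:
      - line_id: E.1
        line_summary: Monitoring and evaluation
        line: >-
          How are we planning to monitor the algorithm's performance after it is
          deployed? Are high-stakes decisions reviewed by humans? Is there a
          regular audit of a random set of predictions? How are we monitoring
          for concept drift to ensure the model remains fair over time?
```

If something like this shape is right, the example checklists under examples/ would then be regenerated from it per the contribution instructions referenced above.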

glipstein commented 2 years ago

I like that. I agree those fit in the same item.

I'll take a pass at an update. I'm still not sure the questions get at the human impacts of the model at scale (this seems different from "algorithm performance"), or at the idea that someone needs to be accountable for monitoring those consequences. This may be what you're angling towards with "are high-stakes decisions reviewed by humans?", though that sounds more like Facebook's review board, which is a different thing. Any suggestions welcome.

Here is the example, which is a blatant case of a broad issue: https://www.vice.com/en/article/jgq35d/how-a-discriminatory-algorithm-wrongly-accused-thousands-of-families-of-fraud

pjbull commented 2 years ago

I'm still not sure the questions get at the human impacts of the model at scale (this seems different from "algorithm performance")

Yeah, maybe we can add something to that question about if the algorithmic decisions are causing downstream impacts that need evaluation—not just if the decision itself is wrong. That said, the example that we are citing isn't strictly different from algorithm performance, since the families were wrongly denied benefits.

IMO, in the framing above, neither "human impacts" nor "model at scale" is a relevant moral factor. Impacts on systems or animals seem equally relevant, as do seriously bad outcomes for a single person versus a large group.

I do think this suggests a potential separate item about exploring the differences from a manual system when enabling an algorithmic decision process: does the speed of predictions, volume of decisions, or inflexibility (i.e., human judgments can handle "new" variables in each decision) pose risks that aren't present in the equivalent human process?

glipstein commented 2 years ago

Yeah, maybe we can add something to that question about if the algorithmic decisions are causing downstream impacts that need evaluation—not just if the decision itself is wrong.

Will take a pass

I do think this suggests a potential separate item about exploring the differences from a manual system when enabling an algorithmic decision process: does the speed of predictions, volume of decisions, or inflexibility (i.e., human judgments can handle "new" variables in each decision) pose risks that aren't present in the equivalent human process?

I like this thought. This also gets at what was intended by the "model at scale" piece, since it's both human vs. machine and small development/prototype vs. deployment at scale. This was a big topic in Cathy O'Neil's *Weapons of Math Destruction* back in the day (still relevant).

codecov-commenter commented 2 years ago

Codecov Report

Merging #140 (fc61a9a) into main (3907956) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #140   +/-   ##
=======================================
  Coverage   96.80%   96.80%           
=======================================
  Files           6        6           
  Lines         188      188           
=======================================
  Hits          182      182           
  Misses          6        6           

Continue to review full report at Codecov.


glipstein commented 2 years ago

@pjbull @ejm714 a pass is ready for 👀; trying to keep to the one-question-per-item paradigm while recognizing that there is a lot to cover.

glipstein commented 2 years ago

Also added issue #147 for a potential separate item on systematic effects. cc @jayqi @ejm714 @pjbull