cityofaustin / atd-data-tech

Austin Transportation Data & Technology Services

Create set of redacted crash narratives for intersection visibility research #18522

Closed · johnclary closed this 2 months ago

johnclary commented 2 months ago

Michael K. has asked us to provide him with a set of 5k-10k crash narratives that can be shared with their contractor, who is assisting with research related to pedestrian-involved crashes at intersections.

This will involve:

johnclary commented 2 months ago

I cc'd Xavier on our email thread and asked if he could help with the query:

Xavier, would you be able to help craft a query for this, or translate what's being requested to DB columns so that I can write the query?

ten years crashes involving pedestrians at intersections on Level 1 and Level 2 streets
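
A rough sketch of what that query could look like, purely for illustration; the table and column names below (crashes, pedestrian_involved, at_intersection, street_level) are placeholders and would need to be confirmed against the actual VZ database schema:

```python
import psycopg2  # assuming a Postgres database; any DB driver would do

# All table and column names below are placeholders and would need to be
# mapped to the real Vision Zero schema.
QUERY = """
SELECT crash_id, crash_date, investigator_narrative
FROM crashes
WHERE crash_date >= now() - interval '10 years'  -- ten years of crashes
  AND pedestrian_involved = TRUE                 -- hypothetical pedestrian flag
  AND at_intersection = TRUE                     -- hypothetical intersection flag
  AND street_level IN (1, 2)                     -- hypothetical Level 1 / Level 2 column
"""

with psycopg2.connect("dbname=visionzero") as conn:  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(QUERY)
        rows = cur.fetchall()

print(f"{len(rows)} candidate narratives")
```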

xavierapostol commented 2 months ago


johnclary commented 2 months ago

We've received a set of narratives to redact. @Charlie-Henry will take it from here 🤖

johnclary commented 2 months ago

I sent Michael the results of Charlie running the set of narratives through the LLM. TLDR, Michael will need to redact most of the narratives by hand.

Charlie was able to run our narrative redaction tool on this dataset. The tool works by flagging crash narratives that may contain any of the kinds of personal data that we itemize in our privacy policy, which is itself a list copied from Texas state law.
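
As a rough illustration of that flagging pass (not the actual tool's code; the category list and the call_llm helper below are placeholders):

```python
import json

# Illustrative category list only; the real tool uses the personal data
# categories itemized in the privacy policy / Texas state law.
CATEGORIES = [
    "name",
    "address",
    "phone number",
    "license plate",
    "none of the above",
]

def flag_narrative(narrative: str, call_llm) -> dict:
    """Ask a model which personal-data categories, if any, appear in a narrative.

    call_llm is a placeholder for whatever model client the tool actually uses;
    it is assumed to take a prompt string and return a JSON string.
    """
    prompt = (
        "Review the crash narrative below. For each category of personal data, "
        "answer true or false, returning a JSON object with one boolean per category.\n\n"
        f"Categories: {', '.join(CATEGORIES)}\n\n"
        f"Narrative:\n{narrative}"
    )
    return json.loads(call_llm(prompt))
```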

Unfortunately, the tool flagged 150 of the 240 narratives for potentially containing personal data. In the past we worked with much larger datasets, so it wasn't a big deal for us to simply discard half of the rows. In this case, those flagged narratives will unfortunately need manual review before they are safe to share.

In the attached CSV, the last five columns indicate if the AI model flagged the narrative for containing some kind of personal data. If the last column, "none of the above", is true, then the narrative is safe to share.
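
If it helps, splitting the attached CSV on that column takes only a few lines of pandas; the filenames below are placeholders, and this assumes the flag columns hold booleans:

```python
import pandas as pd

# Filename is a placeholder for the attached CSV.
df = pd.read_csv("crash_narratives_flagged.csv")

# Narratives where the model checked "none of the above" are safe to share as-is.
safe = df[df["none of the above"] == True]

# Everything else was flagged for at least one category and needs manual review.
needs_review = df[df["none of the above"] != True]

safe.to_csv("narratives_safe_to_share.csv", index=False)
needs_review.to_csv("narratives_needing_review.csv", index=False)
```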

As you can see by spot checking the narratives, the model errs on the side of caution in a big way. Many of the narratives it flagged can probably be safely shared, but they will require your manual review. You can choose to either edit the narratives directly by removing sensitive data, or you can exclude them from your dataset. Again, the list of "personal data" in the privacy policy is the kind of information we must weed out.

I'm sorry that we weren't able to be more helpful in this case! Let me know if I can clarify anything.