agrc / sweeper

🧹A cli tool for making data good 🧹
MIT License

Profanity check SGID data #107

Open gregbunce opened 1 year ago

gregbunce commented 1 year ago

It would be helpful to have a function that scans the names in our data for derogatory terms: think trailheads, trail names, place names, etc. This could be a good opportunity to leverage AI.
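A minimal sketch of what such a check could look like, assuming a hand-maintained word list rather than AI. The function name `find_flagged_names` and the placeholder terms are hypothetical, not part of sweeper:

```python
import re

# Hypothetical placeholder terms; a real list would be curated by the data stewards.
FLAGGED_TERMS = {"badword", "slur"}


def find_flagged_names(names, terms=FLAGGED_TERMS):
    """Return (name, matched_terms) pairs for names containing a flagged term."""
    if not terms:
        return []

    # Whole-word, case-insensitive matching, so a flagged term embedded in a
    # longer, unrelated word is not reported.
    alternatives = "|".join(re.escape(term) for term in sorted(terms))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    hits = []
    for name in names:
        if not name:  # skip NULL or empty attribute values
            continue
        matches = sorted({match.lower() for match in pattern.findall(name)})
        if matches:
            hits.append((name, matches))
    return hits


if __name__ == "__main__":
    sample = ["Badword Creek Trailhead", "Pleasant Valley", None, "SLUR Peak"]
    for name, matches in find_flagged_names(sample):
        print(name, matches)
```

Whole-word matching keeps false positives down, but it will miss misspellings and compound names, so the results would still need a human review.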

gregbunce commented 1 year ago

FYI: we do have a derogatory name in the trailheads data - it's a former name. I'm working on this now to clean it up.

gregbunce commented 1 year ago

A possible solution to look at: https://github.com/surge-ai/profanity

steveoh commented 1 year ago

I think we'd probably stick to GCP or maybe AWS.

https://cloud.google.com/natural-language/docs/moderating-text
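A rough sketch of how the text moderation endpoint linked above could be called from Python, assuming the `google-cloud-language` package is installed and application default credentials are configured. The helper names, the watched category names, and the 0.5 threshold are assumptions for illustration and would need tuning against real data:

```python
def flagged_categories(categories, threshold=0.5,
                       watch=("Toxic", "Derogatory", "Insult", "Profanity")):
    """Keep (name, confidence) pairs in a watched category at or above threshold."""
    return [(name, confidence) for name, confidence in categories
            if name in watch and confidence >= threshold]


def moderate_name(text):
    """Return (category, confidence) pairs from the moderation API for one name.

    Requires google-cloud-language and GCP credentials; imported lazily so the
    filtering helper above can be used and tested without them.
    """
    from google.cloud import language_v2

    client = language_v2.LanguageServiceClient()
    document = language_v2.Document(
        content=text,
        type_=language_v2.Document.Type.PLAIN_TEXT,
    )
    response = client.moderate_text(document=document)
    return [(category.name, category.confidence)
            for category in response.moderation_categories]


if __name__ == "__main__":
    # Hypothetical usage; this makes a billable API call.
    scores = moderate_name("Pleasant Valley Trailhead")
    print(flagged_categories(scores))
```

Place names are short and carry little context, so the confidence scores for them may be less reliable than for full sentences.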

gregbunce commented 8 months ago

Moving this to FY25 Q1. Hopefully things will settle down a bit by then so we can make some progress on it.

steveoh commented 1 week ago

I submitted a Google request to get the "s" word added. Are there any other words we know about in our data that need to be replaced?

I tested the word and it's still not being flagged! Since I couldn't find the original request, I created a new one (internal ref: 377718296). Product let me know that the way I submitted the feature request should work for adding new words. It may still take time to implement, but I'll be able to provide you with updates. Please let me know what other words we should be tracking.

ZachBeck commented 10 hours ago

Removed squaw names from NHD Lakes, NHD Streams, and the UGRC version of the GNIS.