Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Create a HF model for labeling urls as relevant or irrelevant #31

Closed by EvilDrPurple 5 months ago

EvilDrPurple commented 9 months ago

Context

Part of #12. Now that we've created an experimental model for record type labeling, we want to create another model that simply labels URLs as relevant or irrelevant.

Requirements

josh-chamberlain commented 9 months ago

as another step toward #12 we could use the subcategories from here: https://docs.pdap.io/activities/data-dictionaries/record-types-taxonomy

i.e. Police & public interactions, Info about officers, etc.

We're going to start grouping them this way, so the record_type can have less fidelity.

maxachis commented 8 months ago

@EvilDrPurple when we are talking about relevancy, what counts as relevant and irrelevant, and what are representative samples of each?

EvilDrPurple commented 8 months ago

@maxachis I just uploaded a new dataset to the PDAP Hugging Face page. Essentially, a URL is relevant if the page contains data pertaining to criminal justice; otherwise it is irrelevant. You can look at the record types taxonomy link Josh posted above to see the categories of webpages that are considered "relevant".

Hope this helps!

josh-chamberlain commented 8 months ago

yes, "relevant" ≠ "useful". we're just checking whether it contains information about a police, jail, or court agency in the United States.
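
To make the relevant/irrelevant distinction in this thread concrete, here is a minimal sketch of how taxonomy subcategories could be collapsed into a binary label for training. It is an illustration under assumptions, not the project's actual pipeline: the names `RELEVANT_SUBCATEGORIES`, `LABEL2ID`, `relevance_label`, and `to_training_rows` are hypothetical, only the two subcategories named above are listed, and the example URLs are made up.

```python
# Assumption: only the two subcategories named in this thread are listed here;
# the full set would come from the record types taxonomy.
RELEVANT_SUBCATEGORIES = {
    "Police & public interactions",
    "Info about officers",
}

# Hypothetical label mapping for a binary text-classification model.
LABEL2ID = {"irrelevant": 0, "relevant": 1}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def relevance_label(subcategory):
    """Collapse a taxonomy subcategory (or None) into a binary relevance label."""
    return "relevant" if subcategory in RELEVANT_SUBCATEGORIES else "irrelevant"


def to_training_rows(annotated_urls):
    """Turn (url, subcategory) pairs into rows with a text field and an integer label."""
    return [
        {"text": url, "label": LABEL2ID[relevance_label(subcategory)]}
        for url, subcategory in annotated_urls
    ]


# Made-up example URLs, for illustration only.
rows = to_training_rows([
    ("https://example.gov/police/use-of-force-reports", "Police & public interactions"),
    ("https://example.com/recipes", None),
])
```

Keeping the label mapping separate from the subcategory list means the taxonomy can grow without changing the binary labels the model is trained on.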