Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

experiment with a model for record type categories #42

Closed: josh-chamberlain closed this 4 months ago

josh-chamberlain commented 6 months ago

How much better does the model perform if, instead of asking it for record_type, we ask it for record_type_category? We can use the existing data for this. It also makes annotation much easier.

The more granular record types are useful for people, but we may be able to get by without them.

https://docs.pdap.io/activities/data-dictionaries/record-types-taxonomy

mbodeantor commented 6 months ago

Probably easiest to merge the current labeled data with a table that maps each record_type to a record_type_category, then use record_type_category as the label instead.
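A minimal sketch of that relabeling step, assuming the labeled data is a list of rows with `url` and `record_type` fields. The mapping entries, the `add_category` helper, and the example URLs are illustrative assumptions, not the project's actual data; the taxonomy doc linked above is the authoritative source for the real mapping.

```python
# Hypothetical, partial mapping of fine labels to coarse labels.
# The category names here are placeholders; take the real ones from the taxonomy doc.
RECORD_TYPE_TO_CATEGORY = {
    "Arrest Records": "Police & Public Interactions",
    "Calls for Service": "Police & Public Interactions",
    "Personnel Records": "Info About Officers",
    "Court Cases": "Jails & Courts",
}


def add_category(rows, mapping=RECORD_TYPE_TO_CATEGORY):
    """Return copies of the rows with a record_type_category field added.

    Rows whose record_type is missing from the mapping get None, so they
    can be reviewed instead of being silently mislabeled.
    """
    return [
        {**row, "record_type_category": mapping.get(row["record_type"])}
        for row in rows
    ]


# Toy labeled data (made up for this example).
labeled = [
    {"url": "https://example.com/arrests", "record_type": "Arrest Records"},
    {"url": "https://example.com/dockets", "record_type": "Court Cases"},
    {"url": "https://example.com/other", "record_type": "Something Unmapped"},
]

coarse = add_category(labeled)

# URLs that still need a category before training on the coarse label.
unmapped = [row["url"] for row in coarse if row["record_type_category"] is None]
```

If the labeled data lives in a pandas DataFrame instead, a left `merge` against a two-column mapping table on `record_type` would do the same job.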

@josh-chamberlain Does this association exist somewhere already or would we need to create it?

josh-chamberlain commented 6 months ago

@mbodeantor it only exists in that doc; it isn't anywhere in the data.

bonjarlow commented 6 months ago

I've made a map from label -> label_category. The updated CSV of [url, label, label_category,...] will be up on Hugging Face, or anywhere else you'd like it.

josh-chamberlain commented 4 months ago

@bonjarlow do you want to link your Hugging Face work to this? Do you think there's anything else to do before calling this closed? We can always continue to refine the model, but I think you did it.

bonjarlow commented 4 months ago

Agreed. The conclusion is that the record type category (coarse label) performs better than the fine labels.

https://huggingface.co/PDAP/coarse-url-classifier