mbodeantor closed this issue 3 months ago
Something to keep in mind: Data Sources have one record_type
Example URLs linked here: https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/13#issuecomment-1696033512
I think using html-tag-collector could be very helpful here! The regex might get more hits, which could be problematic, but some of the base URLs aren't super helpful.
Could also try some ML based on the record types already assigned in the data sources table
Updated issue to focus on ML instead of regex
If this works well for record_type, we should get name and description (if we can get it to do a good job), and agency_described. We might want to have it do geography_described too (not something we currently store), which is often helpful for search/disambiguation or cases where the agency is unclear or aggregated.
Just for clarity's sake, the name and description creation functionality has been made into a separate issue at #43
With regards to fine-tuning a pretrained LLM, I may need more context as to what that means in this context -- is the idea to host our own LLM which we give prompts to? Or to utilize a fine-tuned model such as what OpenAI offers?
@maxachis I think the pattern will be fine-tuning a model using hugging face. We have two models which try to do this already:
https://huggingface.co/PDAP/coarse-url-classifier https://huggingface.co/PDAP/url-classifier
I'm calling this one closed because of the models we have published. They leave something to be desired in terms of accuracy, but they exist, and we have plenty of issues for refining them.
Since ChatGPT costs can add up fast, experiment with Hugging Face's open-source APIs to fine-tune a pretrained LLM using the record_type labels from the data_sources table in the database and the labeled data from Doccano: https://huggingface.co/organizations/PDAP/share/HfuZkjoUlvkgThjZwiSoEakjDdeoOhxENO
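One piece of this is turning the record_type strings into the integer ids a sequence-classification head trains on. Here's a minimal stdlib sketch of that step; the column names ("url", "record_type") and the sample labels are assumptions for illustration, not the actual export schema:

```python
import csv
import io

# Inline stand-in for the labeled export; real code would open the CSV file.
# Column names here are hypothetical.
sample_csv = """url,record_type
https://example.gov/logs,Incident Reports
https://example.gov/roster,Personnel Records
https://example.gov/calls,Calls for Service
https://example.gov/cad,Calls for Service
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Stable string<->int label mappings; Hugging Face classification models
# accept these as the label2id / id2label config arguments.
labels = sorted({row["record_type"] for row in rows})
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# (text, label_id) pairs ready for tokenization and training.
dataset = [(row["url"], label2id[row["record_type"]]) for row in rows]
print(label2id)
print(dataset)
```

Keeping label2id/id2label alongside the model matters because the checkpoint only predicts integers; without the mapping you can't decode predictions back to record types.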
To prep the dataset, tokenize the text of each webpage to create the training inputs for the model: https://huggingface.co/docs/transformers/training
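For the tokenization step, the real pipeline would use a transformers tokenizer, roughly `tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")` then `tok(texts, truncation=True, padding=True)` (the model name is just an example). Since that requires downloading a checkpoint, here is a stdlib stand-in that only illustrates the padded input_ids / attention_mask structure the Trainer consumes; the vocab is made up:

```python
def tokenize(texts, vocab, max_len=8, pad_id=0, unk_id=1):
    """Map whitespace tokens to ids, then truncate/pad each text to max_len."""
    batch = {"input_ids": [], "attention_mask": []}
    for text in texts:
        ids = [vocab.get(tok, unk_id) for tok in text.lower().split()][:max_len]
        mask = [1] * len(ids)            # 1 = real token, 0 = padding
        pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_id] * pad)
        batch["attention_mask"].append(mask + [0] * pad)
    return batch

# Toy vocabulary; a real tokenizer learns subword pieces from a corpus.
vocab = {"police": 2, "incident": 3, "report": 4, "daily": 5}
enc = tokenize(["Daily incident report", "police blotter"], vocab)
print(enc["input_ids"])       # "blotter" is out-of-vocab, so it maps to unk_id
print(enc["attention_mask"])
```

The attention mask is what lets the model ignore padding positions, which is why every batch entry carries both lists in parallel.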
Doccano data: labeled_231207.csv