Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

Assign Record Type Using LLM #12

Closed mbodeantor closed 3 months ago

mbodeantor commented 1 year ago

Since ChatGPT costs can add up fast, experiment with Hugging Face's open-source APIs to fine-tune a pretrained LLM, using the record_type labels from the data_sources table in the database and the labeled data from Doccano: https://huggingface.co/organizations/PDAP/share/HfuZkjoUlvkgThjZwiSoEakjDdeoOhxENO

To prep the dataset, tokenize the text of each webpage to create inputs for the model, following the Transformers training guide: https://huggingface.co/docs/transformers/training

Doccano data: labeled_231207.csv
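As a rough sketch of the prep described above: the CSV column names (`text`, `label`) and the base checkpoint (`distilbert-base-uncased`) are assumptions for illustration, not confirmed details of the actual Doccano export or pipeline.

```python
# Sketch: prep record_type-labeled data and fine-tune a pretrained model.
# Column names below are assumptions; inspect labeled_231207.csv for the
# real schema before running.
import csv


def build_label_maps(labels):
    """Map record_type strings to integer ids and back (sorted for stability)."""
    unique = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(unique)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_doccano_csv(path):
    """Read (text, label) pairs from a Doccano CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    texts = [r["text"] for r in rows]
    labels = [r["label"] for r in rows]
    return texts, labels


def fine_tune(csv_path="labeled_231207.csv"):
    """Fine-tuning outline per the linked Transformers training guide.
    Not exercised here; requires the transformers library and a GPU/CPU run."""
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    texts, labels = load_doccano_csv(csv_path)
    label2id, id2label = build_label_maps(labels)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encodings = tokenizer(texts, truncation=True, padding=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=len(label2id), label2id=label2id, id2label=id2label,
    )
    # Wrap `encodings` plus the mapped label ids in a torch Dataset, then:
    # Trainer(model=model, args=TrainingArguments("out"), ...).train()
```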

josh-chamberlain commented 1 year ago

Something to keep in mind: each Data Source has exactly one record_type.

Example URLs linked here: https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/13#issuecomment-1696033512

I think using html-tag-collector could be very helpful here! The regex might get more hits, which could be problematic, but some of the base URLs aren't super helpful.

mbodeantor commented 1 year ago

Could also try some ML based on the record types already assigned in the data sources table

mbodeantor commented 10 months ago

Updated issue to focus on ML instead of regex

josh-chamberlain commented 9 months ago

If this works well for record_type, we should have it generate name, description (if we can get it to do a good job), and agency_described.

We might want to have it do geography_described too (not something we currently store), which is often helpful for search/disambiguation or cases where the agency is unclear, or aggregated.

maxachis commented 7 months ago

Just for clarity's sake: the name and description generation functionality has been split into a separate issue, #43.

maxachis commented 7 months ago

With regards to fine-tuning a pretrained LLM, I may need more context as to what that means in this context -- is the idea to host our own LLM which we give prompts to? Or to utilize a fine-tuned model such as what OpenAI offers?

  1. If hosting our own, that'd require quite a few pricing considerations, since cost estimates for hosting an LLM vary wildly depending on implementation and use case.
  2. A fine-tuned model such as OpenAI offers allows for greater specificity, but comes at increased cost. Setting aside the initial training cost, running the lowest-cost fine-tuned GPT model makes inputs 6 times as expensive and outputs 4 times as expensive, comparing the pricing of their models.

josh-chamberlain commented 7 months ago

@maxachis I think the pattern will be fine-tuning a model using Hugging Face. We have two models that try to do this already:

https://huggingface.co/PDAP/coarse-url-classifier
https://huggingface.co/PDAP/url-classifier
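For reference, loading one of these checkpoints for inference would look roughly like this. This is a sketch assuming the checkpoints are standard text-classification models; the `combine_page_signals` helper is hypothetical, motivated by the earlier point that bare base URLs aren't super helpful on their own and html-tag-collector output could enrich them.

```python
def combine_page_signals(url, title="", description=""):
    """Hypothetical helper: join the URL with html-tag-collector fields
    (page title, meta description) into one input string, since a bare
    base URL often isn't informative enough to classify."""
    parts = [url, title, description]
    return " ".join(p for p in parts if p)


def classify_url(text, model_name="PDAP/url-classifier"):
    """Sketch: predict a label for the combined text. Assumes the
    checkpoint works as a standard text-classification model; check
    the model card for the real input format and label set."""
    from transformers import pipeline  # deferred: downloads model weights

    classifier = pipeline("text-classification", model=model_name)
    return classifier(text)[0]["label"]
```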

josh-chamberlain commented 3 months ago

I'm calling this one closed because of the models we have published. They leave something to be desired in terms of accuracy, but they exist, and we have plenty of issues for refining them.