josh-chamberlain opened 4 months ago
Also worth exploring if there is any semi-consistent way to source a description directly from any of the tags
I mentioned this in #16, but taking information from the home page of any URL (which my PR #36 aims to do) is likely to provide additional context, since the home page would be either for the entire police department or for the local government the police department is based out of.
I additionally posed the question to ChatGPT about possible options we could take with this, and its answer seemed useful and relevant. https://chat.openai.com/share/c08fcb30-7012-443d-8a2e-0a8d448e05d7
@maxachis thanks, I clarified the suggested path portion of the issue to lay out the strategy. Most of the pieces are in place / chosen; it's just a matter of connecting them.
@josh-chamberlain Do we have ideas of what model to use? I don't have much in the way of prior experience in NLP, so I'd definitely defer to if someone such as @EvilDrPurple has a better idea of what model to use, but I have begun looking at some existing models that may have promise, such as https://huggingface.co/Falconsai/text_summarization
@maxachis I have not used any models for summarization yet, but if you haven't already found this page it may be of some help. It lists models that can be used for summarization near the top: https://huggingface.co/docs/transformers/tasks/summarization
After doing some preliminary research on this (and bearing in mind that my NLP experience is quite limited), here are my initial thoughts:
@josh-chamberlain I tested the following entry on several use cases. The below is entry 467 from PDAP/urls-and-headers:
url: https://www.sandiego.gov/risk-management/flexible-benefits/fbp-police-safety-members-fy2022
html_title: Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 | City of San Diego Official Website
h1: ["City of San Diego Official Website", "Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022"]
h2: ["Main navigation", "Leisure", "Resident Resources", "Doing Business", "Library", "Public Safety", "City Hall", "Accessibility Tools", "FBP Credits", "FBP Options", "Sharp Plan Additional Information", "Services", "Contact Info", "Orientation Materials", "Additional Resources", "Forms", "Footer"]
h3: ["Medical Plans", "Kaiser Permanente Traditional (HMO) Information", "Kaiser Permanente Traditional (HMO) Premiums", "Kaiser Permanente Deductible (HMO) Information", "Kaiser Permanente Deductible (HMO) Premiums", "Kaiser Partner Site", "Kaiser Additional Information", "Cigna (HMO) Information", "Cigna (HMO) Premiums", "Cigna Scripps Select (HMO) Premiums", "Cigna Open Access Plan (OAP) PPO Information", "Cigna Open Access Plan (OAP) PPO Premiums", "Cigna Additional Information", "Cigna Partnersite", "SDPEBA/Sharp Classic (HMO) Information", "SDPEBA/Sharp Classic (HMO) Premiums", "SDPEBA/Sharp Select (HMO) Information", "SDPEBA/Sharp Select (HMO) Premiums", "SDPEBA/Sharp Saver Deductible (HMO) Information", "SDPEBA/Sharp Saver Deductible (HMO) Premiums", "POA ALADS California Care Basic (HMO - No Dental) Information", "POA ALADS California Care Basic (HMO - No Dental) Premiums", "POA ALADS California Care Premier (HMO - with Dental) Information", "POA ALADS California Care Premier (HMO - with Dental) Premiums", "Dental Plans (Optional)", "Delta Dental\u00a0(DHMO) Information", "Delta Dental\u00a0(DHMO) Premiums", "Delta Dental (DPO) Information", "Delta Dental (DPO) Premiums", "Delta Dental Additional Information", "Delta Dental Partner Site", "Vision Plans (Optional)", "City VSP Information", "City VSP Premiums", "City VSP Partnersites", "Life Insurance Plans"]
h4: ["Parks", "Outdoors", "Neighborhoods", "Recreational Activities", "Street Maintenance", "Plan", "Fix", "Build", "Programs & Events", "Services", "Kids & Teens", "eCollection", "Police", "Fire-Rescue", "Lifeguards", "City Officials", "City Government"]
I tested this with a naive use of the T5 model (t5-small):
from transformers import pipeline

# naive baseline: feed the raw scraped text straight into t5-small
summarizer = pipeline("summarization", model="t5-small")
example_text = """ ... """  # the scraped entry shown above
summary = summarizer(example_text, max_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])
And got the result: Summary: h2: ["Main navigation", "Leisure", "Resident Resources", "Doing Business", "Libr
Suspecting the punctuation and tag identifiers might be a problem, I removed them and got:
City of San Diego Official Website Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 Main navigation Leisure Resident
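The stripping step above can be sketched as a small helper. This is a minimal sketch; the `flatten_scraped_fields` name is mine, and it assumes each entry is a dict shaped like the example fields above:

```python
import re

def flatten_scraped_fields(entry: dict) -> str:
    """Flatten the collected url/title/header fields into plain text,
    dropping the tag labels and JSON punctuation that confused t5-small."""
    parts = []
    for key in ("url", "html_title", "h1", "h2", "h3", "h4"):
        value = entry.get(key)
        if value is None:
            continue
        if isinstance(value, str):
            parts.append(value)
        else:  # header fields are lists of strings
            parts.extend(value)
    text = " ".join(parts)
    # collapse runs of whitespace (including non-breaking spaces like \u00a0)
    return re.sub(r"\s+", " ", text).strip()
```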
I then gave the original format to ChatGPT with the prompt "Given the following information from a web page, summarize in a single sentence what you think the page is:"
For GPT 3.5:
The webpage appears to be from the City of San Diego's official website and provides information on Flexible Benefits Plan options for Police Safety Members for fiscal year 2022 and a short plan year 2022.
For GPT 4.0:
The webpage provides details on the Flexible Benefits Plan options available to Police Safety Members for Fiscal Year 2022 and Short Plan Year 2022 in San Diego, covering medical, dental, vision, and life insurance plans.
Obviously, the GPT summaries are the best and the least complicated to set up, but also the most expensive. Back-of-the-envelope math suggests at most $0.05 for a GPT-4 summary (aka $50 for 1000 summaries) and at most $0.005 for a GPT-3.5 summary (aka $5 for 1000 summaries). Being back-of-the-envelope, it's quite possible the actual costs would be cheaper, but that'd take more time to investigate.
There are likely other solutions, but finding them and testing their feasibility would take time.
After investigating more deeply into the OpenAI option, it seems I may have been off by a factor of 10 for GPT 3.5. I ran the above example with a prompt through the following code:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You will receive a set of html content for a web page and provide a json "
                                      "object with two keys: 'summary' (single sentence summary of web page) "
                                      "and 'name' (descriptive name of web page)."},
        {"role": "user", "content": example_text},
    ],
    temperature=0,
)
# response.choices is a list; take the first completion's message content
print(response.choices[0].message.content)
Response was below:
{ "summary": "Explore the flexible benefits plan options for Police Safety Members for FY 2022 and Short Plan Year 2022 on the City of San Diego Official Website.", "name": "Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 | City of San Diego Official Website" }
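Since the system prompt asks for a JSON object, the reply can be parsed before storing it. A sketch (the `parse_summary_reply` name is mine, and it assumes the model actually returns valid JSON, which isn't guaranteed):

```python
import json

def parse_summary_reply(raw_reply: str) -> tuple:
    """Parse the model's JSON reply into (summary, name).

    Raises ValueError if the reply is not the expected JSON object.
    """
    try:
        data = json.loads(raw_reply)
        return data["summary"], data["name"]
    except (json.JSONDecodeError, KeyError) as exc:
        raise ValueError(f"Unexpected model reply: {raw_reply!r}") from exc
```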
Total input tokens: 730; total output tokens: 70
Cost of input tokens ($0.0005/1K tokens): 730/1000 × $0.0005 = $0.000365
Cost of output tokens ($0.0015/1K tokens): 70/1000 × $0.0015 = $0.000105
$0.000365 + $0.000105 = $0.00047 per call
Assuming we made 1000 similar calls: $0.00047 × 1000 calls = $0.47
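The arithmetic above generalizes to a small helper for estimating per-call cost. A sketch; the function name is mine, and the default rates mirror the gpt-3.5-turbo per-1K-token prices quoted above, which may change:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_rate: float = 0.0005, output_rate: float = 0.0015) -> float:
    """Estimate USD cost of one call at per-1K-token rates."""
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
```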
We can probably further reduce the number of tokens by requiring shorter outputs and/or trimming the fat from the HTML content provided.
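One possible way to trim the fat before sending: cap the number of headers per level, since pages like the San Diego example carry dozens of h3s. A sketch with a hypothetical cutoff; the right number would need testing:

```python
def trim_headers(entry: dict, max_per_level: int = 10) -> dict:
    """Return a copy of the scraped entry keeping at most max_per_level
    headers per level, to cut input tokens before the API call."""
    trimmed = dict(entry)
    for key in ("h1", "h2", "h3", "h4"):
        if isinstance(trimmed.get(key), list):
            trimmed[key] = trimmed[key][:max_per_level]
    return trimmed
```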
@maxachis thanks for doing the initial testing and groundwork. Since we're already going to be sending things through the Hugging Face pipeline, could we pick a model there instead? There are a bunch of text classification models there. We could pretrain our own, or use an existing one.
random thought: rather than removing punctuation and headers, can we just explain "the page was scraped for the following meta and header content"? Seems more straightforward, in a way.
re: your points above, we can also have the model omit names and summaries where it thinks the record_type and agency_described are sufficient. Sometimes they are—"Jail roster for Allegheny County Jail" is tough to improve upon as a name.
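Both ideas here (explaining that the text was scraped, and letting the model skip redundant names/summaries) could be encoded in the prompt. A sketch with a hypothetical `build_messages` helper, assuming record_type and agency_described are available as strings:

```python
def build_messages(scraped_text: str, record_type: str, agency: str) -> list:
    """Build a chat prompt that explains the scraped content and lets the
    model return nulls when the known fields already describe the page."""
    system = (
        "The following was scraped from a web page's meta and header tags. "
        f"Known fields: record_type={record_type!r}, agency_described={agency!r}. "
        "Return a JSON object with keys 'summary' and 'name'; use null for "
        "either key if the known fields already describe the page sufficiently."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": scraped_text},
    ]
```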
> @maxachis thanks for doing the initial testing and groundwork. Since we're already going to be sending things through the Hugging Face pipeline, could we pick a model there instead? There are a bunch of text classification models there. We could pretrain our own, or use an existing one.
I’m skittish about doing so, for a few reasons:
Let me know your thoughts, @josh-chamberlain
@maxachis you can feel free to use ChatGPT with an API call, since that's faster.
eventually, we may need to use our own LLM, so:
Context
#49 should come first
Requirements
As part of the data source identification pipeline, create these text fields for each data source automatically:
submitted_name
description
Suggested path
HTML tag collector