Police-Data-Accessibility-Project / data-source-identification

Scripts for labeling relevant URLs as Data Sources.
MIT License

use NLP model to generate name and description for data sources #43

Open josh-chamberlain opened 4 months ago

josh-chamberlain commented 4 months ago

Context

Requirements

As part of the data source identification pipeline, create these text fields for each data source automatically:

Suggested path

mbodeantor commented 4 months ago

Also worth exploring if there is any semi-consistent way to source a description directly from any of the tags

maxachis commented 4 months ago

> Also worth exploring if there is any semi-consistent way to source a description directly from any of the tags

I mentioned this in #16, but taking information from the home page of any URL -- which my PR #36 aims to do -- is likely to provide additional context, since the home page would be either for the entire police department or for the local government the police department is part of.
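For illustration, the home page of any collected URL can be derived from its root. This is a minimal sketch; `homepage_of` is a hypothetical helper, not code from PR #36:

```python
from urllib.parse import urlparse

def homepage_of(url: str) -> str:
    """Return the root (scheme + host) of a URL, i.e. the site's home page."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(homepage_of("https://www.sandiego.gov/risk-management/flexible-benefits/fbp-police-safety-members-fy2022"))
# → https://www.sandiego.gov/
```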

maxachis commented 4 months ago

I additionally posed the question to ChatGPT about possible options we could take with this, and its answer seemed useful and relevant. https://chat.openai.com/share/c08fcb30-7012-443d-8a2e-0a8d448e05d7

josh-chamberlain commented 4 months ago

@maxachis thanks, I clarified the suggested path portion of the issue to lay out the strategy. Most of the pieces are in place or chosen; it's just a matter of connecting them.

maxachis commented 4 months ago

@josh-chamberlain Do we have ideas of what model to use? I don't have much prior experience in NLP, so I'd definitely defer to someone such as @EvilDrPurple if they have a better idea, but I have begun looking at some existing models that may have promise, such as https://huggingface.co/Falconsai/text_summarization

EvilDrPurple commented 4 months ago

> @josh-chamberlain Do we have ideas of what model to use? I don't have much in the way of prior experience in NLP, so I'd definitely defer to if someone such as @EvilDrPurple has a better idea of what model to use, but I have begun looking at some existing models that may have promise, such as https://huggingface.co/Falconsai/text_summarization

@maxachis I have not used any models for summarization yet, but if you haven't already found this page it may be of some help. It lists models that can be used for summarization near the top: https://huggingface.co/docs/transformers/tasks/summarization

maxachis commented 4 months ago

So doing some preliminary research on this (and bearing in mind that my NLP experience is quite limited), here are my initial thoughts:

  1. Utilizing information from the HTML tag collectors will give us a number of short sentences and text fragments. Combined, these can help with determining context, but not all NLP models are designed for dealing with disparate fragments -- they're often designed for summarizing paragraphs of text where everything can be assumed to be much more tightly related.
  2. Bearing 1 in mind, the best models to use in this case would probably be GPT or BERT-based, which are better at taking into account larger context.
  3. GPT would probably be the easiest way to do this -- I previously used calls to OpenAI's LLM API in my work on the "Law Reading Robot", where I used it to summarize laws -- but would also be the most expensive.
  4. More content means better summaries. Not all web pages have a particularly dense amount of information in the tags that we've included in our tag collector, so quality could vary substantially. As an extreme example, this page, which is a jail roster, has its most relevant information conveyed in a tabular format, rather than in headings or header content.
  5. With 4 in mind, it may be useful, at least in cases where we wouldn't have high confidence in the information our current tags provide, to gather more text from the web page. One could imagine a two-level summary process: web pages where we can confidently generate summaries from the initial tags would be accepted, and web pages where confidence in quality is substantially lower would be given a second pass where additional information is gathered.

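The two-level process in point 5 could be sketched roughly like this. Everything here is an assumption about the design, not the project's actual code; the stub functions stand in for real model calls, and the length-based confidence check is a placeholder heuristic:

```python
CONFIDENCE_THRESHOLD = 200  # placeholder: minimum characters of tag text to trust

def summarize(text: str) -> str:
    # Stub for a real summarization call (t5, GPT, etc.)
    return text[:60]

def fetch_full_text(url: str) -> str:
    # Stub for a second-pass scrape of the page body
    return "(full page text for %s)" % url

def summarize_source(tag_text: str, url: str) -> str:
    if len(tag_text) >= CONFIDENCE_THRESHOLD:
        return summarize(tag_text)  # first pass: tags alone are enough
    # second pass: low confidence, gather more text from the page itself
    return summarize(tag_text + " " + fetch_full_text(url))
```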
maxachis commented 4 months ago

@josh-chamberlain I tested the following entry on several use cases. The below is entry 467 from PDAP/urls-and-headers:

url: https://www.sandiego.gov/risk-management/flexible-benefits/fbp-police-safety-members-fy2022    
html_title: Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 | City of San Diego Official Website  
h1: ["City of San Diego Official Website", "Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022"] 
h2: ["Main navigation", "Leisure", "Resident Resources", "Doing Business", "Library", "Public Safety", "City Hall", "Accessibility Tools", "FBP Credits", "FBP Options", "Sharp Plan Additional Information", "Services", "Contact Info", "Orientation Materials", "Additional Resources", "Forms", "Footer"]   
h3: ["Medical Plans", "Kaiser Permanente Traditional (HMO) Information", "Kaiser Permanente Traditional (HMO) Premiums", "Kaiser Permanente Deductible (HMO) Information", "Kaiser Permanente Deductible (HMO) Premiums", "Kaiser Partner Site", "Kaiser Additional Information", "Cigna (HMO) Information", "Cigna (HMO) Premiums", "Cigna Scripps Select (HMO) Premiums", "Cigna Open Access Plan (OAP) PPO Information", "Cigna Open Access Plan (OAP) PPO Premiums", "Cigna Additional Information", "Cigna Partnersite", "SDPEBA/Sharp Classic (HMO) Information", "SDPEBA/Sharp Classic (HMO) Premiums", "SDPEBA/Sharp Select (HMO) Information", "SDPEBA/Sharp Select (HMO) Premiums", "SDPEBA/Sharp Saver Deductible (HMO) Information", "SDPEBA/Sharp Saver Deductible (HMO) Premiums", "POA ALADS California Care Basic (HMO - No Dental) Information", "POA ALADS California Care Basic (HMO - No Dental) Premiums", "POA ALADS California Care Premier (HMO - with Dental) Information", "POA ALADS California Care Premier (HMO - with Dental) Premiums", "Dental Plans (Optional)", "Delta Dental\u00a0(DHMO) Information", "Delta Dental\u00a0(DHMO) Premiums", "Delta Dental (DPO) Information", "Delta Dental (DPO) Premiums", "Delta Dental Additional Information", "Delta Dental Partner Site", "Vision Plans (Optional)", "City VSP Information", "City VSP Premiums", "City VSP Partnersites", "Life Insurance Plans"]    
h4: ["Parks", "Outdoors", "Neighborhoods", "Recreational Activities", "Street Maintenance", "Plan", "Fix", "Build", "Programs & Events", "Services", "Kids & Teens", "eCollection", "Police", "Fire-Rescue", "Lifeguards", "City Officials", "City Government"]

I tested this on a naive use of the t5-small model:

```python
from transformers import pipeline

# Naive first attempt: feed the raw tag text straight into a small summarizer
summarizer = pipeline("summarization", model="t5-small")
example_text = """ ... """  # the tag content shown above
summary = summarizer(example_text, max_length=30, do_sample=False)
print("Summary:", summary[0]["summary_text"])
```

And got the result: Summary: h2: ["Main navigation", "Leisure", "Resident Resources", "Doing Business", "Libr

Assuming the punctuation and tag identifiers might be a problem, I removed them and got

City of San Diego Official Website Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 Main navigation Leisure Resident
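The punctuation and tag-identifier stripping could look something like the following. `tags_to_text` is a hypothetical helper written against the urls-and-headers row format shown above, not the project's actual cleaning code:

```python
import json

def tags_to_text(row: dict) -> str:
    """Flatten a urls-and-headers row into plain text by dropping the tag
    labels and list punctuation, keeping only the heading strings."""
    parts = [row.get("html_title", "")]
    for key in ("h1", "h2", "h3", "h4"):
        value = row.get(key, [])
        if isinstance(value, str):  # headings may arrive as JSON-encoded strings
            value = json.loads(value)
        parts.extend(value)
    return " ".join(p for p in parts if p)

print(tags_to_text({
    "html_title": "Flexible Benefits Plan Options | City of San Diego Official Website",
    "h1": ["City of San Diego Official Website"],
}))
```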

I then gave the original format to ChatGPT with the prompt "Given the following information from a web page, summarize in a single sentence what you think the page is":

For GPT 3.5:

The webpage appears to be from the City of San Diego's official website and provides information on Flexible Benefits Plan options for Police Safety Members for fiscal year 2022 and a short plan year 2022.

For GPT 4.0:

The webpage provides details on the Flexible Benefits Plan options available to Police Safety Members for Fiscal Year 2022 and Short Plan Year 2022 in San Diego, covering medical, dental, vision, and life insurance plans.

Obviously, the GPT summaries are the best and the least complicated to set up, but also the most expensive. Back-of-the-envelope math suggests at most $0.05 per GPT-4 summary ($50 for 1000 summaries) and at most $0.005 per GPT-3.5 summary ($5 for 1000 summaries). Being back-of-the-envelope, it's quite possible the actual costs would be cheaper, but that would take more time to investigate.

There are likely other solutions, but finding them and testing their feasibility would take time.

maxachis commented 4 months ago

After investigating more deeply into the OpenAI option, it seems I may have been off by a factor of 10 for GPT 3.5. I ran the above example with a prompt through the following code:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You will receive a set of html content for a web page and provide a json "
                                      "object with two keys: 'summary' (single sentence summary of web page) "
                                      "and 'name' (descriptive name of web page)."},
        {"role": "user", "content": example_text},  # example_text as defined above
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```

Response was below:

{ "summary": "Explore the flexible benefits plan options for Police Safety Members for FY 2022 and Short Plan Year 2022 on the City of San Diego Official Website.", "name": "Flexible Benefits Plan Options for Police Safety Members FY 2022 and Short Plan Year 2022 | City of San Diego Official Website" }

Total input tokens: 730; total output tokens: 70

Cost of input tokens ($0.0005/1K tokens): 730/1000 × 0.0005 = $0.000365
Cost of output tokens ($0.0015/1K tokens): 70/1000 × 0.0015 = $0.000105

$0.000365 + $0.000105 = $0.00047 per call

Assuming we made 1000 similar calls: $0.00047 × 1000 calls = $0.47
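That arithmetic can be wrapped in a small estimator. The rates are the gpt-3.5-turbo prices quoted above; treat them as a point-in-time assumption, since OpenAI's pricing changes:

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float = 0.0005, out_rate: float = 0.0015) -> float:
    """Per-call cost in dollars, given per-1K-token rates."""
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

cost = call_cost(730, 70)
print(round(cost, 6))         # 0.00047 per call
print(round(cost * 1000, 2))  # 0.47 for 1000 calls
```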

We can probably further reduce the amount of tokens through requiring shorter outputs and/or trimming the fat on the html content provided.

josh-chamberlain commented 3 months ago

@maxachis thanks for doing the initial testing and groundwork. Since we're already going to be sending things through a Hugging Face pipeline, could we pick a model there instead? There are a bunch of text classification models there. We could pretrain our own or use an existing one.

random thought: rather than removing punctuation and headers, can we just explain "the page was scraped for the following meta and header content"? Seems more straightforward, in a way.
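That framing could be as simple as prepending a fixed preamble instead of preprocessing the text. The exact wording here is an assumption, not a settled prompt:

```python
# Hypothetical prompt framing: tell the model what the input is rather than
# stripping punctuation and tag identifiers from it.
PREAMBLE = (
    "The page was scraped for the following meta and header content. "
    "Summarize in a single sentence what you think the page is:\n\n"
)

def build_prompt(tag_text: str) -> str:
    return PREAMBLE + tag_text

print(build_prompt("html_title: Example Page")[:40])
```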

re: your points above, we can also have the model omit names and summaries where it thinks the record_type and agency_described are sufficient. Sometimes they are—"Jail roster for Allegheny County Jail" is tough to improve upon as a name.

maxachis commented 3 months ago

> @maxachis thanks for doing the initial testing and groundwork. Since we're already going to be sending things through hugging face pipeline, could we pick a model there instead? There are a bunch of text classification models there. We could pretrain our own, or use an existing one.

I’m skittish about doing so, for a few reasons:

  1. LLM options on Hugging Face often require more setup, storage, and computing power than simply making a call to an API as with OpenAI or other providers. This tutorial alone advises that LLMs are “resource-intensive and should be executed on a GPU for adequate throughput”.
  2. Alternatives to LLMs, such as BART, do not perform as well on abstractive summarization tasks, and are likely to perform even worse here because the content is in disparate html tags rather than naturally flowing paragraphs.
  3. It will take time to learn how to properly set up these models even if I do use them, compared to the plug-and-play nature of an API call. Pretraining would take even more time and require volunteers and/or additional infrastructure. If we have a simpler option that can be set up more easily (even if it is pricier per call), it may be worthwhile simply because it spares us the substantial up-front costs, which could ultimately exceed the cumulative per-call cost of an API.

Let me know your thoughts, @josh-chamberlain

josh-chamberlain commented 3 months ago

@maxachis you can feel free to use ChatGPT with an API call since that's faster.

eventually, we may need to use our own LLM, so:

  1. we can spin up a GPU on digital ocean any time
  2. noted
  3. sure, we can iterate but starting fast makes sense