Quick writeup based on our hack night discussion on 4/23. This is mostly a braindump and needs more refinement before we're ready to go.
Problem
Our ML experts want to be able to reference the Massachusetts General Laws (MGL) to hydrate data for our upcoming LLM-driven bill summaries. If a bill references any sections of the MGL, the model will need to fetch those sections to build the prompt that generates the summary (to ensure we aren't missing critical context).
They are currently doing so by scraping the HTML of the MGL website, but we want a less fragile solution going forward.
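To make the lookup step concrete, here is a minimal sketch of pulling (chapter, section) references out of bill text. It assumes citations are phrased like "Section 3 of chapter 23A of the General Laws"; that phrasing, the regex, and the name `extract_mgl_refs` are illustrative assumptions, and real bills likely cite the MGL in other forms this would miss.

```python
import re

# Assumed citation phrasing: "section <N> of chapter <M>".
# Other citation styles are not handled by this sketch.
MGL_REF = re.compile(
    r"section\s+(?P<section>\d+[A-Z]*)\s+of\s+chapter\s+(?P<chapter>\d+[A-Z]*)",
    re.IGNORECASE,
)


def extract_mgl_refs(bill_text: str) -> list[tuple[str, str]]:
    """Return unique (chapter, section) pairs in order of first appearance."""
    seen: set[tuple[str, str]] = set()
    refs: list[tuple[str, str]] = []
    for match in MGL_REF.finditer(bill_text):
        # Normalize case so "23a" and "23A" count as the same chapter.
        key = (match.group("chapter").upper(), match.group("section").upper())
        if key not in seen:
            seen.add(key)
            refs.append(key)
    return refs
```

The returned pairs are what a caller would pass to the chapter + section lookup described under Success Criteria.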
Proposal
At the beginning of a session, let's scrape the MGL, store the text on our side, and expose it to our LLM wrapper service via a private API. We'll do this instead of relying on the MA Legislature API at runtime because <???> (missing context: are we looking for versioning on this, or just speed?)
Success Criteria
[ ] Worker that scrapes the MGL and stores it in a Firestore collection, accessible by chapter + section
Maybe this doesn't need to be in Firestore? Chapter + Section -> blob of text seems like the relevant mapping. Let's check in with Matt V and Nathan on what they need here and whether this can be simpler.
[ ] Private API endpoint that lets a caller query by Chapter + Section
One bill can reference 50+ sections, so a batch endpoint is likely useful.
The total section text can be pretty large, so we may need to chunk/stream the response.
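A rough sketch of what the batch lookup and chunked response could look like, assuming sections are keyed by (chapter, section) and stored as plain text. The in-memory dict stands in for whatever store we end up with (Firestore or something simpler), and `batch_lookup`, `chunk_response`, and `max_chars` are hypothetical names, not an agreed API.

```python
from typing import Iterator

SectionKey = tuple[str, str]  # (chapter, section), e.g. ("23A", "3")


def batch_lookup(
    store: dict[SectionKey, str], keys: list[SectionKey]
) -> tuple[dict[SectionKey, str], list[SectionKey]]:
    """Return the sections that were found plus the keys that were missing."""
    found = {key: store[key] for key in keys if key in store}
    missing = [key for key in keys if key not in store]
    return found, missing


def chunk_response(
    found: dict[SectionKey, str], max_chars: int
) -> Iterator[dict[SectionKey, str]]:
    """Yield pages of sections, each holding at most max_chars of text.

    A single section longer than max_chars is sent alone in its own page
    rather than being split.
    """
    page: dict[SectionKey, str] = {}
    size = 0
    for key, text in found.items():
        # Start a new page if adding this section would exceed the limit.
        if page and size + len(text) > max_chars:
            yield page
            page, size = {}, 0
        page[key] = text
        size += len(text)
    if page:
        yield page
```

Reporting missing keys separately lets the caller decide whether a summary can proceed without a referenced section. Whether oversized sections should be split mid-text is left open here.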
Open Questions
What is going on with the Journal of Session Laws? Do bills reference those? Do we need to care? What exactly is the timing/process for moving a bill from the Journal of Session Laws to the MGL?
If the MGL web page and the Mass Legislature API disagree, which is correct?
The MGL web page says that it isn't the official version of the MGL - is this a major problem, or are issues caused by this covered by the disclaimer we'll have on LLM summaries anyway?
Quick diagram of my understanding from our hack night discussion on 4/23: