Agenda item and attachment link extraction from pdf text

swotai commented 3 years ago

Based on the output of #34, extract corresponding agenda items and attachment links.

This work is very similar to what Roland has presented previously. If we get permission to use their code, that would be great. Otherwise, we can implement it ourselves.

- [ ] Extract agenda items using regex. This may involve first mapping out all the relevant generalized agenda items.
- [ ] Define the data structure of the output. This is also very similar to what Roland demoed on July 29th.
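
As a rough illustration of the regex step, here is a minimal sketch. It assumes agenda items start a line with a dotted number (`2`, `2.1`, ...) and that attachment links appear as plain URLs on the same line. The sample text, `ITEM_RE`, `URL_RE`, and `extract_items` are all made up for illustration; the real output of #34 may look quite different.

```python
import re

# Hypothetical agenda text; the real layout produced by #34 may differ.
SAMPLE = """\
1. CALL TO ORDER
2. CONSENT CALENDAR
2.1 Approval of Minutes https://example.org/minutes.pdf
2.2 Budget Report https://example.org/budget.pdf https://example.org/memo.pdf
"""

# Assumed pattern: a dotted item number at the start of a line, then the item text.
ITEM_RE = re.compile(r"^(?P<number>\d+(?:\.\d+)*)\.?\s+(?P<rest>.+)$", re.MULTILINE)
# Assumed pattern: attachment links are bare http(s) URLs.
URL_RE = re.compile(r"https?://\S+")


def extract_items(text):
    """Return one dict per agenda item with its number, title, and links."""
    items = []
    for match in ITEM_RE.finditer(text):
        rest = match.group("rest")
        links = URL_RE.findall(rest)
        # Whatever is left after removing the URLs is treated as the title.
        title = URL_RE.sub("", rest).strip()
        items.append({"number": match.group("number"), "title": title, "links": links})
    return items
```

Calling `extract_items(SAMPLE)` yields four items; headings such as "CALL TO ORDER" come back with an empty `links` list, so they are kept even when no attachment exists.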

xconnieex commented 3 years ago

It's not clear to me how they are using the subID and mainID columns. I think that's something else we need to look into and try to understand. mainID seems to correspond to headings within the agenda, which the scraping team currently doesn't grab, since they only get the items that have an attachment file in Legistar. However, the concept of subID is similar to what we discussed and could work, but we may need an additional level of subID to account for items with multiple attachments.

Our version:

The version I mocked up seems comprehensive, but it also seems to store a lot of redundant data because an item can have multiple attachments. Maybe there's a different way to do it? Roland's way looks simpler because, as far as I can see, all agenda items are associated with only one PDF.
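
To make the redundancy concern concrete, here is a minimal sketch of one possible alternative: keep each agenda item once with a nested list of attachments, and only flatten to one row per attachment when a table is needed. The field names `main_id` and `sub_id` mirror the mainID and subID columns mentioned above; `attachment_id`, `Attachment`, `AgendaItem`, and `flatten` are hypothetical names, not anything from Roland's code or our mock-up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attachment:
    name: str
    url: str


@dataclass
class AgendaItem:
    main_id: int   # assumed: heading within the agenda
    sub_id: int    # assumed: item under that heading
    title: str
    attachments: List[Attachment] = field(default_factory=list)


def flatten(items: List[AgendaItem]) -> List[dict]:
    """Expand nested items into one row per attachment (the redundant form)."""
    rows = []
    for item in items:
        base = {"main_id": item.main_id, "sub_id": item.sub_id, "title": item.title}
        if not item.attachments:
            # Items without attachments still get one row.
            rows.append({**base, "attachment_id": None, "url": None})
            continue
        for idx, att in enumerate(item.attachments, start=1):
            # attachment_id is the hypothetical extra level below sub_id.
            rows.append({**base, "attachment_id": idx, "url": att.url})
    return rows
```

In the nested form the title is stored once per item; the flat rows repeat it for every attachment, which is the duplication described above.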

If the scraping team can provide all the relevant PDFs, we can try to use Roland's method and "match up" what we find using regex with what the scraping team gets from the attachments, so we have both sources. Otherwise, we could try to get to the attachments ourselves through the PDFs.
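
A minimal sketch of what that "match up" step might look like, assuming both sources carry an item title we can compare after normalizing case and whitespace. The dict keys (`title`, `item_title`, `url`) and the functions `normalize` and `match_sources` are assumptions for illustration; the scraping team's actual output format may not include a title at all.

```python
def normalize(title):
    """Lowercase and collapse whitespace so minor formatting differences match."""
    return " ".join(title.lower().split())


def match_sources(pdf_items, scraped_attachments):
    """Pair regex-extracted items with scraped attachments by normalized title.

    pdf_items: list of dicts with a "title" key (from the PDF text).
    scraped_attachments: list of dicts with "item_title" and "url" keys (assumed).
    Returns (matched, unmatched) lists of items.
    """
    urls_by_title = {}
    for att in scraped_attachments:
        urls_by_title.setdefault(normalize(att["item_title"]), []).append(att["url"])

    matched, unmatched = [], []
    for item in pdf_items:
        urls = urls_by_title.get(normalize(item["title"]))
        if urls:
            matched.append({**item, "links": urls})
        else:
            # Likely a heading or an item with no attachment in Legistar.
            unmatched.append(item)
    return matched, unmatched
```

Exact title matching is probably too strict for real agendas; the unmatched list is where fuzzier matching or manual review would come in.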

swotai commented 3 years ago

Not needed anymore, for now.

codeforsanjose / city-agenda-scraper

Agenda item and attachment link extraction from pdf text #35