codeforsanjose / city-agenda-scraper


Agenda item and attachment link extraction from pdf text #35

Closed swotai closed 3 years ago

swotai commented 3 years ago

Based on the output of #34, extract corresponding agenda items and attachment links.

This work is very similar to what Roland presented previously. If we get permission to use their code, that would be great. Otherwise, we can implement it ourselves.
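As a rough idea of what the extraction could look like, here is a minimal sketch in Python. It assumes agenda items start on a line with a number such as `1.` or `2.1`, and that attachment links appear as plain `http(s)` URLs in the extracted text. Both patterns and the function name `extract_items` are hypothetical and would need adjusting to the real output of #34.

```python
import re

# Assumed item format: a line starting with a number like "1." or "2.1", then a title.
ITEM_RE = re.compile(r"^(?P<number>\d+(?:\.\d+)*)\.?\s+(?P<title>\S.*)$")
# Assumed link format: plain http(s) URLs, stopping at whitespace or closing brackets.
URL_RE = re.compile(r"https?://[^\s)>\]]+")


def extract_items(text):
    """Group the URLs found in the text under the most recent agenda item."""
    items = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = ITEM_RE.match(line)
        if match:
            # A new numbered line starts a new agenda item.
            current = {
                "number": match.group("number"),
                "title": match.group("title"),
                "links": [],
            }
            items.append(current)
        if current is not None:
            # Links before the first item are ignored in this sketch.
            current["links"].extend(URL_RE.findall(line))
    return items


sample = (
    "1. Call to Order\n"
    "2.1 Approve minutes\n"
    "Attachment: https://example.org/minutes.pdf\n"
    "(see https://example.org/memo.pdf)\n"
)
items = extract_items(sample)
```

With the sample text, the second item ends up with both links, while the first item has none.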

xconnieex commented 3 years ago

It's not clear to me how they are using the subID and mainID columns. I think that's something else we need to look into and understand. mainID seems to correspond to headings within the agenda, which the scraping team currently doesn't grab, since they only get the items that have an attachment file in Legistar. The concept of subID is similar to what we discussed and could work, but we may need an additional level of subID to account for items with multiple attachments.

[image]

Our version: [image]

The version I mocked up seems comprehensive, but it also seems to store a lot of redundant data because each item can have multiple attachments. Maybe there's a different way to do it? Roland's way looks simpler because, as far as I can see, all agenda items are associated with only the one PDF.
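One possible way to cut the redundancy, sketched in Python: store each agenda item once with a list of its attachments, and only produce flat rows (with an extra attachment index acting as the additional subID level) when a table is needed. The class names, field names, and the `flatten` helper are all hypothetical; only mainID and subID come from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class Attachment:
    name: str
    url: str


@dataclass
class AgendaItem:
    main_id: int   # heading within the agenda
    sub_id: int    # item under that heading
    title: str
    attachments: list = field(default_factory=list)


def flatten(items):
    """Return one row per attachment: (main_id, sub_id, attachment_no, url)."""
    rows = []
    for item in items:
        for attachment_no, attachment in enumerate(item.attachments, start=1):
            rows.append((item.main_id, item.sub_id, attachment_no, attachment.url))
    return rows


agenda = [
    AgendaItem(2, 1, "Approve minutes", [
        Attachment("Minutes", "https://example.org/minutes.pdf"),
        Attachment("Memo", "https://example.org/memo.pdf"),
    ]),
    AgendaItem(3, 1, "Adjourn"),  # no attachments, still stored once
]
rows = flatten(agenda)
```

The item title is stored once per item rather than once per attachment, and items without attachments simply produce no attachment rows.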

If the scraping team can provide all the relevant PDFs, we can try Roland's method and match up what we find using regex against what the scraping team gets from the attachments, so we have both sources. Otherwise, we could try to get to the attachments ourselves through the PDFs.
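The "match up" step could be sketched like this in Python, assuming both sources can be compared by attachment URL. The record shapes (`number`/`links` for regex results, `name`/`url` for scraped attachments) and the function names are assumptions for illustration, not the scraping team's actual format.

```python
def normalize(url):
    """Trim whitespace and trailing punctuation so equivalent links compare equal."""
    return url.strip().rstrip(".,;/")


def match_sources(pdf_items, scraped):
    """Split links into matched, found only in the PDF, and found only by the scraper."""
    by_url = {normalize(rec["url"]): rec for rec in scraped}
    matched, pdf_only = [], []
    seen = set()
    for item in pdf_items:
        for link in item["links"]:
            key = normalize(link)
            seen.add(key)
            rec = by_url.get(key)
            if rec is not None:
                matched.append((item["number"], rec["name"]))
            else:
                pdf_only.append((item["number"], link))
    scraped_only = [rec["name"] for rec in scraped if normalize(rec["url"]) not in seen]
    return matched, pdf_only, scraped_only


pdf_items = [
    {"number": "2.1", "links": ["https://example.org/minutes.pdf."]},
    {"number": "2.2", "links": ["https://example.org/report.pdf"]},
]
scraped = [
    {"name": "Minutes", "url": "https://example.org/minutes.pdf"},
    {"name": "Budget", "url": "https://example.org/budget.pdf"},
]
matched, pdf_only, scraped_only = match_sources(pdf_items, scraped)
```

Anything left in `pdf_only` or `scraped_only` would be a candidate for manual review.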

swotai commented 3 years ago

Not needed anymore, for now.