CouncilDataProject / cdp-backend

Data storage utilities and processing pipelines used by CDP instances.
https://councildataproject.org/cdp-backend
Mozilla Public License 2.0

Add `matter_full_text_uri` property to `Matter` definition and impl #162

Open evamaxfield opened 2 years ago

evamaxfield commented 2 years ago

While looking through the frontend and design work on the legislation tracking project, I realized that we may be missing a crucial piece of information: a link to the actual full matter text.

Matter is currently defined as: link. But a matter definitely has full text, and we should store a link to it. I propose full_text_uri or some variant of that.

Additionally, while we look into this, it would be good to investigate which MatterFiles make it through the pipeline: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/pipeline/event_gather_pipeline.py#L1444

I think the above try-except block may be dropping some MatterFile / MinutesItemFile attachments that would be useful to keep and so we may want to try to fix it if we do see that behavior.

dphoria commented 2 years ago

What would be appropriate for a matter's full_text_uri in this example? (I'm presuming the corresponding ingestion_models.Matter must change as well.)

https://seattle.legistar.com/MeetingDetail.aspx?ID=930274&GUID=903D2508-9840-4878-8334-1AEF77335BB8

https://gist.github.com/dphoria/3134769fe44686a82fdca2a55b822397

I will take a look later myself. Just wanted to start the question / conversation.

evamaxfield commented 2 years ago

Great question! Yes, the ingestion model would need to be updated as well to add the same property / attribute.
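As a rough illustration of that change, here is a minimal sketch of an ingestion-style model with the proposed attribute. The `Matter` class below is a simplified stand-in written for this example, not the real `ingestion_models.Matter`; its other fields and the placeholder title are assumptions. Only the `full_text_uri` name comes from this thread.

```python
from typing import NamedTuple, Optional


class Matter(NamedTuple):
    """Simplified stand-in for an ingestion model (not the real cdp-backend class)."""

    name: str
    title: str
    # Proposed in this issue: a link to the full text of the matter.
    # Optional with a None default so existing scrapers would not need to change.
    full_text_uri: Optional[str] = None


# Example using the Legistar legislation detail page linked in this thread.
matter = Matter(
    name="CB 120263",
    title="Example council bill",  # placeholder title, not the real one
    full_text_uri=(
        "https://seattle.legistar.com/LegislationDetail.aspx"
        "?ID=5448143&GUID=4F8010D6-BEBB-46AF-BE22-F579AD681B68&Options=&Search="
    ),
)
```

Making the attribute optional is a design choice for the sketch: scrapers that cannot find a full-text link could simply omit it.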

Taking this meeting from legistar: https://seattle.legistar.com/MeetingDetail.aspx?ID=929921&GUID=3EB77948-2243-425A-9864-8CD868B96048&Options=&Search=

And selecting the first council bill (CB 120263), we get to: https://seattle.legistar.com/LegislationDetail.aspx?ID=5448143&GUID=4F8010D6-BEBB-46AF-BE22-F579AD681B68&Options=&Search=

I think what we want is really just a link to that page, since it has the full details. But if we wanted to get even more specific, I would say clicking "Reports" and then clicking "Legislation Text" (or really any of the options) gives us more of a "document view" like this: https://seattle.legistar.com/ViewReport.ashx?M=R&N=Text&GID=393&ID=4717976&GUID=660120D3-9C6F-4314-AFC7-A44217E71237&Title=Legislation+Text

evamaxfield commented 2 years ago

This is really a bigger deal because currently we don't even store that info in CDP at all. Here is the corresponding meeting page for that meeting on Seattle staging: http://councildataproject.org/seattle-staging/#/events/f3351cc9822f

Notice that the minutes item CB 120263 doesn't have any attachments / documents.

isaacna commented 2 years ago

Do we need a separate field for the full text, or could it just be another MatterFile? If we want to handle the full text differently in the UI than other MatterFiles, then I'm all for adding full_text_uri, but otherwise I think it could be another MatterFile.

> I think the above try-except block may be dropping some MatterFile / MinutesItemFile attachments that would be useful to keep and so we may want to try to fix it if we do see that behavior.

For this, it's most likely failing due to a connection timeout or an error when making an HTTP request. Since the only validation run on MatterFile is resource_exists, I think it has to be one of these two.
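One way to confirm which of the two it is would be to record why each file is dropped instead of discarding it silently. The sketch below is not the pipeline's actual code: `MatterFile` is a simplified stand-in, `filter_files` is a hypothetical helper, and the existence check is passed in as a plain callable so the example does not depend on the real `resource_exists` signature.

```python
import logging
from typing import Callable, List, NamedTuple, Tuple

log = logging.getLogger(__name__)


class MatterFile(NamedTuple):
    """Simplified stand-in, not the real cdp-backend model."""

    name: str
    uri: str


def filter_files(
    files: List[MatterFile],
    exists: Callable[[str], bool],
) -> Tuple[List[MatterFile], List[Tuple[MatterFile, str]]]:
    """Split files into kept and dropped, keeping a reason for every drop."""
    kept: List[MatterFile] = []
    dropped: List[Tuple[MatterFile, str]] = []
    for f in files:
        try:
            if exists(f.uri):
                kept.append(f)
            else:
                dropped.append((f, "resource not found"))
        except Exception as e:
            # Timeouts and other request errors land here; keep the error type
            # so the two failure modes can be told apart afterwards.
            dropped.append((f, f"{type(e).__name__}: {e}"))

    for f, reason in dropped:
        log.warning("Dropping file %r (%s): %s", f.name, f.uri, reason)
    return kept, dropped
```

Running this over one event's files and reading the warnings would show whether the drops are timeouts, other request errors, or resources that really are missing.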

evamaxfield commented 2 years ago

> Do we need a separate field for the full text, or could it just be another MatterFile? If we want to handle the full text differently in the UI than other MatterFiles, then I'm all for adding full_text_uri, but otherwise I think it could be another MatterFile.

I guess we could add this as a MatterFile, but then I feel like we would need to add a type attribute or something similar: something to signify what each MatterFile represents (i.e. just a report, an amendment, or the bill text).

isaacna commented 2 years ago

> I guess we could add this as a MatterFile, but then I feel like we would need to add a type attribute or something similar: something to signify what each MatterFile represents (i.e. just a report, an amendment, or the bill text).

Since the full text URI is kinda distinct from other MatterFiles, I think we could just add full_text_uri to Matter (this also saves us a query if we want to fetch it for a specific Matter).

Unless there are very discrete categories that we could classify MatterFile into, I don't think we need MatterFile.type and name would be sufficient.
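To make the trade-off concrete, here is a small in-memory sketch of the two options. Both classes are simplified stand-ins rather than the real database models, `full_text_from_files` is a hypothetical helper, and the assumption that the full-text file would be named "Legislation Text" is borrowed from the Legistar report title mentioned earlier in the thread.

```python
from typing import List, NamedTuple, Optional


class MatterFile(NamedTuple):
    """Simplified stand-in: a file attached to a matter."""

    matter_name: str
    name: str
    uri: str


class Matter(NamedTuple):
    """Simplified stand-in with the proposed attribute."""

    name: str
    full_text_uri: Optional[str] = None


def full_text_from_files(matter_name: str, files: List[MatterFile]) -> Optional[str]:
    """Option A: find the full text among the files by relying on a naming convention."""
    for f in files:
        if f.matter_name == matter_name and f.name == "Legislation Text":
            return f.uri
    return None


files = [
    MatterFile("CB 120263", "Summary", "https://example.com/summary"),
    MatterFile("CB 120263", "Legislation Text", "https://example.com/text"),
]

# Option A needs a second lookup over the files and depends on the file name.
uri_a = full_text_from_files("CB 120263", files)

# Option B reads the link straight off the matter that was already fetched.
uri_b = Matter("CB 120263", full_text_uri="https://example.com/text").full_text_uri
```

In a real database, option A would correspond to an extra query against the file collection, which is the cost mentioned above.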

evamaxfield commented 2 years ago

Yeah, the benefit to query time is also a major plus.