Metro-Records / la-metro-councilmatic

:metro: An instance of councilmatic for LA Metro
MIT License

Scrape and display hyperlink attachments #484

Open jmithani opened 5 years ago

jmithani commented 5 years ago

per @shrayshray:

Could you check to see whether reports that have hyperlink attachments are showing up on the Councilmatic site? In Legistar, a person can use “Attachments” to attach a link instead of a file. We hadn’t addressed this before, and I’m not sure how often people do it - but they’re likely to do so more often in the future, so I wanted to confirm whether Councilmatic is already prepared to handle it. I created a test report for you to look at in the API: File ID 2018-0518 / MatterID 5249

jmithani commented 5 years ago

Looking over python-legistar-scraper, it appears as though all attachments will be grabbed and saved.

https://github.com/opencivicdata/python-legistar-scraper/blob/a9cea04cb226519aaeed6bfa3a9b3ed6f44e90d3/legistar/bills.py#L285-L298

Steps:

  1. Make sure the hyperlink attachment is being scraped.
  2. Make sure the hyperlink attachment is saved in the LAMetroBill object.
  3. Format the hyperlink to its own PDF for saving in the metro-pdf-merger process.
  4. Display link as attachment on bill page.

Question: are there other instances of attachments not being PDFs (or something convertible to PDF)? What is done in that case? cc @hancush

jmithani commented 5 years ago

Talked with @hancush about this. We have a few more questions to investigate.

Additional things to do:

hancush commented 5 years ago

@jmithani it looks like the scraper interpreted the bill you linked to as a private bill, i.e., we only scrape the bare minimum, which excludes attachments.

to truly test this out, we need a public test bill. (or to find an existing public bill with non-file hyperlink attachments.)

hancush commented 5 years ago

shelly updated the matter body name to "z test z" and made the test bill viewable in InSite, so we can scrape it.

jmithani commented 5 years ago

And it was scraped! https://ocd.datamade.us/ocd-bill/29060a5c-4ed2-4efa-9240-59d8fdb9d478/

The attachment is saved as a link, but its media_type is classified as application/pdf. I'm going to check the metro-pdf-merger logs for errors, and also investigate how the media_type is assigned.
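For reference while investigating, here is a minimal sketch of one way a media type could be guessed from an attachment URL instead of defaulting to application/pdf. This is not how the scraper currently assigns media_type. The helper name `guess_media_type` and the choice to return `None` for extensionless links are assumptions made for illustration.

```python
import mimetypes
from urllib.parse import urlparse


def guess_media_type(url):
    """Hypothetical helper: guess a media type from the URL's file extension.

    Returns None when the path has no recognizable extension, which is
    typical of a hyperlink that points at a webpage rather than a file.
    """
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(url).path
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type


if __name__ == "__main__":
    # example.com URLs are placeholders, not real attachments.
    print(guess_media_type("https://example.com/files/report.pdf"))
    print(guess_media_type("https://example.com/about"))
```

An extension-based guess is cheap but can be wrong when a link redirects or serves a file from an extensionless path, so it would only be a first pass.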

jmithani commented 5 years ago

@shrayshray I was able to confirm that the link attachment did not make it into the Councilmatic website after being scraped. Now that we have a problem to fix, how would you prioritize this in the project board?

jmithani commented 5 years ago

hm, metro-pdf-merger seems to think the merge was successful

[September 06, 2019 - 06:55:27][INFO] tasks.py:line:118 | Successful merge! ocd-bill-29060a5c-4ed2-4efa-9240-59d8fdb9d478

The reason I said it didn't make it into Councilmatic is that I re-imported my data locally, including restricted-view items, and there weren't any attachments.

Edit: It exists! metro-pdf-merger.datamade.us/document/ocd-bill-29060a5c-4ed2-4efa-9240-59d8fdb9d478

So when there's a link, it looks like the merger prints the webpage to a PDF.

shrayshray commented 5 years ago

@jmithani I moved this to the Icebox for now; fixing it is lower priority than other tasks lined up at the moment.

hancush commented 1 year ago

Most common use case: Link to documents too large to attach. Ensure hyperlinked attachments are retrieved and included in packets.

hancush commented 1 year ago

Shelly can provide examples of board reports w/ hyperlinked attachments. I'll describe the current behavior of our scraper.

shrayshray commented 1 year ago

2015-1601 (Attachment A) 2022-0098 (Attachment I)

hancush commented 1 year ago

@shrayshray The scraper associates attachments of all types with a board report, then the merger retrieves each attachment and compiles them with the text of the board report into a packet. So, unless I'm misunderstanding, this should be behaving as expected. See the attachments listed and packet compiled for https://boardagendas.metro.net/board-report/2022-0098/

shrayshray commented 1 year ago

@hancush great news - yes, this is behaving as expected when the destination of the hyperlink is a PDF. In some cases, the hyperlink destination may be another file type OR just a webpage. We will need to find or create some examples with these types of hyperlinks. I know sometimes attachments are .doc or .PPT files - I will try to find some examples and see how these are handled in the current packet output, assuming they would be the same if they are hyperlinked. The webpage hyperlinks are the real wildcards.
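To make the three cases above concrete (PDF, another file type such as .doc or .PPT, or a plain webpage), here is a rough sketch that buckets a hyperlink destination by the Content-Type value a server reports. The function name `classify_destination`, the bucket labels, and the list of Office media types are illustrative assumptions, not part of the scraper or metro-pdf-merger.

```python
# Media types for the Word and PowerPoint formats mentioned in the thread.
OFFICE_TYPES = {
    "application/msword",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

WEBPAGE_TYPES = {"text/html", "application/xhtml+xml"}


def classify_destination(content_type):
    """Hypothetical helper: bucket a Content-Type header value.

    Returns one of "pdf", "office", "webpage", or "other".
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return "pdf"
    if media_type in OFFICE_TYPES:
        return "office"
    if media_type in WEBPAGE_TYPES:
        return "webpage"
    return "other"


if __name__ == "__main__":
    for value in ("application/pdf", "text/html; charset=utf-8", "application/msword"):
        print(value, "->", classify_destination(value))
```

Bucketing on the reported Content-Type rather than the URL extension would also cover links that serve a file from an extensionless path, though it relies on the server reporting the type accurately.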

hancush commented 1 year ago

To test:

hancush commented 1 year ago

@shrayshray to create one test report per file type

antidipyramid commented 1 month ago

@neilarellano will try to create these test files.

neilarellano commented 1 week ago

Hi Team,

I ran some tests and found the following: