Open jmithani opened 5 years ago
Looking over python-legistar-scraper, it appears as though all attachments will be grabbed and saved.
Steps:
LAMetroBill object.
metro-pdf-merger process.
Question: are there other instances of attachments not being (something convertible to) PDFs? What is done in that case? cc @hancush
Talked with @hancush about this. We have a few more questions to investigate.
How does metro-pdf-merger handle attachments that aren't one of the specified file types?
metro-pdf-merger, does the entire packet fail?
Additional things to do:
metro-pdf-merger logs for potential errors related to the test bill above
ocd (doesn't seem like it)
@jmithani the bill you linked to looks like the scraper interpreted it as a private bill, i.e., we only scrape the bare minimum, which excludes attachments.
to truly test this out, we need a public test bill. (or to find an existing public bill with non-file hyperlink attachments.)
Shelly updated the matter body name to z test z and made the test bill viewable in InSite, so we can scrape it.
And it was scraped! https://ocd.datamade.us/ocd-bill/29060a5c-4ed2-4efa-9240-59d8fdb9d478/
The attachment is saved as link but then the media_type is classified as application/pdf. I'm going to check out the metro-pdf-merger logs for errors, and also investigate how the media_type is assigned.
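For reference while investigating, here is a rough sketch of one way a media type could be derived from an attachment URL. It uses Python's standard mimetypes module and is not a description of how the scraper actually assigns media_type. The function name guess_media_type and the example URLs are hypothetical.

```python
import mimetypes
from urllib.parse import urlparse


def guess_media_type(url):
    """Guess a media type from the file extension in the URL path.

    Returns None when the path has no recognizable extension, which is
    the typical case for a plain webpage link.
    """
    # Use only the path so query strings and fragments are ignored.
    path = urlparse(url).path
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type


# Hypothetical URLs, for illustration only.
print(guess_media_type("https://example.com/files/attachment-a.pdf"))
print(guess_media_type("https://example.com/some/webpage"))
```

With an approach like this, a bare hyperlink would come back as None instead of being labeled application/pdf, so the caller has to decide explicitly what to do with it.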
@shrayshray I was able to confirm that the link attachment did not make it into the Councilmatic website after being scraped. Now that we have a problem to fix, how would you prioritize this in the project board?
hm, metro-pdf-merger seems to think the merge was successful
[September 06, 2019 - 06:55:27][INFO] tasks.py:line:118 | Successful merge! ocd-bill-29060a5c-4ed2-4efa-9240-59d8fdb9d478
The reason I said it didn't make it into Councilmatic is that I re-imported my data locally, including restricted view items, and there wasn't an attachments.
Edit: It exists! metro-pdf-merger.datamade.us/document/ocd-bill-29060a5c-4ed2-4efa-9240-59d8fdb9d478
So when there's a link, it looks like it prints out the webpage into a PDF.
@jmithani I moved this to the Icebox for now; fixing it is lower priority than other tasks lined up at the moment.
Most common use case: Link to documents too large to attach. Ensure hyperlinked attachments are retrieved and included in packets.
Shelly can provide examples of board reports w/ hyperlinked attachments. I'll describe the current behavior of our scraper.
2015-1601 (Attachment A)
2022-0098 (Attachment I)
@shrayshray The scraper associates attachments of all types with a board report, then the merger retrieves each attachment and compiles them with the text of the board report into a packet. So, unless I'm misunderstanding, this should be behaving as expected. See the attachments listed and packet compiled for https://boardagendas.metro.net/board-report/2022-0098/
@hancush great news - yes, this is behaving as expected when the destination of the hyperlink is a PDF. In some cases, the hyperlink destination may be another file type OR just a webpage. We will need to find or create some examples with these types of hyperlinks. I know sometimes attachments are .doc or .PPT files - I will try to find some examples and see how these are handled in the current packet output, assuming they would be the same if they are hyperlinked. The webpage hyperlinks are the real wildcards.
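To make the three cases above concrete (PDF, another file type, or a webpage), here is a minimal sketch that sorts a hyperlink destination by its Content-Type header. The function name classify_destination and the type groupings are assumptions for illustration; the file types metro-pdf-merger actually accepts are not listed in this thread.

```python
# Assumed groupings, for illustration only.
PDF_TYPES = {"application/pdf"}
OFFICE_TYPES = {
    "application/msword",  # .doc
    "application/vnd.ms-powerpoint",  # .ppt
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",  # .docx
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",  # .pptx
}
WEBPAGE_TYPES = {"text/html", "application/xhtml+xml"}


def classify_destination(content_type):
    """Sort a hyperlink destination into a handling category.

    `content_type` is the raw Content-Type header value, for example
    "text/html; charset=utf-8".
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in PDF_TYPES:
        return "pdf"
    if media_type in OFFICE_TYPES:
        return "office"
    if media_type in WEBPAGE_TYPES:
        return "webpage"
    return "other"


print(classify_destination("application/pdf"))
print(classify_destination("text/html; charset=utf-8"))
```

A check like this could help pick one test report per category, with the webpage and other cases being the ones most likely to need special handling.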
To test:
@shrayshray to create one test report per file type
@neilarellano will try to create these test files.
Hi Team,
I ran some tests and found the following:
per @shrayshray: