City-Bureau / city-scrapers

Scrape, standardize and share public meetings from local government websites
https://cityscrapers.org
MIT License

Spider: Illinois Health Facilities and Services Review Board #1001

Open pjsier opened 3 years ago

pjsier commented 3 years ago

URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx
Spider Name: il_health_facilities
Agency Name: Illinois Health Facilities and Services Review Board

masoodqq commented 3 years ago

I would like to work on this issue.

pjsier commented 3 years ago

@masoodqq sounds great! Assigning you now

Ni3dzwi3dz commented 3 years ago

Hi, is this issue still open? If yes, I'm willing to help.

palakshivlani-11 commented 1 year ago

Hi, is this issue still open?

haileyhoyat commented 1 year ago

@palakshivlani-11 Hi. Thanks so much for checking out the project. Go for it.

mohdyawars commented 1 year ago

willing to help

haileyhoyat commented 1 year ago

@yawar1101 hi. thanks so much for checking us out. go for it.

godclause commented 9 months ago

Hi @haileyhoyat!

I'd like to try my hand on this one.

haileyhoyat commented 8 months ago

@godclause Hello! Go for it. Cheers.

godclause commented 8 months ago

@haileyhoyat @palakshivlani-11 @yawar1101 @pjsier

Hi!

I hope I'm not overthinking these questions:

The challenge with https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx is that the "juicy" meeting details are only available as downloadable PDFs via hyperlinks on ASP.NET web pages.

I have discovered only a few Python libraries useful for scraping PDFs, but none seem to work for remote scraping, if that makes sense.

  1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

  2. Is there no available Python solution (a library, etc.) to scrape PDFs where they live on the web, without storing them locally?

  3. Is there a standard procedure / code of conduct, etc. that I should follow for applying security and dependency updates for this repo?

appills commented 7 months ago
  1. Is the general consensus around scraping PDF documents that they absolutely must be downloaded, stored, and scraped locally?

No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. I don't know how package management is applied here (e.g. pypdf)

from io import BytesIO

import requests
from pypdf import PdfReader

# Fetch the PDF and parse it straight from the in-memory response body.
resp = requests.get("https://hfsrb.illinois.gov/content/dam/soi/en/web/hfsrb/events/documents/2024/january-23-2024-state-board-meeting-/21-007(4)%20Permit%20Renewal%20Winchester%20ASTC.pdf")
print(resp.headers)
reader = PdfReader(BytesIO(resp.content))
print(reader.pages[0].extract_text())
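
A caveat worth noting: pages[0].extract_text() reads only the first page, and it helps to fail fast on a bad HTTP status before handing bytes to pypdf. A variant sketch (the status check and the full-page walk are additions, not part of the original snippet):

# Variant of the snippet above: check the HTTP status, then walk every page,
# since board packets typically run to many pages.
from io import BytesIO

import requests
from pypdf import PdfReader

url = "https://hfsrb.illinois.gov/content/dam/soi/en/web/hfsrb/events/documents/2024/january-23-2024-state-board-meeting-/21-007(4)%20Permit%20Renewal%20Winchester%20ASTC.pdf"
resp = requests.get(url)
resp.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
reader = PdfReader(BytesIO(resp.content))
print("\n".join(page.extract_text() or "" for page in reader.pages))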
godclause commented 7 months ago

@appills @haileyhoyat

In my experience, running Python operations for city-scrapers is not wholly shell-agnostic (e.g. zsh, bash, csh). With this in mind, I believe package management has proven problematic and ought to be considered when starting a project.

The problem is that zsh is now the default shell on Macs, while the commands provided in our docs assume bash.

For the folks running zsh, should there be verbiage in our docs reminding them to consider changing shells if they're running Catalina or later?

If not, should the docs instead explain the following scenarios:

For Macs

  1. Switch to bash shell and run all commands from there.
  2. Build a Linux Virtual Machine [Ubuntu, Mint, etc.] and run your terminal / CLI commands via bash.

For Windows-based machines

  1. Build a Linux Virtual Machine [Ubuntu, Mint, etc.] and run terminal / CLI commands via bash.

Is any of this at all necessary?

onyangojerry commented 7 months ago

@godclause

Yes, there should be verbiage guiding and/or reminding us to either switch to bash or use a virtual machine, to save us the agony of continuous frustration. I believe it is necessary.

haileyhoyat commented 7 months ago

@godclause @appills @onyangojerry

Hi All. I want to introduce you to Dan (@SimmonsRitchie ). Dan has officially taken over the role as project lead for the entire City Scrapers project.

Dan, idk if this conversation is relevant for you, particularly as you fix a lot of infrastructure things.

Cheers, all.

godclause commented 7 months ago

Thank you for the introduction, @haileyhoyat. Hi Dan, nice to meet you here.

SimmonsRitchie commented 7 months ago

Hi there, @godclause! Nice to meet you too! And thanks for the intro, @haileyhoyat.

@godclause My apologies, I took over the city-scrapers project very recently and I'm juggling a lot of fixes and upgrades right now across the project's 15 repos. I overlooked this issue and conversation.

Re: saving PDFs as files before parsing: I think @appills may have already answered your question, but yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. This is generally more efficient.

Re: shell/OS issues: I am a Mac user but I have experienced my own headaches with a number of the city-scrapers repos. To my mind, it may make a lot of sense to dockerize all the projects. I hope this will make them OS-agnostic and improve the dev experience overall (especially for newcomers). I'd be very interested in any feedback on this subject, though. Let me know if you have thoughts!

godclause commented 7 months ago

@SimmonsRitchie Hi!

I have 'some' thoughts...

  1. For my parsing issue, @appills's answer prompted my initial inquiry into the need for clarity about OS (macOS) updates and how they affect Python dependency installation.

  2. How does Docker compare with Vagrant in usability for platform (OS) agnosticism? What's the expected long-term benefit of Docker for City Scrapers projects versus Vagrant, from a support perspective? I'm all for improved performance, a better experience for newcomers, etc., but what will implementing Docker instead of Vagrant (or vice versa) cost the City Scrapers repos?

I'm hoping I'm within scope on these concerns.

appills commented 7 months ago

Python module/package dependencies should work regardless of platform; are you having problems?


godclause commented 7 months ago


@appills I have edited my comment above. Please excuse the error. Thank you in advance.

To your question: I don't believe the problems are specific to module/package dependencies per se. Dependencies 'should' work regardless of platform (OS) or shell environment, but the case I encountered suggests otherwise for zsh.

Also, there is an evolving consensus that containerizing city-scrapers addresses that concern.

godclause commented 5 months ago

@appills @onyangojerry

Hello:

It seems the expected behavior of this code snippet is to parse text from only a single file.

How does our spider parse PDFs for all future / additional meetings, given the start URL 'https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx'?
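
For reference, the usual Scrapy pattern is to make the listing page the start URL and yield a follow-up request for every PDF link found on each crawl, so newly posted meetings are picked up automatically the next time the spider runs. A minimal sketch, assuming the listing page exposes plain <a> links ending in .pdf (the selector and the yielded dict are assumptions; a real city-scrapers spider would yield the project's Meeting items instead):

from io import BytesIO

import scrapy
from pypdf import PdfReader


class IlHealthFacilitiesSpider(scrapy.Spider):
    name = "il_health_facilities"
    start_urls = [
        "https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx"
    ]

    def parse(self, response):
        # Re-scraping the listing page on every crawl is what picks up
        # future meetings; each PDF link becomes its own request.
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".pdf"):
                yield response.follow(href, callback=self.parse_pdf)

    def parse_pdf(self, response):
        # response.body is the PDF's byte stream; nothing is written to disk.
        reader = PdfReader(BytesIO(response.body))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        yield {"url": response.url, "text": text}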