StatCan / aaw

Documentation for the Advanced Analytics Workspace Platform
https://statcan.github.io/aaw/

AAW-Chrome-Web-Scraping: Web-Scraping with Chrome #1029

Open ParadisF opened 2 years ago

ParadisF commented 2 years ago

Greetings,

I would like an option to have a remote desktop with Chrome and ChromeDriver for web scraping with Selenium.

Similar issue : https://github.com/StatCan/aaw-kubeflow-containers/issues/33

It's currently possible to web-scrape with the default Firefox, but Firefox makes it hard to do so if the element you want to select is not visible in the window, making web scraping more complicated. Chrome and ChromeDriver do not have that problem and can find elements regardless of whether they are currently in the window or not.
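As a workaround with the current Firefox image, an off-screen element can usually be scrolled into view before interacting with it. A minimal sketch, assuming Selenium 4 is installed; the function name and selector argument are illustrative, not from this thread:

```python
# Hedged sketch: scroll an off-screen element into view before clicking it,
# which works around Firefox/geckodriver refusing to interact with elements
# outside the visible window.
SCROLL_INTO_VIEW_JS = "arguments[0].scrollIntoView({block: 'center'});"

def click_offscreen(driver, css_selector):
    """Find an element by CSS selector, scroll it into view, then click it."""
    from selenium.webdriver.common.by import By  # selenium assumed installed
    el = driver.find_element(By.CSS_SELECTOR, css_selector)
    driver.execute_script(SCROLL_INTO_VIEW_JS, el)
    el.click()
```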

Thank you for your time, ParadisF

Jose-Matsuda commented 2 years ago

Will be discussed at Sprint Planning today and I will update this comment with any relevant info.

As Blair said, it might be good to have a separate container in aaw-kubeflow-containers that is specific to this kind of web-scraping request. I suppose my question there would be: will it be a copy of remote-desktop, except with Chromium + ChromeDriver? Or would it be a more 'minimal' version of the current remote-desktop image? I do not know how much of an increase in image size Chromium and ChromeDriver would cause, because if it's not too big, I don't know why we wouldn't have it on top of remote-desktop.

Jose-Matsuda commented 2 years ago

Will have a custom image for you to test out and try soon. Right now it is just remote desktop with Chrome and ChromeDriver installed. If that all works fine, we will make the image smaller by removing other applications on remote desktop, such as QGIS, to keep the image size down.

Somewhat annoying is that when you install Chrome, there does not appear to be a way to pin the version. As such, we will need to be on the watch when we make changes, to ensure that the installed ChromeDriver matches Chrome's version; otherwise it will fail.
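One way to keep the two in sync is to read the installed Chrome's major version and pick the matching driver from it. A minimal sketch, assuming the binary is named google-chrome; the fallback string is purely illustrative for when Chrome is absent:

```python
# Hedged sketch: extract the installed Chrome major version so a matching
# ChromeDriver release can be chosen. The fallback version string is an
# assumption used only when the google-chrome binary is not present.
import re
import subprocess

def chrome_major_version(fallback="Google Chrome 101.0.4951.41"):
    """Return Chrome's major version as an int, or None if unparseable."""
    try:
        out = subprocess.run(
            ["google-chrome", "--version"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        out = fallback  # e.g. when Chrome is not installed locally
    match = re.search(r"(\d+)\.", out)
    return int(match.group(1)) if match else None
```

The major version can then be used to look up the matching driver release (for example, via ChromeDriver's LATEST_RELEASE_&lt;major&gt; endpoint, which existed at the time of this thread) before baking it into the image.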

ParadisF commented 2 years ago

With Selenium v4, I believe, the ChromeDriver gets downloaded each time and installed from the cache. For example, the driver is now set up like this; I don't have a (path_to_driver) anymore:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# options: a webdriver.ChromeOptions() built elsewhere
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
```

Would that be allowed in the remote-desktop image?

Jose-Matsuda commented 2 years ago

@ParadisF Here's a remote desktop image you can test and do some things with; I was able to do something super basic: https://github.com/StatCan/aaw-kubeflow-containers/pull/343#issuecomment-1106704347

Create a new machine and use the following custom image k8scc01covidacr.azurecr.io/remote-desktop:df7c1c74

Also, would you require programs such as RStudio, PSPP, QGIS Desktop, or OpenM++ as well, or can I remove those installations to keep our images smaller?

Jose-Matsuda commented 2 years ago

Posting an update here for posterity. If Blair has not reached out to you already: yesterday during standup, Blair said something to the effect of:

"There are people in the data science division who are more knowledgeable about web scraping and are interested in forking kubeflow-containers so that they can really specify what is needed, as well as add some extra features that we would not do ourselves / that are better left in their hands."

ParadisF commented 2 years ago

@Jose-Matsuda,

For your first question: I mainly just use Python with VS Code; I won't need RStudio, PSPP, QGIS Desktop, or OpenM++.

For your last comment: I am not part of the data science division. I am supposed to be an economist, but what we are asked to do goes more and more into the programmer/developer category. Citizen developer, they call it? Basically, I am all self-taught and I have no idea what I am doing, haha. So having your support is great!

Jose-Matsuda commented 2 years ago

@ParadisF just in case you missed it: if you really do need it at this moment, you could use the custom image I linked above responsibly (though I'm not too deep into web scraping, as I've said many times, so I don't know); you might just need to do some extra installs, which you can do yourself, to get it to work.

blairdrummond commented 2 years ago

CC @StanHatko

Jose-Matsuda commented 2 years ago

Following a conversation with Stan, here were some suggestions as well as my initial findings;

and I think that's all that was really gleaned from that meeting. Stan, if you happen to see this, let me know if there's anything important that I missed.

blairdrummond commented 2 years ago

@StanHatko if there is any special setup required, would you be OK with contributing some basic example notebooks for Gecko- and/or ChromeDriver-based scraping? This will help everyone who uses the images get up to speed faster.
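A geckodriver-based example notebook cell might look like the sketch below. This is a minimal, hedged illustration assuming Selenium 4; the function name, headless flag, and usage are my assumptions, not something from this thread:

```python
# Hedged sketch of a minimal geckodriver-based scraping cell: open a page
# headlessly in Firefox and return its title, quitting the browser afterwards.
def fetch_title(url):
    from selenium import webdriver  # selenium 4 assumed installed
    opts = webdriver.FirefoxOptions()
    opts.add_argument("--headless")  # no display needed inside a notebook pod
    driver = webdriver.Firefox(options=opts)
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # always release the browser, even on error
```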

Also, CC @goatsweater, I think there is an official "Web Scraping Policy" document that might be good to include in the image?

goatsweater commented 2 years ago

I think you're referring to the directive on web scraping at https://icn-rci.statcan.gc.ca/31/31b/31b_038-eng.html. Not sure how, but I think linking to the directive is better than trying to include it directly.

blairdrummond commented 2 years ago

Unfortunately icn links wouldn't work inside the remote-desktop machine :(

Jose-Matsuda commented 2 years ago

We could download it (if we're OK to do so) and maybe put it on the desktop? Or make a custom version of your shell_helpers.sh, link it, and tell them to use their provisioned machines to open it, lol

goatsweater commented 2 years ago

The directives are public info, but if you're going to have a copy, I'd make it clear it's a copy and that, when in doubt, they should reference the official one.

It might be good to think about some kind of signal to update the image if the directive changes. It likely doesn't change often, but it would be bad to spread wrong info.

Is it ECN accessible? Would an ECN link work?

Jose-Matsuda commented 2 years ago

Ok so for now we will omit placing the policy.

ParadisF commented 2 years ago

> @ParadisF just in case you missed it: if you really do need it at this moment, you could use the custom image I linked above responsibly (though I'm not too deep into web scraping, as I've said many times, so I don't know); you might just need to do some extra installs, which you can do yourself, to get it to work.

Hello @Jose-Matsuda, so today was my web-scraping day. It was close: I had to manually install some packages, but the webdriver_manager package worked and automatically downloaded the right ChromeDriver for the job. It almost worked, but it looks like Chrome itself in the remote image is having a hard time; my error is due to the Chrome page crashing.

[image: screenshot of the Chrome page-crash error]
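For what it's worth, Chrome crashing inside a container is often caused by its sandbox and by the small /dev/shm that containers get by default. A hedged sketch of the flags commonly used in that situation; whether they apply to this particular crash is an assumption:

```python
# Hedged sketch: Chrome flags commonly needed when running inside a container.
# Assumption: the crash stems from sandbox / shared-memory limits in the pod.
CONTAINER_CHROME_FLAGS = [
    "--no-sandbox",             # container user namespaces often break the sandbox
    "--disable-dev-shm-usage",  # /dev/shm is tiny in many containers; use /tmp instead
    "--disable-gpu",            # no GPU available in the remote-desktop pod
]

def build_options():
    """Return a selenium ChromeOptions with the container-safe flags applied."""
    from selenium.webdriver.chrome.options import Options  # imported lazily so
    # the flag list above stays usable without selenium installed
    opts = Options()
    for flag in CONTAINER_CHROME_FLAGS:
        opts.add_argument(flag)
    return opts
```

The resulting options object would then be passed as the options= argument when constructing webdriver.Chrome.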

I want to point out that I don't "require" this image per se; I can web-scrape from my personal computer, and I can transfer the public data via GitHub. Reading the directive guidelines, there is nothing saying that I can't, I believe.

That being said, I got a "feedback interview" from DaaaS, and it seems that AAW/DaaaS are trying hard to make something that people like me would want to use (even though I am not sure I am the right target audience), so this is the reason for requesting this feature.

Jose-Matsuda commented 2 years ago

Thank you @ParadisF for the feedback, and yeah having that extra step to transfer does seem a bit tedious.

I did have a better image ready for you (as in, you don't need to install as many things; they are now pre-installed), but our CI changed so that development images can only be used on our dev cluster (which you could use and try if you wanted, but I expect Chrome would still have a hard time).

The priorities for this switched up a bit on our side, and with Chrome beginning to have a hard time, the AAW team probably needs to take time to roadmap and really assess it.

CC @chuckbelisle for priorities and the like