TechAndCheck / hypatia

A server for various scraping items
MIT License

Scraping a private URL should fail better #31

Open reefdog opened 2 years ago

reefdog commented 2 years ago

While testing, I tried to scrape this URL: https://www.instagram.com/p/Cd3-yVDOsLU7nV-8w_cfbhk81PCkTeRdH4itjI0/

Hypatia threw this error:

2022-05-23T20:21:34.922Z pid=23540 tid=5b4 class=ScrapeJob jid=a5c06aa86048593f267b2e3e INFO: start
2022-05-23T20:21:34.926Z pid=23540 tid=5b4 class=ScrapeJob jid=a5c06aa86048593f267b2e3e INFO: Performing ScrapeJob (Job ID: 41fd15be-834f-4e35-b327-ea844ed1dcd3) from Sidekiq(hypatia_development_default) enqueued at 2022-05-23T20:21:34Z with arguments: "https://www.instagram.com/p/Cd3-yVDOsLU7nV-8w_cfbhk81PCkTeRdH4itjI0/", "2a615be7-0382-4ec8-9444-2cde1682c7ac", "https://gentle-cars-own-70-113-133-2.loca.lt"
2022-05-23T20:21:54.507Z pid=23540 tid=5b4 class=ScrapeJob jid=a5c06aa86048593f267b2e3e ERROR: Error performing ScrapeJob (Job ID: 41fd15be-834f-4e35-b327-ea844ed1dcd3) from Sidekiq(hypatia_development_default) in 19580.27ms: NoMethodError (undefined method `[]' for nil:NilClass

      unless graphql_object["items"][0].has_key?("video_versions")
                                    ^^^):
/Users/justin/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/bundler/gems/zorki-ed52f5cce436/lib/zorki/scrapers/post_scraper.rb:26:in `parse'
/Users/justin/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/bundler/gems/zorki-ed52f5cce436/lib/zorki/post.rb:42:in `block in scrape'
/Users/justin/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/bundler/gems/zorki-ed52f5cce436/lib/zorki/post.rb:41:in `map'
/Users/justin/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/bundler/gems/zorki-ed52f5cce436/lib/zorki/post.rb:41:in `scrape'
/Users/justin/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/bundler/gems/zorki-ed52f5cce436/lib/zorki/post.rb:12:in `lookup'
/Users/justin/Projects/duke/hypatia/app/media_sources/instagram_media_source.rb:54:in `retrieve_instagram_post'
/Users/justin/Projects/duke/hypatia/app/media_sources/instagram_media_source.rb:27:in `extract'
/Users/justin/Projects/duke/hypatia/app/media_sources/media_source.rb:35:in `scrape!'
/Users/justin/Projects/duke/hypatia/app/jobs/scrape_job.rb:9:in `perform'

Update: I now realize (comment captured below) that this was because my friend's account was private. I do think we should try to determine this and fail more gracefully; otherwise, our submit queue is gonna stack up with a bunch of unreachable URLs.
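
A rough sketch of what failing more gracefully could look like. The trace suggests `graphql_object["items"]` came back nil for the private post, so the idea is to check for that and raise a descriptive error the job can rescue. All names here (`ContentUnavailableError`, `first_item!`, `scrape_status`) are hypothetical and not part of Zorki or Hypatia, and a missing `"items"` key may not be specific to private accounts:

```ruby
# Hypothetical error class for posts that cannot be read (private, deleted, etc.).
# Not an existing Zorki or Hypatia class.
class ContentUnavailableError < StandardError
  attr_reader :url

  def initialize(url)
    @url = url
    super("Content unavailable (possibly private or deleted): #{url}")
  end
end

# Returns the first entry under "items", or raises a descriptive error
# when the response is not a Hash or "items" is missing or empty.
def first_item!(graphql_object, url)
  items = graphql_object.is_a?(Hash) ? graphql_object["items"] : nil
  raise ContentUnavailableError.new(url) if !items.is_a?(Array) || items.empty?

  items.first
end

# Hypothetical caller: reports the URL as unreachable instead of crashing
# with a NoMethodError, so it does not have to sit in the queue.
def scrape_status(graphql_object, url)
  item = first_item!(graphql_object, url)
  { status: :ok, video: item.key?("video_versions") }
rescue ContentUnavailableError => e
  { status: :unreachable, error: e.message }
end
```

With something like this in place, the job could record the `:unreachable` result and drop the URL rather than retrying it.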

reefdog commented 2 years ago

Oh, lol. My friend's account is private. Didn't realize that until just now! Updating the framing of this issue…