RhoBott / data_at_reed

A repository to manage the D@R update!

rvest and robots and confused kbott, oh my! #77

Closed RhoBott closed 2 years ago

RhoBott commented 2 years ago

the current version of our walkthrough on how to scrape + bring in data has me ... confused. (for ease of reading, see the public-facing page)

points of confusion include:

  1. do we mean to recommend this robotstxt business? (if so, this is news to me, but ... neat?)
  2. if we do intend to (1), i think we may need to clarify / rework this text a bit
  3. if we do not intend to (1), neat! deleting is easy; i wanted to confer before making anything Go Away

again -- comments from all here by end of day Tuesday 2021/10/26, please + thank you + high-five.

anaqb commented 2 years ago

I think maybe the robotstxt is helpful to ensure that you can use the data on this page, and it seems like rvest can't do that on its own, so probably yes we did mean to recommend it. I am not aware of another way to check for permission so I would say we keep it.

In terms of reworking the text...

  - I feel like it is clear to me, and I looked at this for the first time at the beginning of the semester
  - it does feel a little wordy, so we can make it more compact by:
    - "NOTE: Before you take data from a website, make sure you are allowed to scrape and analyze the data [using robotstxt]."
    - and then adding the two lines of code in one box.
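For reference, here is a minimal sketch of what that single code box might look like: the permission check first, then the scrape. It assumes the `robotstxt` and `rvest` packages are installed. The helper name `scrape_tables_if_allowed` and the example URL are hypothetical, made up for illustration, and are not taken from the D@R page.

```r
library(robotstxt)  # paths_allowed(): checks a site's robots.txt
library(rvest)      # read_html(), html_table(): read and parse the page

# Hypothetical helper (not from the D@R page): scrape only when permitted.
scrape_tables_if_allowed <- function(url) {
  # Step 1: check for scraping permission.
  # paths_allowed() returns TRUE when robots.txt permits bots on this path.
  if (!isTRUE(paths_allowed(url))) {
    stop("robots.txt does not permit scraping: ", url)
  }

  # Step 2: do the scraping. read_html() downloads the page and
  # html_table() returns a list with one data frame per <table> found.
  page <- read_html(url)
  html_table(page)
}

# Usage with a placeholder URL (an assumption; swap in the real page):
# tables <- scrape_tables_if_allowed("https://www.example.com/some-page")
```

Wrapping both steps in one function is only one way to do it; the same two calls (`paths_allowed()` then `read_html()`) could just as well sit as two plain lines in the box.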

joshyam-k commented 2 years ago

I'd recommend making a subheader for the robotstxt section called something like "Checking for scraping permission" and then a second one that explains how to actually do the scraping

zolli22 commented 2 years ago

seconding josh - I'd say keep it in, clean it up, and make it its own subhead so it's clear that the robotstxt check is separate from the actual web scraping that happens.

avawillis commented 2 years ago

I'm wondering if we could maybe include some visual aids to illustrate the scraping process on this page. Just the written narrative is kind of unclear/confusing.

RhoBott commented 2 years ago

@joshyam-k / @zolli22 could you take a look at this Tuesday (2021/11/02) if/when you have a minute, and then have later-in-the-day folks (@zolli22 / @avawillis ) be a second set of eyes ... at which point we can add that updated text to what @avawillis and @anaqb will be working on, perhaps, next week?

@avawillis I am not anti-visual aids - maybe that's a new issue / goes on a Future Selves wishlist? (I think we'll always be tweaking these pages, as is the way of Proper Living Documentation...?)

(( this comment brought to you by excessive at-ing ))

joshyam-k commented 2 years ago

just opened a pull request for this! We can all work inside of this PR by

  1. opening the data_at_reed project in your rstudio install
  2. "pull"-ing to make sure your local project is up to date
  3. going to the pane in the rstudio interface with "Environment", "History", "Connections", etc. and clicking on 'main', which should be next to "New Branch" (this should be in the top right-hand corner of the pane)
  4. selecting "ro(bott)s-and-rvest"
  5. making changes inside of 02/loading-data/from_internet.Rmd
  6. commit, push, done!

Let me know if you have any questions about all of this :-)