RhoBott closed 2 years ago
I think the robotstxt package is helpful for checking that you're allowed to use the data on a page, and it seems like rvest can't do that on its own, so yes, we probably did mean to recommend it. I'm not aware of another way to check for permission, so I'd say we keep it.
In terms of reworking the text...
- It feels clear to me, and I looked at this for the first time at the beginning of the semester
- It does feel a little wordy, so we can make it more compact by:
-- "NOTE: Before you take data from a website, make sure you are allowed to scrape and analyze the data [using robotstxt]." and then adding the two lines of code in one box.
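For reference, here is a rough sketch of what a combined permission-check box might look like. It is not the code currently on the page: the `rules` text and the example paths are made up so the snippet runs without a network connection, and the live one-liner is shown only as a comment with a placeholder URL.

```r
library(robotstxt)

# On a real site the check is usually a single call (placeholder URL):
#   paths_allowed("https://www.example.com/some-page")

# Hypothetical robots.txt content, inlined so this sketch runs offline
rules <- "User-agent: *\nDisallow: /private/\n"

# Parse the rules from text instead of downloading them
rt <- robotstxt(text = rules)

# Ask whether a generic bot ("*") may fetch each path;
# under the rules above, /data/ should be allowed and /private/ should not
allowed <- rt$check(paths = c("/data/", "/private/"), bot = "*")
print(allowed)
```

If the check comes back `FALSE` for the page you want, the scraping step that follows shouldn't be run on it.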
I'd recommend making a subheader for the robotstxt section called something like "Checking for scraping permission" and then a second one that explains how to actually do the scraping
seconding josh - I'd say keep it in, clean it up, and make it its own subhead so it's clear that the robotstxt check is separate from the actual web scraping that happens.
I'm wondering if we could maybe include some visual aids to illustrate the scraping process on this page. Just the written narrative is kind of unclear/confusing.
@joshyam-k / @zolli22 could you take a look at this Tuesday (2021/11/02) if/when you have a minute, and then have later-in-the-day folks (@zolli22 / @avawillis ) be a second set of eyes ... at which point we can add that updated text to what @avawillis and @anaqb will be working on, perhaps, next week?
@avawillis I am not anti-visual aids - maybe that's a new issue / goes on a Future Selves wishlist? (I think we'll always be tweaking these pages, as is the way of Proper Living Documentation...?)
(( this comment brought to you by excessive at-ing ))
just opened a Pull-request for this! We can all work inside of this PR by
02/loading-data/from_internet.Rmd
Let me know if you have any questions about all of this :-)
the current version of our walking folks through how to scrape + bring in data has me ... confused. (for ease of reading, public-facing page)
points of confusion include:
again -- comments from all here by end of day Tuesday 2021/10/26, please + thank you + high-five.