Suggestions for Ch2 - Reading data

joelostblom commented 2 years ago

[x] Add north/south vs GPS coordinates examples to illustrate relative and absolute paths
[x] Don't emphasize "delimiter", it is referred to as "separator" in python.
[x] Use text syntax highlighting instead of code for the preview of the file content
[x] There are no "normal and expected messages" when reading in a file successfully with pandas
[x] "how many lines to skip" -> "how many ROWS to skip"
[x] Explain header = None
[x] "So we needed to use different tools for the job": we use the same tool in Python
[x] Don't use inplace with rename (or any pandas function, it is discouraged and will be deprecated)
- [x] Also format this more properly
[x] R - The learnobj says we should learn about col_names but we only ever use rename
[x] Note potential issue with having to install openpyxl separately for reading Excel files with pandas
[x] Update table at the end to remove read_table
[x] We don't explain autoload and autoload_with
[x] Why use head and tail first, instead of shape twice (more natural)?
[x] Learning how to read and write records-formatted json files could be a valuable addition.
[x] Fix formatting of pip install output
[x] R & Py - Is the selector gadget better than built-in inspectors?
[x] Format craigslist HTML with correct syntax highlighting
[x] Pandas read_html page FileNotFoundError, explain "droplevels"
- [x] I think we should also build up this method more to highlight its immense convenience, and not say that it is "fantastic" to read via beautiful soup.
[x] I think scrapy is both more powerful and intuitive than beautiful soup
[x] Twitter images broken
[x] Introduce the print function and for loops before using them for tweepy
[x] Another data file not found for the tweets
[x] Is this SQL approach really easier than using pd.read_sql? For example (more examples here):
```
import pandas as pd
import sqlalchemy as sqla

# Connect to a local SQLite file and read the query result into a data frame
db = sqla.create_engine("sqlite:///mydata.sqlite")
pd.read_sql("SELECT * FROM test", db)
```
- If we really want to remove all SQL syntax, we could use IBIS for a more pandas-like syntax, but I don't have experience with that personally https://ibis-project.org/docs/3.1.0/#features
[x] #36
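
A minimal sketch of the `header=None` and `rename` items above, assuming a hypothetical header-less file (the city data and column names are made up for illustration, not taken from the chapter):

```python
import io
import pandas as pd

# Hypothetical two-column file contents with no header row.
raw = io.StringIO("Vancouver,49.28\nToronto,43.65\n")

# header=None tells pandas the first line is data, so the columns
# get the integer labels 0 and 1 instead of names.
cities = pd.read_csv(raw, header=None)

# rename returns a new data frame; assign the result back
# instead of passing inplace=True.
cities = cities.rename(columns={0: "city", 1: "latitude"})
```

Assigning the result keeps the call chainable and avoids relying on `inplace`.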

trevorcampbell commented 1 year ago

[x] look through read_csv (and to_csv) documentation and see if there are any other useful arguments to discuss (e.g. relating to indices)

trevorcampbell commented 1 year ago

@joelostblom after a brief skim, ibis looks super neat. I am kind of tempted to switch to that, given some more investigation. And maybe mention to students the option to send raw SQL to the DB via pd.read_sql in a note box or something like that.

I will look at scrapy vs beautifulsoup shortly

joelostblom commented 1 year ago

Yeah, ibis really looks impressive. My hesitation is that I don't know anyone who uses it, so I don't have good insight into corner cases or real life experience/feedback.

trevorcampbell commented 1 year ago

I just played a bit with ibis now. It's way easier to use and more natural than sqlalchemy. I would be worried if we were doing advanced stuff, but since our course just does very simple select/filter/execute, I am going to switch us over.

Thanks for the suggestion!

trevorcampbell commented 1 year ago

I also am commenting out the web scraping and API stuff for this round, since we have more important things to handle for Jan. Issue opened to reintroduce it later #64

joelostblom commented 1 year ago

> look through read_csv (and to_csv) documentation and see if there are any other useful arguments to discuss (e.g. relating to indices)

Just adding to this, the ones I use most often that we have not covered are skipinitialspace and parse_dates. I think chunksize could be useful too. That said, I am unsure if they fit in this intro chapter (and maybe not at all in the book), or could maybe be part of the data cleaning chapter (at least the first two)?
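
For reference, a rough sketch of what these three arguments do, assuming a small made-up CSV string (the column names and values are hypothetical):

```python
import io
import pandas as pd

# Hypothetical CSV text with a space after each comma.
text = (
    "date, city, temp\n"
    "2022-01-01, Vancouver, 4.5\n"
    "2022-01-02, Toronto, -3.0\n"
)

# skipinitialspace drops the blank that follows each delimiter;
# parse_dates converts the named column to a datetime dtype.
df = pd.read_csv(io.StringIO(text), skipinitialspace=True, parse_dates=["date"])

# chunksize makes read_csv return an iterator of smaller data frames
# rather than one large frame; here each chunk holds one row.
chunks = list(pd.read_csv(io.StringIO(text), skipinitialspace=True, chunksize=1))
```

Without `skipinitialspace=True`, the column labels and string values in this sample would keep their leading spaces.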

UBC-DSCI / introduction-to-datascience-python

Suggestions for Ch2 - Reading data #38