csgillespie / efficientR

Efficient R programming: a book
https://csgillespie.github.io/efficientR/

5.3.1: read_tsv example with unexpected parsing error #282

Closed engineerchange closed 3 years ago

engineerchange commented 4 years ago

Running code in Section 5.3.1:

fname = system.file("extdata/voc_voyages.tsv", package = "efficient")
voyages_readr = readr::read_tsv(fname)

I get unexpected warnings that differ from the "expected warning" on row 2841 described in this section.

Looking at the voc_voyages.tsv file in the extdata directory of the efficient package, there appear to be unexpected tab separators in the affected rows (e.g., 1023 and 1025). Specifically, these rows have 3 tabs (rather than 2) preceding the bought field, which pushes the numeric bought value into the logical hired column.
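One way to check for stray separators without opening the file by hand is to count the fields on each line. The following is only a minimal sketch in base R: the tiny file it writes is a made-up stand-in for voc_voyages.tsv (the column names and values are assumptions for illustration), with one extra tab deliberately placed on the last line.

```r
# Hypothetical stand-in for voc_voyages.tsv: the last line carries
# one extra tab, so it has 4 fields instead of 3.
tsv_lines = c(
  "id\thired\tbought",
  "1\tTRUE\t10",
  "2\tFALSE\t20",
  "3\t\tTRUE\t30"
)
tmp = tempfile(fileext = ".tsv")
writeLines(tsv_lines, tmp)

# count.fields() (from utils) returns the number of fields on each line.
# quote = "" and comment.char = "" stop quotes and '#' being treated specially.
n_fields = count.fields(tmp, sep = "\t", quote = "", comment.char = "")

# Lines whose field count differs from the header's are suspect.
# Note these are line numbers in the file, so the header is line 1.
odd_lines = which(n_fields != n_fields[1])
odd_lines
```

Pointing the same `count.fields()` call at `fname` instead of `tmp` should list the candidate lines in the real file, although I have not verified which rows it would report.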

Robinlovelace commented 4 years ago

Interesting. Any ideas what is going on or how to fix the issues?

engineerchange commented 4 years ago

I'm able to replicate with a fresh download from the MonetDB-R website (click on "VOC dataset" about halfway down the page).

My interpretation above is a bit wrong. read_tsv guesses each column's type from the first 1000 rows and concludes that the column is logical, but further down it in fact contains character values, which causes the error.

If I adjust guess_max, I can resolve this parsing error, but then other parsing errors surface. Row 2841 never appears among them, contrary to what 5.3.1 suggests.
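For anyone following along, here is a minimal sketch of the two usual ways to stop readr from settling on a column type too early. It assumes readr is installed and uses a small made-up file (the `id` and `hired` columns and their values are assumptions, not the real data): the column looks logical for the first 1,200 rows and only then contains free text.

```r
library(readr)

# Hypothetical stand-in for the real file: "hired" looks logical for
# 1,200 rows, then contains three values that are plainly text.
n_logical = 1200
toy = data.frame(
  id = seq_len(n_logical + 3),
  hired = c(rep(c("TRUE", "FALSE"), n_logical / 2), "yes", "no", "unknown"),
  stringsAsFactors = FALSE
)
tmp = tempfile(fileext = ".tsv")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Option 1: let readr look at every row before guessing the column types.
wide_guess = suppressMessages(read_tsv(tmp, guess_max = nrow(toy)))

# Option 2: skip guessing altogether and state the types explicitly.
explicit = read_tsv(
  tmp,
  col_types = cols(id = col_integer(), hired = col_character())
)

# Both approaches should read "hired" as character.
class(wide_guess$hired)
class(explicit$hired)
```

Stating `col_types` is the more predictable choice when the intended classes are known; raising `guess_max` is handy for exploration but costs more time on large files.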

In fact, I don't see an error on row 2841 at all from the start.

@Robinlovelace are you able to replicate? I don't have an immediate solution, but this portion of the section likely needs to be rewritten. It's unclear to me why the behaviour would differ from years ago when reading the same file.

Robinlovelace commented 4 years ago

Hmm, interesting. Do you know what the intended classes of the data frames were? It's an excellent example of a tricky, large dataset to read in, and it makes me wonder how other packages such as vroom and data.table would handle it. I have not had a chance to look and am also not sure how this worked years ago, but I welcome any suggestions. It's a nice dataset for testing, that's for sure!
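As a rough starting point for that comparison, the sketch below reads a small made-up file with data.table::fread() and inspects the resulting column classes. It assumes data.table is installed; the file and its `id` and `hired` columns are hypothetical and only mimic a column that looks logical at first and turns out to hold text later, so it says nothing definite about how fread() treats voc_voyages.tsv itself.

```r
library(data.table)

# Hypothetical toy file: "hired" is TRUE/FALSE for 1,200 rows,
# followed by three free-text values.
toy = data.frame(
  id = 1:1203,
  hired = c(rep(c("TRUE", "FALSE"), 600), "yes", "no", "unknown"),
  stringsAsFactors = FALSE
)
tmp = tempfile(fileext = ".tsv")
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# fread() detects column types itself; inspect what it settled on.
voyages_dt = fread(tmp, sep = "\t")
sapply(voyages_dt, class)
```

Swapping `tmp` for `fname` would run the same check on the real file.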

Many thanks for flagging this btw.

alwaysandeep commented 3 years ago

@Robinlovelace I also came across this issue recently while going through this really helpful book, so I had a go at rewriting this section. The changes are in #294. Please take a look.

Regarding the intended classes for the voyages data frame that you raised above, I found this article, which may be of interest:
voyages relational table

Robinlovelace commented 3 years ago

Many thanks for the re-write @alwaysandeep, looks good to me! Awaiting feedback from @csgillespie on this.

engineerchange commented 3 years ago

Very elegant fix! 🔥