Chicago / RSocrata

Provides easier interaction with Socrata open data portals http://dev.socrata.com. Users can provide a 'Socrata' data set resource URL, a 'Socrata' Open Data API (SoDA) web query, or a 'Socrata' "human-friendly" URL; the package returns an R data frame. Converts dates to 'POSIX' format. Manages throttling by 'Socrata'.
https://CRAN.R-project.org/package=RSocrata

Number of columns not consistent for JSON and CSV from Socrata #184

Open nicklucius opened 4 years ago

nicklucius commented 4 years ago

The test for this started failing recently.

The section of code:

test_that("Warn instead of fail if X-SODA2-* headers are missing", {
  expect_warning(dfCsv <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.csv?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_warning(dfJson <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.json?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.csv"))
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.json"))
  expect_equal("data.frame", class(dfCsv), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal("data.frame", class(dfJson), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(140, ncol(dfJson), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
})

The actual failing test message:

>   expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
Error: columns not equal to ncol(dfCsv).
1/1 mismatches
[1] 150 - 146 == 4
https://github.com/Chicago/RSocrata/issues/118

I thought this might be useful, but it didn't help me:

> setdiff(colnames(dfJson), colnames(dfCsv))
[1] "url"   "url.1" "url.2" "url.3"
> setdiff(colnames(dfCsv), colnames(dfJson))
 [1] "network_url"                                          
 [2] "plan_brochure_url"                                    
 [3] "summary_of_benefits_url"                              
 [4] "drug_formulary_url"                                   
 [5] "adult_dental"                                         
 [6] "premium_scenarios"                                    
 [7] "standard_plan_cost_sharing"                           
 [8] "X_73_percent_actuarial_value_silver_plan_cost_sharing"
 [9] "X_87_percent_actuarial_value_silver_plan_cost_sharing"
[10] "X_94_percent_actuarial_value_silver_plan_cost_sharing"

It looks like the URL columns are different in name only, but the other six columns are missing in the JSON. Not sure if this is related to this issue, or if this is something else?
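
For anyone who wants to repeat the comparison above without hitting the portal, here is a minimal sketch of the same setdiff() check wrapped in a helper. The name compare_columns and the toy data frames are hypothetical, made up for illustration; they are not part of RSocrata and do not reproduce the real dataset.

```r
# Hypothetical helper (not part of RSocrata): report which column names
# appear in only one of two data frames, and which appear in both.
compare_columns <- function(csv_df, json_df) {
  list(
    only_in_csv  = setdiff(colnames(csv_df), colnames(json_df)),
    only_in_json = setdiff(colnames(json_df), colnames(csv_df)),
    shared       = intersect(colnames(csv_df), colnames(json_df))
  )
}

# Toy stand-ins for dfCsv and dfJson; the values are invented.
csv_like  <- data.frame(network_url = "a", adult_dental = "b", plan_id = "c")
json_like <- data.frame(url = "a", plan_id = "c")

col_diff <- compare_columns(csv_like, json_like)
print(col_diff)
```

Running this against the real dfCsv and dfJson would give the same two name lists shown above, plus the shared names, which may make it easier to see whether a column is renamed or truly absent.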

Originally posted by @geneorama in https://github.com/Chicago/RSocrata/issues/118#issuecomment-543835796

geneorama commented 4 years ago

I don't understand exactly why we're testing the number of columns for #118.

In fact, I don't understand what was special about these data sets or what error they were causing. I see the comment from @hrect https://github.com/Chicago/RSocrata/issues/118#issuecomment-280721435 that these data sets replicate an error, but I don't know why they were causing the error.

@nicklucius do you know?

nicklucius commented 4 years ago

@geneorama this dataset has so many columns that including the names/types in the HTTP response headers (the X-SODA2-* headers) would drive the header size over their limit. So Socrata omits the names/types from the headers in this case. That used to break read.socrata() and throw an error. The development in #118 fixed the break so that it warns the user and then coerces to character.
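
To illustrate the "warn, then coerce to character" behavior described above, here is a rough sketch. It is not the actual RSocrata code: the function name coerce_when_types_missing and its types argument are assumptions made up for this example, standing in for whatever type information the headers would normally supply.

```r
# Hypothetical sketch, not RSocrata's implementation.
# `types` stands in for the column type information that would normally
# arrive in the response headers; NULL means it was omitted.
coerce_when_types_missing <- function(df, types = NULL) {
  if (is.null(types)) {
    warning("Column type information is missing; ",
            "returning all columns as character.")
    # Assigning into df[] keeps the data.frame structure and column names.
    df[] <- lapply(df, as.character)
  }
  df
}

# Toy data frame with non-character columns; the values are invented.
toy <- data.frame(id = 1:2, paid = c(TRUE, FALSE))

# suppressWarnings() is used here only to keep the example output quiet.
toy_chr <- suppressWarnings(coerce_when_types_missing(toy))
str(toy_chr)
```

A test in the style of the one quoted earlier would wrap the call in expect_warning() and then check class() on the result, which is what the first two and the class assertions in that test do.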