Chicago / RSocrata

Provides easier interaction with Socrata open data portals (http://dev.socrata.com). Users can provide a 'Socrata' data set resource URL, a 'Socrata' Open Data API (SoDA) web query, or a 'Socrata' "human-friendly" URL, and receive an R data frame in return. Converts dates to 'POSIX' format. Manages throttling by 'Socrata'.
https://CRAN.R-project.org/package=RSocrata

read.socrata() fails on some datasets due to missing response headers from server #118

Closed pschmied closed 7 years ago

pschmied commented 7 years ago

Per @chrismetcalf 's request, I'm posting this here and tagging him to take a look.

I've found that some datasets are returned with a set of response headers that cause read.socrata() to error out. I was led to believe the issue has to do with the type of backend (new vs. old) on the Socrata side.

See below for full details, but the issue seems to be that the response header x-soda2-types isn't getting returned. RSocrata tries to extract the column types from the missing response header, and fails.

library(RSocrata)

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin16.1.0 (64-bit)
## Running under: macOS Sierra 10.12.1

## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] RSocrata_1.7.1-24

## loaded via a namespace (and not attached):
## [1] compiler_3.3.2 httr_1.2.1     plyr_1.8.4     R6_2.2.0       Rcpp_0.12.7   
## [6] jsonlite_1.1   mime_0.5

## Example earthquake dataset referenced in RSocrata docs
ex_url <- "https://soda.demo.socrata.com/resource/4334-bgaj.csv"

## This is an example dataset that I believe to be using Socrata's new back end.
nbe_url <- "https://odn.data.socrata.com/resource/pvug-y23y.csv"

## Loads as expected
ex_df <- read.socrata(ex_url)

## Errors out: "Error: Argument 'txt' must be a JSON string, URL or file."
nbe_df <- read.socrata(nbe_url)

traceback()
## 5: base::stop(..., call. = FALSE)
## 4: stop("Argument 'txt' must be a JSON string, URL or file.")
## 3: jsonlite::fromJSON(response$headers[["x-soda2-types"]])
## 2: getSodaTypes(response)
## 1: read.socrata(nbe_url)

## Get http responses for comparison
ex_resp <- httr::GET(ex_url)
nbe_resp <- httr::GET(nbe_url)

## Earthquake dataset response headers include `x-soda2-types`
names(ex_resp$headers)
##  [1] "server"                      "date"                       
##  [3] "content-type"                "transfer-encoding"          
##  [5] "connection"                  "x-socrata-requestid"        
##  [7] "access-control-allow-origin" "last-modified"              
##  [9] "x-soda2-warning"             "x-soda2-fields"             
## [11] "x-soda2-types"               "x-soda2-legacy-types"       
## [13] "age"                         "x-socrata-region"           
## [15] "content-encoding"           

## NBE-backed entities dataset does not
names(nbe_resp$headers)
##  [1] "server"                          "date"                           
##  [3] "content-type"                    "transfer-encoding"              
##  [5] "connection"                      "x-socrata-requestid"            
##  [7] "access-control-allow-origin"     "x-soda2-data-out-of-date"       
##  [9] "x-soda2-truth-last-modified"     "x-soda2-secondary-last-modified"
## [11] "last-modified"                   "age"                            
## [13] "x-socrata-region"                "content-encoding"     
tomschenkjr commented 7 years ago

Thanks @pschmied, I can replicate this locally and it appears to be a valid bug.

@chrismetcalf - do you think pvug-y23y will always have the same number of rows? Want to create a unit test for SoDA 2 support and curious if I can always check this data set has 68,087 rows.

tomschenkjr commented 7 years ago

@Chicago/rsocrata-devs - I've created a new branch issue118 that includes a unit test for this particular issue. The version is bumped to 1.7.2-1. On Monday we can discuss whether this is a potential hotfix or gets wrapped into our 1.8.0 release, since it deals with a new API. Our documentation may need to be more deliberate in discussing which APIs we've specifically tested.

Looping-in @levyj to help navigate the NBE lingo.

@nicklucius - can you look at this code?

pschmied commented 7 years ago

@tomschenkjr My hunch is that pvug-y23y WILL change. Maybe we could upload a classic, open, very unlikely to change dataset such as iris as a reference. I joined Socrata a few days ago, so I can talk to folks here to see if that would be appropriate.

Before modifying the RSocrata package, we might see if @chrismetcalf can verify that indeed the correct set of headers are being returned from the Socrata side. If NBE-backed datasets are supposed to be returned with type information, then that would suggest a fix on our end.

tomschenkjr commented 7 years ago

@pschmied - first, congrats on the new job! :champagne:

I agree on having some test data sets that we can use for unit testing and suggested it in #45. We've been using the Earthquake example--and it's been fine--but having one for different versions of SoDA would be helpful.

We'll hold off on work for now (unassigning @levyj and @nicklucius for now).

nicklucius commented 7 years ago

@tomschenkjr - I just saw your last message, but was about to post this:

I agree that the code below causes an error because the response header x-soda2-types is not being returned, as @pschmied reported above.

nbe_url <- "https://odn.data.socrata.com/resource/pvug-y23y.csv"
nbe_df <- read.socrata(nbe_url)

read.socrata() uses the data types found in x-soda2-types to trigger posixify() to run on date/time columns and no_deniro() to run on money columns.

As a fix, for any URL where x-soda2-types turns out to be missing, I could redirect the getSodaTypes() function to find the data types by changing the URL from /resource/ to /views/ (a sketch of that fallback follows the JSON below). In this example, the data types for https://odn.data.socrata.com/resource/pvug-y23y.csv appear in the corresponding /views/ response (truncated):

{
  "id" : "pvug-y23y",
  "name" : "Entities",
  "averageRating" : 0,
  "createdAt" : 1473714769,
  "displayType" : "table",
  "downloadCount" : 11,
  "hideFromCatalog" : false,
  "hideFromDataJson" : false,
  "indexUpdatedAt" : 1474044590,
  "newBackend" : true,
  "numberOfComments" : 0,
  "oid" : 23753334,
  "provenance" : "official",
  "publicationAppendEnabled" : true,
  "publicationDate" : 1473714855,
  "publicationGroup" : 13035076,
  "publicationStage" : "published",
  "rowClass" : "",
  "rowIdentifierColumnId" : 269859293,
  "rowsUpdatedAt" : 1474481590,
  "rowsUpdatedBy" : "7krv-k7t3",
  "tableId" : 13035076,
  "totalTimesRated" : 0,
  "viewCount" : 50,
  "viewLastModified" : 1474044581,
  "viewType" : "tabular",
  "columns" : [ {
    "id" : 269859293,
    "name" : "id",
    "dataTypeName" : "text",
    "fieldName" : "id",
    "position" : 1,
    "renderTypeName" : "text",
    "tableColumnId" : 38329919,
    "width" : 186,
    "format" : { }
  }, {
    "id" : 269859296,
    "name" : "name",
    "dataTypeName" : "text",
    "fieldName" : "name",
    "position" : 2,
    "renderTypeName" : "text",
    "tableColumnId" : 38329922,
    "width" : 443,
    "format" : { }
  }, {
    "id" : 269859294,
    "name" : "type",
    "dataTypeName" : "text",
    "fieldName" : "type",
    "position" : 3,
    "renderTypeName" : "text",
    "tableColumnId" : 38329920,
    "width" : 183,
    "format" : { }
  }, {
    "id" : 269859295,
    "name" : "rank",
    "dataTypeName" : "number",
    "fieldName" : "rank",
    "position" : 4,
    "renderTypeName" : "number",
    "tableColumnId" : 38329921,
    "width" : 147,
    "format" : { }
  } ],
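
A minimal sketch of that fallback, for illustration only (getTypesFromViews() is a hypothetical helper rather than the package's internal code, and assumes the httr and jsonlite packages):

library(httr)
library(jsonlite)

getTypesFromViews <- function(resource_url) {
  ## Swap /resource/ for /views/ and drop the output-format extension
  views_url <- sub("/resource/", "/views/", resource_url)
  views_url <- sub("\\.(csv|json)$", "", views_url)
  ## The /views/ metadata lists each column's dataTypeName
  view <- fromJSON(content(GET(views_url), as = "text", encoding = "UTF-8"))
  view$columns$dataTypeName
}

getTypesFromViews("https://odn.data.socrata.com/resource/pvug-y23y.csv")
## Expected to return something like: "text" "text" "text" "number"
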
pschmied commented 7 years ago

Thank you, @tomschenkjr ! I'm very excited--I've been a beneficiary of Socrata-backed open data portals for some time now.

As for @nicklucius 's approach, that seems very reasonable. Sadly, I'm too new on the job to know what the better long-term solution is. That's where I'll rely on Chris, who'll have a much better sense of where the API is and where it's likely to go ;-)

chrismetcalf commented 7 years ago

Catching up...

@chrismetcalf - do you think pvug-y23y will always have the same number of rows? Want to create a unit test for SoDA 2 support and curious if I can always check this data set has 68,087 rows.

I wouldn't do that, since the dataset is likely to change over time. However, you could always create a unit test that passes a known $limit and set an assertion on that.
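
For example, a test along those lines might look like this (illustrative only, following the testthat conventions already used in the package; the $limit value is arbitrary):

test_that("read.socrata honors a known $limit on an NBE-backed dataset", {
  df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.csv?$limit=100")
  expect_equal(100, nrow(df))
})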

@tomschenkjr @levyj In general, would it be good to have a couple of datasets that I maintain in a "known state" for creating test suites against? I could easily set up a couple of datasets somewhere that you could use to test different conditions.

For soda-ruby, I've stubbed out the actual HTTP calls, but it might be good to have some integration tests too.

tomschenkjr commented 7 years ago

@chrismetcalf - 10-4 on not relying on row counts. It'd be very helpful to have static data sets for testing with known conditions.

For the underlying issues, is the SoDA header in NBE returning the correct headers in this case?

chrismetcalf commented 7 years ago

It does appear that the X-SODA2-Fields and X-SODA2-Types headers are omitted when you select the .csv output type. I'm not sure why they'd be gone in that case, so I'll create a ticket with our Engineering team to see what's up.

tomschenkjr commented 7 years ago

Hey @chrismetcalf - just wanted to follow up on this to see whether this was an error in the returned headers or whether we should no longer expect those X-SODA2-* headers.

chrismetcalf commented 7 years ago

@tomschenkjr I just bumped the JIRA ticket with our engineering team to see if I can get an update.

Internal ticket (you won't be able to open that link): https://socrata.atlassian.net/browse/EN-12140

tomschenkjr commented 7 years ago

Thanks!

tomschenkjr commented 7 years ago

@chrismetcalf - just following up on this.

chrismetcalf commented 7 years ago

@tomschenkjr I just checked in on the ticket and it's still in the queue. I bumped the priority, but if you file a ticket with Support and ask for it to be linked to EN-12140, that'll probably get it more attention.

tomschenkjr commented 7 years ago

Will do.

chrismetcalf commented 7 years ago

Thanks @tomschenkjr

tomschenkjr commented 7 years ago

Received feedback from the @socrata support team, and it appears the missing X-SODA2-* headers were an error. Closing this issue since it's not directly related to RSocrata and the engineering team appears to be fixing the API issue.

Thanks for reporting!

chrismetcalf commented 7 years ago

@tomschenkjr FYI, I just got the notification that the header fix for those datasets hit staging, so a fix for the bug on our end should be out soon! Great seeing you guys yesterday!

tomschenkjr commented 7 years ago

This has been resolved by the engineering team. Just let me know if you see it again.

hrecht commented 7 years ago

Hi, I'm getting this same error on two datasets: https://data.healthcare.gov/dataset/2016-QHP-Landscape-Individual-Market-Medical/v7sn-c66v/data and https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6. I've tested a few other datasets from the same site, and those read without errors.

> myurl <- "https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6"
> df <- read.socrata(myurl)
Error: Argument 'txt' must be a JSON string, URL or file.
> traceback()
5: base::stop(..., call. = FALSE)
4: stop("Argument 'txt' must be a JSON string, URL or file.")
3: jsonlite::fromJSON(response$headers[["x-soda2-types"]])
2: getSodaTypes(response)
1: read.socrata(myurl)

I also tried reading from the csv and json resource urls for these datasets with no change in error message.

tomschenkjr commented 7 years ago

Thanks, @hrecht. This seems to be a @Socrata server-side issue. I will ping them to let them know the issue still exists in other data sets--I am not sure of the underlying cause.

If it persists or is being encountered at a high rate, we may need to release a bug fix that skips this validation, or at least throws a warning rather than a fatal error.

hrecht commented 7 years ago

Thanks @tomschenkjr! Understood that it's a @Socrata issue rather than an RSocrata issue; hopefully they'll be able to fix it.

chrismetcalf commented 7 years ago

@tomschenkjr @hrecht I'm following up with the team internally and we'll get back to you once we know more.

@tomschenkjr How reliant are you on the X-SODA2-Fields header in RSocrata? In this case, it looks like we might be deliberately omitting the *-Fields header because that dataset includes so many columns that the column names alone exceed NGINX's 4K limit on total header size. See the X-SODA2-Warning header: "X-SODA2-Fields, X-SODA2-Types, and X-SODA2-Legacy-Types are deprecated and have been suppressed for being too large". We might need a different mechanism.

tomschenkjr commented 7 years ago

@chrismetcalf - we use it to retrieve the data types being used. The most important use is identifying dates and converting them into proper date columns in R. Less importantly, we also use it to identify money fields so R handles them as float/numeric fields instead of strings.

If this were deprecated, we could fall back on guessing the field types.
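
For reference, a rough sketch of what guessing could look like (not current package behavior; utils::type.convert() infers numeric and logical columns, while dates and money strings would still need their own heuristics):

guess_types <- function(df) {
  ## Infer numeric/logical columns from the text itself; anything that
  ## doesn't parse (including dates and "$1,234"-style money strings) stays character
  df[] <- lapply(df, utils::type.convert, as.is = TRUE)
  df
}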

tomschenkjr commented 7 years ago

Reopening for now since we are revisiting a workaround for this.

chrismetcalf commented 7 years ago

Following up on my comment above: we do in fact omit the X-SODA2-* headers when they would push the total response headers over the 4K limit. So I wouldn't rely on them when building a general-purpose library. I'm going to clarify that on dev.socrata.com so hopefully nobody else gets hung up on it: https://github.com/socrata/dev.socrata.com/issues/677

Now we need another, more reliable way to get you the column types. There are of course unsupported ways of getting this, but I'd really rather have you on something real, or pursue the guessing option.

How often are people hitting this issue? It should only be happening for very wide datasets or datasets with very long column names. Another workaround would be to specify a $select to limit the number of columns returned.
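
For example, something like this (illustrative only; the resource URL form and column names are placeholders, not verified fields of that dataset):

read.socrata("https://data.healthcare.gov/resource/enpz-m4q6.csv?$select=plan_id,state")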

nicklucius commented 7 years ago

Our development team sketched out how RSocrata will handle this: if the X-SODA2-* headers are not present, read.socrata() will no longer fail. Instead, it will skip posixify() and no_deniro() and return date and currency fields as character, with a warning. That way, users still get the data they are seeking.

I'll get started on the development for this fix.
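
A simplified sketch of that behavior (the actual implementation in the package may differ):

getSodaTypes <- function(response) {
  header <- response$headers[["x-soda2-types"]]
  if (is.null(header)) {
    warning("The x-soda2-types header is missing; date and money fields ",
            "will be returned as character.")
    return(NULL)  ## downstream code skips posixify()/no_deniro() when types are NULL
  }
  jsonlite::fromJSON(header)
}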

nicklucius commented 7 years ago

@tomschenkjr - Just noting this to help explain my changes: I rebased the issue118 branch since it was a few months behind dev. Then after the fix was completed in issue118, I merged it directly to dev according to the new Git Flow.

tomschenkjr commented 7 years ago

@nicklucius - sounds good. I'm handing the issue over to you for development. Good night, good luck :vulcan_salute:

geneorama commented 5 years ago

The test for this started failing recently.

The section of code:

test_that("Warn instead of fail if X-SODA2-* headers are missing", {
  expect_warning(dfCsv <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.csv?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_warning(dfJson <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.json?$limit=1000"),
                info="https://github.com/Chicago/RSocrata/issues/118")
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.csv"))
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.json"))
  expect_equal("data.frame", class(dfCsv), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal("data.frame", class(dfJson), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(140, ncol(dfJson), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
})

The actual failing test message:

>   expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
Error: columns not equal to ncol(dfCsv).
1/1 mismatches
[1] 150 - 146 == 4
https://github.com/Chicago/RSocrata/issues/118

I thought this might be useful, but it didn't help me:

> setdiff(colnames(dfJson), colnames(dfCsv))
[1] "url"   "url.1" "url.2" "url.3"
> setdiff(colnames(dfCsv), colnames(dfJson))
 [1] "network_url"                                          
 [2] "plan_brochure_url"                                    
 [3] "summary_of_benefits_url"                              
 [4] "drug_formulary_url"                                   
 [5] "adult_dental"                                         
 [6] "premium_scenarios"                                    
 [7] "standard_plan_cost_sharing"                           
 [8] "X_73_percent_actuarial_value_silver_plan_cost_sharing"
 [9] "X_87_percent_actuarial_value_silver_plan_cost_sharing"
[10] "X_94_percent_actuarial_value_silver_plan_cost_sharing"

It looks like the URL columns are different in name only, but the other six columns are missing in the JSON. Not sure if this is related to this issue, or if this is something else?

nicklucius commented 5 years ago

@geneorama - changing the expected value to 146, which lines up the test expectation with the incoming data, removes the error. If there were a problem related to issue #118, an error would still be thrown. Therefore, I think the number of columns returned by Socrata has changed, and that is what caused the test to fail. As such, nothing I'm seeing shows that there's anything related to #118.