Thanks @pschmied, I can replicate locally and it appears to be a valid bug.
@chrismetcalf - do you think `pvug-y23y` will always have the same number of rows? I want to create a unit test for SoDA 2 support and am curious whether I can always check that this data set has 68,087 rows.
@Chicago/rsocrata-devs - I've created a new branch, `issue118`, that includes a unit test for this particular issue. The version is bumped to 1.7.2-1. On Monday we can discuss whether this is a potential hot fix or should be wrapped into our 1.8.0 release, since it deals with a new API. Our documentation may need to be more deliberate in discussing which APIs we've specifically tested in the past.
Looping-in @levyj to help navigate the NBE lingo.
@nicklucius - can you look at this code?
@tomschenkjr My hunch is that `pvug-y23y` WILL change. Maybe we could upload a classic, open, very-unlikely-to-change dataset such as `iris` as a reference. I joined Socrata a few days ago, so I can talk to folks here to see if that would be appropriate.
Before modifying the RSocrata package, we might see if @chrismetcalf can verify that indeed the correct set of headers are being returned from the Socrata side. If NBE-backed datasets are supposed to be returned with type information, then that would suggest a fix on our end.
@pschmied - first, congrats on the new job! :champagne:
I agree on having some test data sets that we can use for unit testing and suggested it in #45. We've been using the Earthquake example--and it's been fine--but having one for different versions of SoDA would be helpful.
We'll hold off on work for now (unassigning @levyj and @nicklucius).
@tomschenkjr - I just saw your last message, but was about to post this:
I agree the code below causes an error because the response header `x-soda2-types` is not getting returned, as @pschmied reported above.

```r
nbe_url <- "https://odn.data.socrata.com/resource/pvug-y23y.csv"
nbe_df <- read.socrata(nbe_url)
```

`read.socrata()` uses the data types found in `x-soda2-types` to trigger `posixify()` on date/time columns and `no_deniro()` on money columns.
As a fix, for any URL where `x-soda2-types` turns out to be missing, I could redirect the `getSodaTypes()` function to find the data types by changing the URL from `/resource/` to `/views/`. In this example, we can get the data types for https://odn.data.socrata.com/resource/pvug-y23y.csv by requesting https://odn.data.socrata.com/views/pvug-y23y, which returns:

```json
{
  "id" : "pvug-y23y",
  "name" : "Entities",
  "averageRating" : 0,
  "createdAt" : 1473714769,
  "displayType" : "table",
  "downloadCount" : 11,
  "hideFromCatalog" : false,
  "hideFromDataJson" : false,
  "indexUpdatedAt" : 1474044590,
  "newBackend" : true,
  "numberOfComments" : 0,
  "oid" : 23753334,
  "provenance" : "official",
  "publicationAppendEnabled" : true,
  "publicationDate" : 1473714855,
  "publicationGroup" : 13035076,
  "publicationStage" : "published",
  "rowClass" : "",
  "rowIdentifierColumnId" : 269859293,
  "rowsUpdatedAt" : 1474481590,
  "rowsUpdatedBy" : "7krv-k7t3",
  "tableId" : 13035076,
  "totalTimesRated" : 0,
  "viewCount" : 50,
  "viewLastModified" : 1474044581,
  "viewType" : "tabular",
  "columns" : [ {
    "id" : 269859293,
    "name" : "id",
    "dataTypeName" : "text",
    "fieldName" : "id",
    "position" : 1,
    "renderTypeName" : "text",
    "tableColumnId" : 38329919,
    "width" : 186,
    "format" : { }
  }, {
    "id" : 269859296,
    "name" : "name",
    "dataTypeName" : "text",
    "fieldName" : "name",
    "position" : 2,
    "renderTypeName" : "text",
    "tableColumnId" : 38329922,
    "width" : 443,
    "format" : { }
  }, {
    "id" : 269859294,
    "name" : "type",
    "dataTypeName" : "text",
    "fieldName" : "type",
    "position" : 3,
    "renderTypeName" : "text",
    "tableColumnId" : 38329920,
    "width" : 183,
    "format" : { }
  }, {
    "id" : 269859295,
    "name" : "rank",
    "dataTypeName" : "number",
    "fieldName" : "rank",
    "position" : 4,
    "renderTypeName" : "number",
    "tableColumnId" : 38329921,
    "width" : 147,
    "format" : { }
  } ]
}
```

We could then use each column's `dataTypeName` in place of `x-soda2-types`.
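To illustrate how `dataTypeName` values could drive the same conversions, here is a minimal sketch; `columnConversions()` is a hypothetical helper, not part of RSocrata:

```r
# Sketch: map dataTypeName values (recovered from /views/ metadata) to the
# conversion each column would need. Hypothetical helper, not RSocrata code.
columnConversions <- function(dataTypeNames) {
  vapply(dataTypeNames, function(t) {
    switch(t,
      "calendar_date" = "posixify",   # date/time columns -> POSIXct
      "money"         = "no_deniro",  # strip "$" and treat as numeric
      "none")                         # leave other columns untouched
  }, character(1), USE.NAMES = FALSE)
}

columnConversions(c("text", "calendar_date", "money", "number"))
# "none" "posixify" "no_deniro" "none"
```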
Thank you, @tomschenkjr! I'm very excited--I've been a beneficiary of Socrata-backed open-data portals for some time now.
As for @nicklucius's approach, that seems very reasonable. Sadly, I'm too new on the job to know what the better long-term solution is. That's where I'll rely on Chris, who'll have a much better sense of where the API is and where it's likely to go ;-)
Catching up...
@chrismetcalf - do you think pvug-y23y will always have the same number of rows? Want to create a unit test for SoDA 2 support and curious if I can always check this data set has 68,087 rows.
I wouldn't do that, since the dataset is likely to change over time. However, you could always create a unit test that passes a known `$limit` and asserts on that.
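A sketch of that idea, assuming an illustrative helper (not part of RSocrata) that appends a `$limit` to a resource URL; a test can then assert an exact row count instead of relying on the dataset's full size:

```r
# Illustrative helper: append an explicit $limit to a Socrata resource URL.
withLimit <- function(resourceUrl, n) {
  sep <- if (grepl("?", resourceUrl, fixed = TRUE)) "&" else "?"
  paste0(resourceUrl, sep, "$limit=", n)
}

url <- withLimit("https://odn.data.socrata.com/resource/pvug-y23y.csv", 100)
url
# "https://odn.data.socrata.com/resource/pvug-y23y.csv?$limit=100"
# A unit test would then run read.socrata(url) and assert nrow(df) == 100.
```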
@tomschenkjr @levyj In general, would it be good to have a couple of datasets that I maintain in a "known state" for creating test suites against? I could easily set up a couple of datasets somewhere that you could use to test different conditions.
For `soda-ruby`, I've stubbed out the actual HTTP calls, but it might be good to have some integration tests too.
@chrismetcalf - 10-4 on not relying on row counts. It'd be very helpful to have static data sets for testing with known conditions.
For the underlying issue: is NBE returning the correct SoDA headers in this case?
_ Tom Schenk Jr. Chief Data Officer Department of Innovation and Technology City of Chicago (312) 744-2770 tom.schenk@cityofchicago.org | @ChicagoCDO data.cityofchicago.org | opengrid.io | digital.cityofchicago.org | chicago.github.io |dev.cityofchicago.org
It does appear that the `X-SODA2-Fields` and `X-SODA2-Types` headers are omitted when you select the `.csv` output type. I'm not sure why they'd be gone in that case, so I'll create a ticket with our Engineering team to see what's up.
Hey @chrismetcalf - just wanted to follow up on this to see whether this was an error in the returned headers or if we shouldn't be expecting those `X-SODA2-` entries anymore.
@tomschenkjr I just bumped the JIRA ticket with our engineering team to see if I can get an update.
Internal ticket (you won't be able to open that link): https://socrata.atlassian.net/browse/EN-12140
Thanks!
@chrismetcalf - just following-up on this.
@tomschenkjr I just checked in on the ticket and it's still in the queue. I bumped the priority, but if you file a ticket with Support and ask for it to be linked to EN-12140, that'll probably get it more attention.
Will do.
Thanks @tomschenkjr
Received feedback from the @socrata support team: the missing `X-SODA2-` headers were indeed an error. Closing this issue since it's not directly related to RSocrata and the engineering team is fixing the API issue.
Thanks for reporting!
@tomschenkjr FYI, just got the notification that the header fix for those datasets just hit staging, so a fix for the bug on our end should be out soon! Great seeing you guys yesterday!
This has been resolved by the engineering team. Just let me know if you see it again.
Hi, I'm getting this same error on two datasets: https://data.healthcare.gov/dataset/2016-QHP-Landscape-Individual-Market-Medical/v7sn-c66v/data and https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6. I've tested a few other datasets from the same site and those read without errors.
```r
> myurl <- "https://data.healthcare.gov/dataset/2017-QHP-Landscape-Individual-Market-Medical/enpz-m4q6"
> df <- read.socrata(myurl)
Error: Argument 'txt' must be a JSON string, URL or file.
> traceback()
5: base::stop(..., call. = FALSE)
4: stop("Argument 'txt' must be a JSON string, URL or file.")
3: jsonlite::fromJSON(response$headers[["x-soda2-types"]])
2: getSodaTypes(response)
1: read.socrata(myurl)
```
I also tried reading from the csv and json resource urls for these datasets with no change in error message.
Thanks, @hrecht. This seems to be a @Socrata server-side issue. I will ping them to let them know the issue still exists in other data sets--I am not sure of the underlying cause.
If it persists or is being encountered at a high rate, we may need to release a bug fix that skips this validation, or at least throws a warning instead of a fatal error.
Thanks @tomschenkjr! Understood that it's a @Socrata issue rather than RSocrata issue, hopefully they'll be able to fix.
@tomschenkjr @hrecht I'm following up with the team internally and we'll get back to you once we know more.
@tomschenkjr How reliant are you on the `X-SODA2-Fields` header in RSocrata? In this case, it looks like we might be deliberately omitting the `*-Fields` header, since that dataset includes so many columns that the column names alone exceed NGinx's 4K limit on total header size. See the `X-SODA2-Warning: X-SODA2-Fields, X-SODA2-Types, and X-SODA2-Legacy-Types are deprecated and have been suppressed for being too large` header. We might need a different mechanism.
@chrismetcalf - we use it to retrieve the data types. Most importantly, we use it to identify date columns and convert them into a proper date type in R. Less importantly, we also use it to identify money fields so R handles them as float/numeric instead of string.
If this is deprecated, we could fall back to guessing the field types.
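As an illustration of what such guessing could look like (an assumed sketch, not RSocrata code), a money column can be detected by pattern-matching its values:

```r
# Hypothetical fallback when x-soda2-types is unavailable: guess whether a
# character column holds money values by checking every entry against a
# currency pattern (optional "$", digits with commas, optional decimals).
looksLikeMoney <- function(x) {
  all(grepl("^\\$?[0-9,]+(\\.[0-9]+)?$", x))
}

looksLikeMoney(c("$1,200.50", "$75"))  # TRUE
looksLikeMoney(c("N/A", "$75"))        # FALSE
```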
Reopening for now since we are revisiting a workaround for this.
Following up on my comment above: we are in fact conditionally dropping the `X-SODA2-*` headers when they'd be long enough to push the response over the 4K limit. So I wouldn't rely on them when building a general-purpose library. I'm going to clarify that on dev.socrata.com so hopefully nobody else gets hung up on it: https://github.com/socrata/dev.socrata.com/issues/677
Now we need another, more reliable way to get you the column types. There are of course unsupported ways of getting this, but I'd really rather have you on something real, or pursue the guessing option.
How often are people hitting this issue? It should only be happening for very wide datasets or datasets with very long column names. Another workaround would be to specify a `$select` to limit the number of columns returned.
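For instance, a `$select` workaround might look like this (the field names here are made up for illustration):

```r
# Sketch: request only a few columns via $select so the combined header
# size stays under the 4K cap. Field names below are hypothetical.
base <- "https://data.healthcare.gov/resource/enpz-m4q6.csv"
url  <- paste0(base, "?$select=", paste(c("state_code", "plan_id"), collapse = ","))
url
# "https://data.healthcare.gov/resource/enpz-m4q6.csv?$select=state_code,plan_id"
```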
Our development team sketched out how RSocrata will handle this: if `X-SODA2-*` headers are not present, we will make a change so that `read.socrata()` does not fail. Instead, it will skip `posixify()` and `no_deniro()` and return date and currency fields as `character`, with a warning. That way, users still get the data they are seeking.
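A minimal sketch of that warn-instead-of-fail behavior (the shape is assumed; `getSodaTypesSafe()` is illustrative, not the actual RSocrata function):

```r
# Assumed sketch: return NULL with a warning when the header is missing, so
# the caller skips posixify()/no_deniro() and leaves columns as character.
getSodaTypesSafe <- function(headers) {
  if (is.null(headers[["x-soda2-types"]])) {
    warning("X-SODA2-Types header missing; date/money columns left as character")
    return(NULL)
  }
  jsonlite::fromJSON(headers[["x-soda2-types"]])  # existing happy path
}

# No header present: warns and yields NULL instead of stopping.
types <- withCallingHandlers(
  getSodaTypesSafe(list()),
  warning = function(w) invokeRestart("muffleWarning")
)
is.null(types)  # TRUE
```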
I'll get started on the development for this fix.
@tomschenkjr - Just noting this to help explain my changes: I rebased the `issue118` branch since it was a few months behind `dev`. Then, after the fix was completed in `issue118`, I merged it directly to `dev` according to the new Git Flow.
@nicklucius - sounds good. I'm handing the issue over to you for development. Good night, good luck :vulcan_salute:
The test for this started failing recently.
The section of code:

```r
test_that("Warn instead of fail if X-SODA2-* headers are missing", {
  expect_warning(dfCsv <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.csv?$limit=1000"),
                 info="https://github.com/Chicago/RSocrata/issues/118")
  expect_warning(dfJson <- read.socrata("https://data.healthcare.gov/resource/enx3-h2qp.json?$limit=1000"),
                 info="https://github.com/Chicago/RSocrata/issues/118")
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.csv"))
  expect_silent(df <- read.socrata("https://odn.data.socrata.com/resource/pvug-y23y.json"))
  expect_equal("data.frame", class(dfCsv), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal("data.frame", class(dfJson), label="class", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
  expect_equal(140, ncol(dfJson), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
})
```
The actual failing test message:

```r
> expect_equal(150, ncol(dfCsv), label="columns", info="https://github.com/Chicago/RSocrata/issues/118")
Error: columns not equal to ncol(dfCsv).
1/1 mismatches
[1] 150 - 146 == 4
https://github.com/Chicago/RSocrata/issues/118
```
I thought this might be useful, but it didn't help me:

```r
> setdiff(colnames(dfJson), colnames(dfCsv))
[1] "url"   "url.1" "url.2" "url.3"
> setdiff(colnames(dfCsv), colnames(dfJson))
 [1] "network_url"
 [2] "plan_brochure_url"
 [3] "summary_of_benefits_url"
 [4] "drug_formulary_url"
 [5] "adult_dental"
 [6] "premium_scenarios"
 [7] "standard_plan_cost_sharing"
 [8] "X_73_percent_actuarial_value_silver_plan_cost_sharing"
 [9] "X_87_percent_actuarial_value_silver_plan_cost_sharing"
[10] "X_94_percent_actuarial_value_silver_plan_cost_sharing"
```

It looks like the URL columns differ in name only, but the other six columns are missing from the JSON. I'm not sure whether this is related to this issue or something else.
@geneorama - changing the expected value to 146, which lines up the test expectation with the incoming data, removes the error. If there were a problem related to issue #118, an error would still be thrown. Therefore, I think the number of columns returned by Socrata has changed, and that is what caused the test to fail. Nothing I'm seeing suggests this is related to #118.
Per @chrismetcalf 's request, I'm posting this here and tagging him to take a look.
I've found that some datasets are returned with a set of response headers that cause `read.socrata()` to error out. I was led to believe the issue has to do with the type of backend (new vs. old) on the Socrata side. See below for full details, but the issue seems to be that the response header `x-soda2-types` isn't getting returned. RSocrata tries to extract the column types from the missing response header, and fails.