tomschenkjr opened 7 years ago
Really glad to see this enhancement request. I've used this function combination before and it works well, so I'm glad to see that you're thinking of combining them. An interesting extension of this enhancement -- and one perhaps not vulnerable to throttling -- would be to enable RSocrata users to export a list of column names associated with each dataset (rather than the datasets themselves). That would allow users to investigate which datasets share which fields.
Thanks, @joshmwolff. It's an interesting idea and I can see it being generally useful. But, as I think about it, would it be most useful to retain all of the columns in memory instead of writing them to disk?
@tomschenkjr: I think you're right: it would be inefficient to use the export.socrata() function to get a list of all column names for all datasets within a domain if that means downloading all datasets into memory instead of writing them to disk. That said, I'm not sure I'd necessarily want to download all datasets to disk if I were only interested in keeping column names and ultimately combining those names into a single data table. I see, however, that Socrata's Discovery API will in fact return dataset column names for datasets in a particular domain. Can RSocrata hit the Discovery API? That might enable the package to capture the column names without having to also download the column data itself.
@joshmwolff - right now we're not using the Discovery API, but it's something we're planning for an upcoming release.
In either case, I think read.socrata() could be modified to only extract the column names and not the data. Whether that's through the same SoDA API calls or through the Discovery/Catalog API is probably more of a technical/optimization question.
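For instance, a minimal sketch of grabbing only the field names through a SoDA call (the dataset ID here is just an example, and the $limit trick assumes the dataset has at least one row):

```r
library(httr)
library(jsonlite)

# Sketch: request a single row from a SoDA endpoint and keep only
# the column names, discarding the data itself.
url <- "https://data.cityofchicago.org/resource/xzkq-xp2w.json?$limit=1"
one_row <- fromJSON(content(GET(url), as = "text", encoding = "UTF-8"))
column_names <- names(one_row)
```

One caveat: SoDA JSON omits fields that are empty in the sampled row, so the Discovery/Catalog API would be the more reliable source of a complete column list.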
Would you mind opening a new issue on this? I think it's a worthwhile conversation to track as a separate feature from this one.
Outlining some thoughts on unit tests for this function: ls.socrata().

FWIW, I've been playing with Socrata's "Discovery API" and it works well as a means for creating a small dataframe of dataset names and other metadata. The following worked for me:
library(jsonlite)
# Catalog endpoint, scoped to one domain and restricted to datasets
API <- 'http://api.us.socrata.com/api/catalog/v1?only=datasets&domains=data.cambridgema.gov'
RawMetadata <- fromJSON(txt = API)
Metadata <- RawMetadata[['results']]
Metadata <- Metadata$resource
This avoids having to write any files to disk, as you're just storing a single small dataframe in memory (in Cambridge's case, 88 rows by 17 columns).
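Extending that, the same catalog response appears to include column metadata (a columns_name field on each resource), so the field names for every dataset can be collected without downloading any data. A sketch, with the field names assumed from the catalog v1 response:

```r
library(jsonlite)

API <- 'http://api.us.socrata.com/api/catalog/v1?only=datasets&domains=data.cambridgema.gov'
results <- fromJSON(txt = API)[['results']]

# columns_name is a list-column: one character vector of field names
# per dataset; pair each vector with its dataset identifier.
columns_by_dataset <- setNames(results$resource$columns_name,
                               results$resource$id)
```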
Looks interesting and definitely interested in incorporating the Discovery API (i.e., #114).
Would this fit in line with #128?
Whoops. You're right: my comment belongs in #128 rather than here. Feel free to ignore.
I've pushed the branch to the repo: issue126.
/cc @nicklucius @geneorama
I've pushed my first stab at downloading non-tabular data files. Here's how it works: if the first download URL available is not a CSV, the httr::GET() contents are written straight to a file and read.socrata() is not called. The original filename and proper file extension are appended to the filename.
For geographical data with multiple download choices, it looks like KML is first so that is what is being saved.
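A rough sketch of that non-CSV branch (function and variable names here are illustrative, not the actual export.socrata() internals):

```r
library(httr)

# Sketch: write a non-tabular distribution (KML, shapefile zip, ...)
# straight to disk instead of routing it through read.socrata().
save_raw_distribution <- function(url, out_stem) {
  response <- GET(url)
  stop_for_status(response)
  # Derive the extension from the URL, e.g. .kml for geographic data
  filename <- paste0(out_stem, ".", tools::file_ext(url))
  writeBin(content(response, as = "raw"), filename)
  filename
}
```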
Still to do:
@nicklucius - that's great. I'll check off export for Shapefile/KML/KMZ for now. Looks like this will also work for GeoJSON, but I will play with that for a bit to see how it works.
The compression is surprisingly tricky.
Scratch the GeoJSON remark; that's certainly taken care of. I'll check that off, too.
Getting an error when testing with the bulk files. Appears to be an error with the file name structure. Looking into it.
> export.socrata("https://data.cityofchicago.org/")
Error in file(con, "wb") : cannot open the connection
3. file(con, "wb")
2. writeBin(response$content, filename)
1. export.socrata("https://data.cityofchicago.org/")
In addition: Warning message:
In file(con, "wb") :
cannot open file 'data.cityofchicago.org/qixn-wjxu_2017-05-06_133501."Street Sweeping - 2017 - Map.kml"': Invalid argument
Ok - I've fixed the above error and also added the app_token parameter.
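For reference, the failure above came from quotes and reserved characters inside the content-disposition filename; a sanitizer along these lines (illustrative, not the committed fix) strips characters that file() rejects on some platforms:

```r
# Sketch: clean a filename pulled from a content-disposition header.
sanitize_filename <- function(x) {
  x <- gsub('[\r\n"]', "", x)         # embedded quotes and line breaks
  x <- gsub("[<>:\\\\/|?*]", "_", x)  # characters invalid on Windows
  trimws(x)
}
```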
Later today, I can move this over to the dev branch to be part of the nightly build.
As for unit testing, one option is to set up a valid but fake data.json file that is just a subset of data, placed at a file location we can control. Those files can point to actual data from the Chicago data portal. This is a bit of work, so I don't necessarily like it, but it is one option.
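With testthat, that could look something like this (a sketch; the fixture path and its contents are hypothetical):

```r
library(testthat)
library(jsonlite)

test_that("export.socrata handles a controlled data.json", {
  # A small hand-made subset of a real data.json, checked into the
  # test directory, whose distribution URLs point at live Chicago data.
  catalog <- fromJSON("tests/fixtures/data.json")
  expect_true("dataset" %in% names(catalog))
  expect_gt(length(catalog$dataset), 0)
})
```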
It's now on the dev branch.
Hey guys, been on paternity leave (👶🍼💩) so I haven't had my head in GitHub enough, but @tomschenkjr was kind enough to alert me via email to what is going on.
Big question that'll help this be performant on Socrata portals - for the actual data dumping, are you using the export links (/api/views/$id/rows.$format) or the SODA2 APIs (/resource/$id.$format)?
The former will allow you to download the dataset export in one big file, and takes better advantage of caching where available. It should be faster for you.
@chrismetcalf - Congrats! I can understand how GitHub might not be the first thing on your mind right now.
The export.socrata() function uses the export links that look like /api/views/$id/rows.$format to download data. Is this the faster method? If so, I wonder if that is something to keep in mind for read.socrata() development.
Yes, @nicklucius, it looks like you're all good! The only other recommendation I would make would be to watch for eTag headers if you've got local caching, but if you're using the export links you're already taking advantage of our bulk export and server-side caching.
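The ETag suggestion can be sketched with httr: send a previously saved ETag back as If-None-Match and skip the download on a 304 (illustrative; the local cache layer itself is not shown):

```r
library(httr)

# Sketch: conditional GET that re-downloads only when content changed.
fetch_if_changed <- function(url, etag = NULL) {
  response <- GET(url, add_headers(`If-None-Match` = etag))
  if (status_code(response) == 304) {
    return(NULL)  # cached copy is still current
  }
  list(content = content(response, as = "raw"),
       etag    = headers(response)[["etag"]])
}
```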
Documenting two ideas from the analytics meeting today:

- Keep the ls.socrata results so that we have the information about the data sets in the future.

Not sure if this error message should be a concern?
export.socrata("https://opendata.cheshireeast.gov.uk/")
Warning messages:
1: In no_deniro(result[[columnName]]) : NAs introduced by coercion
2: In no_deniro(result[[columnName]]) : NAs introduced by coercion
3: In no_deniro(result[[columnName]]) : NAs introduced by coercion
It appears to have extracted some files, but I'm not sure it is the complete set of public files.
@James-SR - thanks. It's still a beta feature, so it's good to see some use cases that produce warnings. When we develop it more, we will look at these warnings and handle them more elegantly.
The easiest way to check is if the number of exported documents is the same as the number of entries at https://opendata.cheshireeast.gov.uk/data.json
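That comparison can be scripted; a sketch (the export directory name assumes export.socrata() writes into a folder named after the domain):

```r
library(jsonlite)

catalog  <- fromJSON("https://opendata.cheshireeast.gov.uk/data.json")
expected <- nrow(catalog$dataset)  # datasets listed by the portal
actual   <- length(list.files("opendata.cheshireeast.gov.uk"))
expected == actual  # TRUE when the export is complete
```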
@James-SR - The latest build of export.socrata has been pushed to the issue126 branch. I tested the Cheshire East portal and did not receive the error. This could be because we fixed it, or because the offending data set or element was changed on the portal itself. Feel free to test this again - I hope it is still useful.
There is a quirk with some data listed in Socrata that I wanted to document. Socrata supports several HTML-based "non-data" formats in the data.json file--which we use to list all available data through ls.socrata(). First, Socrata has "Stories" which let users create HTML sites (example). In my recent commit, I've opted to simply skip these files.
Second, Socrata also supports "external data". Sometimes, these external data are links to HTML webpages while other times they link to actual data. For instance, this dataset is linked to a web page not hosted by Socrata.
Our function simply does not handle this well because HTML websites do not have a content disposition. I've made a change to skip links that do not have a content disposition, so these kinds of sites are ignored. Sometimes external data is actually data (e.g., CSV), so those should still be downloaded. Other content, e.g., HTML, will be skipped.
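The content-disposition check described above might be sketched as follows (illustrative, not the committed code; a HEAD request avoids pulling the body just to inspect headers):

```r
library(httr)

# Sketch: treat a distribution URL as downloadable only when the
# server sends a content-disposition header; plain HTML pages do not.
is_downloadable <- function(url) {
  response <- HEAD(url)
  !is.null(headers(response)[["content-disposition"]])
}
```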
The downside of these approaches is that they will ignore some information and could cause confusion. The former scenario is easy to determine because the data.json does not list a distributionUrl for Stories. However, the latter case is more complicated because external data always displays a distributionUrl, even if it points to an HTML-based site.
This also makes it more difficult to write unit tests. For instance, this approach means the data.json file is not in one-for-one correspondence with the data actually downloaded. There isn't an indicator in data.json for what is ultimately downloaded.
I will think of ways to resolve these issues.
I've resolved the issue by having the function follow these rules:

- Download external data when its mediaType is text/csv.
- Skip entries that do not have a downloadUrl (usually because there simply isn't data).
- Skip external data when its mediaType is text/html.
- Otherwise, download the file with a GET command.
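Those rules could be collapsed into a single predicate over each data.json distribution entry (a sketch; the field names follow the data.json schema and the helper name is hypothetical):

```r
# Sketch: should a data.json distribution entry be downloaded?
should_download <- function(dist) {
  if (is.null(dist$downloadURL)) return(FALSE)              # no data at all
  if (identical(dist$mediaType, "text/html")) return(FALSE) # web page
  TRUE  # text/csv and other real content types go through GET
}
```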
The ls.socrata() function supports the listing of all data on a data portal, while read.socrata() downloads individual data sets. Thus, the two can be combined under export.socrata() to download all of the files, neatly compress them, and place them in a single directory.

This sort of functionality can be used by people wishing to archive data portals or to help in the migration from one platform to another. The function should be focused on saving the data to local or cloud-based storage (e.g., S3) and should avoid loading all of the data into memory.
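At a high level the combination might look like this (a sketch only; the field names from the ls.socrata() output are assumed, and the real export.socrata() adds compression, file naming, and throttling):

```r
library(RSocrata)

# Sketch: list every dataset on a portal, then write each one to disk,
# keeping only a single dataset in memory at a time.
export_all <- function(domain, out_dir = ".") {
  catalog <- ls.socrata(domain)  # one row per dataset in data.json
  for (i in seq_len(nrow(catalog))) {
    df  <- read.socrata(catalog$identifier[i])
    out <- file.path(out_dir, paste0(basename(catalog$identifier[i]), ".csv"))
    write.csv(df, out, row.names = FALSE)
    rm(df)  # release before the next dataset
  }
}
```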
Solving #124 will allow an easier integration between ls.socrata() and read.socrata(), so RSocrata v1.7.2-7 or above will be required.

Open considerations:

- Mirroring the portal's directory structure (e.g., data.cityofchicago.org/)
- Rate-limiting export.socrata() to avoid throttling
- Choosing a download format (CSV is data.frame friendly, but JSON downloads faster from Socrata)

An initial alpha is on this gist. This is now on the issue126 branch. Feedback is encouraged as I've done limited testing at this point.