Provides easier interaction with Socrata open data portals http://dev.socrata.com. Users can provide a 'Socrata' data set resource URL, a 'Socrata' Open Data API (SoDA) web query, or a 'Socrata' "human-friendly" URL, and the package returns an R data frame. Converts dates to 'POSIX' format. Manages throttling by 'Socrata'.
This pull request introduces a stable version of an export.socrata() function as outlined in #126. This allows users to download the contents of a data portal to a local directory. This function will download CSVs (compressed), PDFs, Word, Excel, PowerPoints, GeoJSON, Shapefiles, plain text documents (uncompressed), etc. It will not download HTML pages. As part of the process, the function also copies the data.json file to act as an index for other downloaded files.
I've proposed the version as 1.8.0.
Testing portal export
To test this function, I exported all of the data sets from the City of Norfolk, VA portal. Looking at their data.json file, I counted 32 data sets that were not HTML pages and had a downloadable file. Executing export.socrata("https://data.norfolk.gov") resulted in 32 downloaded files plus the copy of the data.json file. Thus, the expected number of files matches the actual number of downloaded files.
Testing non-CSV documents
All of the testing for Norfolk resulted in compressed CSV files; however, I also needed to test the ability to download non-CSV files. Kansas City, Missouri's data portal has an unusually large number of non-CSV data sets, such as PDFs, Word documents, Excel documents, etc.
I tested the function by downloading files from their data portal. The function downloaded PDFs, Word documents, Excel files, and other non-CSV files along with CSV files.
However, I did encounter frequent network timeouts after approximately 80 items were downloaded. I believe this is limited to the network and not an issue with the function itself. While this may not be a bug, it may be a limitation on the ability to export files from Socrata.
Unit Testing
I have not written a unit test. I think any unit test would take too much time and space for typical unit testing. The smallest portal download, Norfolk, took over 30 minutes to complete all downloads.
In general, a recommended method for testing is to choose a reasonably small portal and do the following:
1. Export all files from the portal.
2. When finished, open the data.json file and count all of the entries, excluding those where:
   - distribution/mediaType is blank
   - distribution/mediaType is text/html
   - distribution/downloadURL is blank
3. Compare the count of downloaded files (excluding the data.json file) with the count from step 2.
Ideally, the portal being used to test contains CSV files as well as non-CSV files.
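As a rough illustration, the manual comparison described above could be scripted along these lines. This is only a sketch under stated assumptions: the helper names (is_downloadable, count_expected, count_downloaded) are hypothetical and not part of the package, data.json is assumed to have a top-level dataset list whose entries each carry a distribution list, and only the first distribution of each entry is inspected. Adjust these if the export behaves differently.

```r
# Hypothetical helpers for checking an export; not part of RSocrata.

# TRUE when a data.json entry is expected to produce a downloaded file.
# Assumption: only the first distribution of each entry is considered.
is_downloadable <- function(entry) {
  dist <- entry$distribution
  if (is.null(dist) || length(dist) == 0) return(FALSE)
  first <- dist[[1]]
  is_blank <- function(x) {
    is.null(x) || length(x) == 0 || is.na(x[1]) || !nzchar(x[1])
  }
  media_type <- first$mediaType
  download_url <- first$downloadURL
  # Exclude blank media types, HTML pages, and blank download URLs.
  !is_blank(media_type) &&
    media_type[1] != "text/html" &&
    !is_blank(download_url)
}

# Count the entries in a parsed data.json catalog that should be downloaded.
count_expected <- function(catalog) {
  sum(vapply(catalog$dataset, is_downloadable, logical(1)))
}

# Count the files in the export directory, ignoring the data.json copy.
count_downloaded <- function(export_dir) {
  files <- list.files(export_dir, recursive = TRUE)
  sum(basename(files) != "data.json")
}

# Example usage (the directory name and data.json location are assumptions):
# export_dir <- "data.norfolk.gov"
# catalog <- jsonlite::fromJSON(file.path(export_dir, "data.json"),
#                               simplifyVector = FALSE)
# stopifnot(count_expected(catalog) == count_downloaded(export_dir))
```

Parsing with simplifyVector = FALSE keeps the catalog as nested lists, which makes blank or missing fields easy to detect without relying on how jsonlite would otherwise simplify them into data frames.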