Nonprofit-Open-Data-Collective / irs-efile-master-concordance-file

The Master Concordance File defines standards and provides documentation necessary to build structured databases from the IRS E-File XML files posted on AWS.
https://nonprofit-open-data-collective.github.io/irs-efile-master-concordance-file/

Error in Script #33

Open ats1958 opened 6 years ago

ats1958 commented 6 years ago

Has anyone successfully downloaded all data recently? Getting the following error:

Exception in thread "main" java.lang.RuntimeException: java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: All access to this object has been disabled (Service: Amazon S3; Status Code: 403; Error Code:

lecy commented 6 years ago

I just did a test-run and I was able to execute this R code without problem.


library( jsonlite )
library( R.utils )

# CREATE A DATA FRAME OF ELECTRONIC FILERS FROM IRS JSON FILES

dat1 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2011.json")[[1]]
dat2 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2012.json")[[1]]
dat3 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2013.json")[[1]]
dat4 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2014.json")[[1]]
dat5 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2015.json")[[1]]
dat6 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2016.json")[[1]]
dat7 <- fromJSON("https://s3.amazonaws.com/irs-form-990/index_2017.json")[[1]]

efiler.index <- rbind( dat1, dat2, dat3, dat4, dat5, dat6, dat7 )

head( efiler.index )

library( xml2 )
library( dplyr )

### EXAMPLE ORGANIZATIONS FROM EACH PERIOD

V_990_2014 <- "https://s3.amazonaws.com/irs-form-990/201543089349301829_public.xml"

V_990_2012 <- "https://s3.amazonaws.com/irs-form-990/201322949349300907_public.xml"

V_990EZ_2014 <- "https://s3.amazonaws.com/irs-form-990/201513089349200226_public.xml"

V_990EZ_2012 <- "https://s3.amazonaws.com/irs-form-990/201313549349200311_public.xml"

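### GENERATE ALL XPATHS: V 990 2014
doc <- read_xml( V_990_2014 )
# xml_ns_strip() modifies doc in place, so the result does not need to be reassigned
xml_ns_strip( doc )
# list the XPath of every element in the document
doc %>% xml_find_all( '//*') %>% xml_path()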

### GENERATE ALL XPATHS: V 990 2012
doc <- read_xml( V_990_2012 )
xml_ns_strip( doc )
doc %>% xml_find_all( '//*') %>% xml_path()

### GENERATE ALL XPATHS: V 990EZ 2014
doc <- read_xml( V_990EZ_2014 )
xml_ns_strip( doc )
doc %>% xml_find_all( '//*') %>% xml_path()

### GENERATE ALL XPATHS: V 990EZ 2012
doc <- read_xml( V_990EZ_2012 )
xml_ns_strip( doc )
doc %>% xml_find_all( '//*') %>% xml_path()
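
The yearly index downloads and the four XPath listings above repeat the same steps, so they could be collapsed into a few small helpers. The following is a minimal sketch, assuming the same bucket URL pattern as the code above; `index_url`, `build_efiler_index`, and `list_xpaths` are hypothetical names introduced for illustration and are not part of the repository.

```r
# Hypothetical helpers (names are illustrative, not from the repository).
# jsonlite and xml2 are called with `::`, so they must be installed but
# do not need to be attached with library().

# Build the index URL for each filing year, following the pattern used above.
index_url <- function(years) {
  sprintf("https://s3.amazonaws.com/irs-form-990/index_%d.json", as.integer(years))
}

# Download each yearly index and stack the results into one data frame.
# Requires network access and assumes every index has the same columns.
build_efiler_index <- function(years = 2011:2017) {
  yearly <- lapply(index_url(years), function(u) jsonlite::fromJSON(u)[[1]])
  do.call(rbind, yearly)
}

# Return the XPath of every element in one return (URL, file path, or XML string).
list_xpaths <- function(x) {
  doc <- xml2::read_xml(x)
  xml2::xml_ns_strip(doc)  # strips the default namespace in place
  xml2::xml_path(xml2::xml_find_all(doc, "//*"))
}
```

Under these assumptions, `efiler.index <- build_efiler_index(2011:2017)` would stand in for the seven `fromJSON` calls, and `list_xpaths(V_990_2014)` for one of the XPath blocks.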

borenstein commented 6 years ago

Depending on how you're authenticating, you may be unable to get S3 data via the s3:// protocol, while having no trouble downloading it anonymously via https. When that happens, it's usually something about permissions on your side--either your IAM role is too restrictive, or the client that you're using to talk to S3 doesn't see your credentials. Do try to get it working via S3, however--batch downloads via S3 are orders of magnitude faster than individual https requests, which each require separate handshakes between your machine and Amazon.
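
To illustrate the anonymous https route described above, here is a minimal sketch that maps an `s3://` URI to the path-style https URL used earlier in this thread and downloads it without credentials. The helper names `s3_to_https` and `fetch_anonymously` are hypothetical, and the sketch assumes the object is publicly readable. It does not cover the faster authenticated S3 route.

```r
# Hypothetical helpers; base R only.

# Rewrite s3://bucket/key as https://s3.amazonaws.com/bucket/key.
# Inputs that do not start with s3:// are returned unchanged.
s3_to_https <- function(uri) {
  sub("^s3://([^/]+)/(.+)$", "https://s3.amazonaws.com/\\1/\\2", uri)
}

# Download one object over https with no credentials.
# Returns TRUE on success and FALSE if the download raises an error
# (for example, a 403 response).
fetch_anonymously <- function(uri, dest_dir = tempdir()) {
  url  <- s3_to_https(uri)
  dest <- file.path(dest_dir, basename(url))
  tryCatch({
    status <- utils::download.file(url, dest, mode = "wb", quiet = TRUE)
    status == 0
  },
  error = function(e) {
    message("Download failed for ", url, ": ", conditionMessage(e))
    FALSE
  })
}
```

If a call such as `fetch_anonymously("s3://irs-form-990/index_2011.json")` succeeds while the `s3://` request still returns a 403, that would point toward the credential or IAM causes mentioned above rather than a problem with the object itself.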

