censusreporter / census-postgres-scripts

Scripts used to set up census-postgres on an Amazon EC2 instance.
MIT License

Download shell scripts for .tar files forbidden #22

Open RobertSellers opened 5 years ago

RobertSellers commented 5 years ago

This is partly cross-posted from https://github.com/aria2/aria2/issues/973. Requests from wget, curl, and aria2 all appear to be refused. The .gz extension is also now missing. Are there any known workarounds?

12/20 16:25:03 [ERROR] CUID#8 - Download aborted. URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
Exception: [AbstractCommand.cc:351] errorCode=29 URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
  -> [HttpSkipResponseCommand.cc:231] errorCode=29 The response status is not successful. status=503

iandees commented 5 years ago

I ran into this last year and spoke with some IT folks at Census about it. Apparently they were enforcing some rules about SSL, which meant requests needed a forged User-Agent header and a Strict-Transport-Security request header. That worked last year, but it isn't working this year. I think they're also blocking wide ranges of AWS IP addresses.

I got around this temporarily by downloading the files from my home and uploading them to the server doing the data load. I subsequently ran into a couple other problems:

I haven't had a chance to look into these issues yet, which is why Census Reporter hasn't gotten the latest release added yet. I'm hoping to figure it out this weekend.

RobertSellers commented 5 years ago

I appreciate the feedback. Also, yes, I'm running on AWS and haven't tested anywhere else so far.

RobertSellers commented 5 years ago

I can add that the exact same problem occurs from my local PC when using wget in the Windows Subsystem for Linux on Windows 10, so this might not be a problem targeted at AWS.

iandees commented 5 years ago

Can you try something that forges the User-Agent header? For example:

wget --debug \
   --header="User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52." \
   --header "Strict-Transport-Security: max-age=31536000" \
   https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar

RobertSellers commented 5 years ago

No luck. It's a wall of 403 errors. uGet Desktop in Windows 10 also isn't working. Yeesh. This data isn't hosted anywhere else in bulk?

loganpowell commented 5 years ago

Hi everyone. I'm sorry to hear you're having issues with this. @iandees with whom did you speak at Census? Can you copy me/forward the email (logan.t.powell@census.gov)?

iandees commented 5 years ago

Hi @loganpowell! I spoke with Jeff Meisel and Lori Carrig last year. I'll forward the email chain.

iandees commented 5 years ago

@loganpowell It seems that your Akamai CDN might be blocking .tar downloads from some user agents? I can use wget on the .zip files OK, but the .tar files are failing.
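
One way to make the .zip versus .tar comparison above repeatable is to print the HTTP status each URL returns for a given User-Agent. The following is a minimal sketch, not something anyone in this thread ran: the `status_of` and `report` helper names, the `UA` variable, and its default value are all hypothetical, and it sends HEAD requests, which a CDN may treat differently from a full GET.

```shell
#!/bin/sh
# Sketch only: print the HTTP status code each URL returns for one User-Agent.
# UA and its default value are hypothetical; override UA to try a browser-like string.
UA="${UA:-status-check}"

status_of() {
  # Send a HEAD request, discard the response headers, print only the status code.
  curl -s -o /dev/null -I -A "$UA" --max-time 20 -w '%{http_code}\n' "$1"
}

report() {
  # One line per URL: "<status> <url>".
  for url in "$@"; do
    printf '%s %s\n' "$(status_of "$url")" "$url"
  done
}

# Run only when URLs are given on the command line.
if [ "$#" -gt 0 ]; then
  report "$@"
fi
```

Saved under a hypothetical name such as `check_status.sh`, it could be run once with the default agent and once with `UA` set to a browser string, passing one .zip URL and one .tar URL, to see whether the status codes differ by file type or by agent.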

iandees commented 5 years ago

I was able to get the download working on AWS with this:

aria2c \
    --allow-overwrite=true \
    --auto-file-renaming=false \
    --dir=/mnt/tmp/acs2017_5yr \
    --max-connection-per-server=5 \
    --force-sequential=true \
    --header='Connection: keep-alive' \
    --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' \
    --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
    --header='Accept-Encoding: gzip, deflate, br' \
    --header='Accept-Language: en-US,en;q=0.9' \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/2017_ACS_Geography_Files.zip" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/documentation/user_tools/ACS_5yr_Seq_Table_Number_Lookup.txt"

RobertSellers commented 5 years ago

This seems to be working as required. Thank you for your diligent work on this.

loganpowell commented 5 years ago

@iandees are .tars now cooperating for you?

loganpowell commented 5 years ago

Naive question: do all AWS requests come from the same IP address, or from a small set of them?

iandees commented 5 years ago

@loganpowell they are, but it sure would be nice to figure out a way to download this data without having to go through all this header trickery. Other parts of the government might call forging these headers fraud 😬.

Requests from AWS come from different IP addresses, but there is a relatively small range of IP addresses and Akamai is probably able to figure them out. My guess that it was an IP block was based on it working from home and not from AWS machines. It's more likely that Census is using some Akamai product to prevent denial of service attacks and it's set to be too restrictive.

loganpowell commented 5 years ago

@iandees I've had this actually happen to me on my own IP (from home using wget for cartography files). I was blacklisted and had to be manually removed from the blacklist. I'm not an expert here, but I believe the problem is when trying to pull a lot of data over the wire very quickly. Have you tried it with some throttling of your requests?

Btw, I'm very happy you figured out a workaround. I don't think what you're doing to work around the blacklisting issue would be considered fraud. You're simply doing what is needed to provide a very important public service.
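
The throttling suggested above could be tried with wget's own rate-limiting flags. This is a minimal sketch under stated assumptions: the `throttled_fetch` helper, the `DRY_RUN` switch, and the rate and wait values are illustrative guesses, not limits published by Census, and nothing in this thread confirms that throttling alone avoids the block.

```shell
#!/bin/sh
# Sketch only: download the ACS archives slowly, with pauses between files.
# RATE_LIMIT and WAIT_SECONDS are illustrative guesses, not values from Census.
RATE_LIMIT="${RATE_LIMIT:-2m}"      # cap bandwidth at roughly 2 MB/s
WAIT_SECONDS="${WAIT_SECONDS:-30}"  # base pause between successive files
DRY_RUN="${DRY_RUN:-1}"             # 1 = print the command, 0 = actually run it

throttled_fetch() {
  # Prepend wget and its throttling flags to the URLs passed as arguments.
  set -- wget --continue --tries=3 \
    --limit-rate="$RATE_LIMIT" --wait="$WAIT_SECONDS" --random-wait "$@"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

BASE=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf
throttled_fetch \
  "$BASE/All_Geographies_Not_Tracts_Block_Groups.tar" \
  "$BASE/Tracts_Block_Groups_Only.tar"
```

By default the script only prints the wget command it would run; setting `DRY_RUN=0` performs the download. `--random-wait` varies the pause around `WAIT_SECONDS` so the requests are not evenly spaced.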