climate-mirror / datasets

For tracking data mirroring progress

EPA Publication List: NSCEP #359

Open hkuchampudi opened 7 years ago

hkuchampudi commented 7 years ago

EPA Publication List

I have mined the metadata for the EPA Publication List and have hosted the direct download links to the PDFs in my repository. I need help downloading the documents themselves, as I do not have the space to store them.

Downloading the Documents

After downloading the directLinks.txt file, you can execute the following command, replacing the placeholders with the appropriate values, to download files in bulk:

awk 'FNR>=[Starting_Line_Number] && FNR<=[Ending_Line_Number]' [Links_Location] | while read -r link; do wget -t 10 -T 10 -U "Mozilla" $(echo $link | tr -d '\r'); done
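
For example, to grab only the first 1000 links (assuming directLinks.txt sits in the current directory), the placeholders would be filled in like this:

awk 'FNR>=1 && FNR<=1000' ./directLinks.txt | while read -r link; do wget -t 10 -T 10 -U "Mozilla" $(echo $link | tr -d '\r'); done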

Download Information

Property                    Value
Number of links/documents   75973
Estimated total filesize    16.386551273 GB

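For a rough sense of scale, that averages out to about 216 KB per document (16.386551273 GB / 75973 links, taking GB as 10^9 bytes).
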
kimmerin commented 7 years ago

The server responds with a 301 (moved permanently) without a redirect target if you don't provide a User-Agent. I started a download session with the following command:

awk 'FNR>=1 && FNR<=75973' ./directLinks.txt | while read -r link; do wget -t 10 -T 10 -U "Mozilla" -a wgetout.txt -nv $(echo $link | tr -d '\r'); done
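
A quick way to reproduce the header behaviour described above is a curl check (a sketch; it assumes directLinks.txt is in the current directory and simply tests its first link):

LINK=$(head -n 1 ./directLinks.txt | tr -d '\r')
# no User-Agent: the server answers 301 with no Location header
curl -sI "$LINK"
# browser-like User-Agent: the PDF headers should come back instead
curl -sI -A "Mozilla/5.0" "$LINK"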

kimmerin commented 7 years ago

After 90 files I get destination-less redirects again, also for files that were successfully downloaded before. Will try again tomorrow.

hkuchampudi commented 7 years ago

@kimmerin Good catch about the user agent :). It was a copy-paste mistake on my part. The server is notoriously slow, with frequent outages, so I am not too surprised by the results you are seeing.

kimmerin commented 7 years ago

It's not an outage. I was able to access the PDFs using Firefox while the download via wget failed, and both tests happened on the same computer. I will have to try it again today on the machine where I started the download, once I am sitting in front of it with a GUI rather than only SSH access.

JeremiahCurtis commented 7 years ago

Two questions: 1) Does this list include the Transportation and Air Quality documents available at https://iaspub.epa.gov/otaqpub/publist1.jsp? If not, 2) is anyone able to grab the OTAQ documents quickly? From what I saw, there are over 26,000 URLs, and all the URL generators I used top out at 1000 links.

hkuchampudi commented 7 years ago

@JeremiahCurtis No, this data dump does not; it focuses on the National Service Center for Environmental Publications (NSCEP) (I have just added this to the issue documentation). As for the OTAQ documents, I can take a look and see if I can mine the metadata and URL links for the PDFs. However, I would recommend opening an issue for those documents if there isn't already an existing one covering them.

hkuchampudi commented 7 years ago

@kimmerin I tested out the script and experienced the same issue as you: redirect / moved-permanently errors after some number of files download successfully. I isolated an example link that resulted in a 301 and tried to navigate to it and download the PDF the way a user would from the website. When trying to download the PDF, the website returns the same download link that wget was requesting along with a blank page. Therefore, I would conclude it is either a temporary service outage on the EPA's part or an IDS/IPS system kicking in.

kimmerin commented 7 years ago

@hkuchampudi No, it's some kind of block. I tried again today and I've got two systems that go out to the internet via different providers (i.e. arrive at the site with different IPs). While the one that stopped working yesterday kept receiving empty 301s, the other system happily downloaded (again) 90 files and received empty 301s afterwards. So I assume some kind of blacklist to prevent the very thing we're trying to do here.

hkuchampudi commented 7 years ago

@kimmerin Thanks for the additional information! Hmmm... that's perplexing. I will take a look and see if there is anything I can do to circumvent the issue. For the time being, though, we may need more help so the link list can be split into chunks and downloaded by several people.
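
One low-tech way to hand out chunks to several people would be GNU split (a sketch; the chunk size and file names are arbitrary placeholders):

# cut the 75973-line list into numbered 5000-line pieces: chunk_00, chunk_01, ...
split -l 5000 -d ./directLinks.txt chunk_
# each volunteer then points the download loop at one chunk_NN file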

hkuchampudi commented 7 years ago

@JeremiahCurtis I have mined the metadata and have posted information on how to download the data in issue #360. Please take a look if you are interested in helping out.

kimmerin commented 7 years ago

For what it's worth, I changed the download script a bit to check the existence of a file before trying to download it. That way the script can be restarted without needing to change the start-line setting inside the script:

awk 'FNR>=281 && FNR<=75973' ./directLinks.txt | while read -r link; do
  link=$(echo "$link" | tr -d '\r')   # strip the carriage return up front so the filename check works
  FILENAME=${link##*/}                # filename part of the URL
  echo "$FILENAME"
  if [ ! -e "$FILENAME" ]; then
    echo "$link"
    wget -t 10 -T 10 -U "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" -nv "$link"
    sleep 60
  fi
done

kimmerin commented 7 years ago

BTW, the sleep 60 in the script was my attempt to keep traffic to the server down and see whether that prevents the IP-based block. It doesn't: after 100 downloads (the earlier count of 90 was because I had done some "dry runs" with single files) the IP gets blocked for further downloads.

h1z1 commented 7 years ago

Just a thought, but downloading one file at a time is actually harder on the server, since you open one connection per file. I'd dump the list to xargs or parallel, or feed wget a URL list directly. You can still wait 60 seconds between requests (-w 60) or, better and more IDS-friendly, use --random-wait. aria2 is a bit better in that regard.
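
A minimal sketch of that approach using wget alone (the cleaned-up list name is arbitrary; -nc skips files that already exist, which mirrors the restart check in the script above):

# strip the carriage returns once, then let wget walk the whole list with randomized pauses
tr -d '\r' < ./directLinks.txt > links-clean.txt
wget -t 10 -T 10 -nc -nv -w 60 --random-wait -U "Mozilla/5.0" -i links-clean.txt

aria2c accepts a URL list the same way via -i, if someone wants to try it instead.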