hkuchampudi opened this issue 7 years ago
The server responds with a 301 (Moved Permanently) without a redirect target if you don't provide a User-Agent. I started a download session with the following command:
```shell
awk 'FNR>=1 && FNR<=75973' ./directLinks.txt | while read -r link; do wget -t 10 -T 10 -U "Modzilla" -a wgetout.txt -nv $(echo $link | tr -d '\r'); done
```
After 90 files I get destination-less redirects again, also for files that were successfully downloaded before. Will try again tomorrow.
@kimmerin Good catch about the user agent :), it was a copy paste mistake on my part. The server is notoriously slow with frequent outages, so I am not too surprised with the results you are having.
It's not an outage. I was able to access the PDFs using Firefox while the download via wget failed; both tests happened on the same computer. I'll have to try that again today on the computer where I started the download, once I'm sitting in front of it with a GUI and not only SSH access.
Two questions: 1) Does this list include the Transportation and Air Quality documents available at https://iaspub.epa.gov/otaqpub/publist1.jsp? If not: 2) Is anyone able to grab the OTAQ documents quickly? From what I saw, there are over 26,000 URLs, and all the URL generators I used top out at 1,000 links.
@JeremiahCurtis No, this data dump does not include them. It focuses on the National Service Center for Environmental Publications (NSCEP) (I have just added this to the issue documentation). As for the OTAQ documents, I can take a look and see if I can mine the metadata URL links for the PDFs. However, I would recommend opening an issue for these documents if there isn't already an existing issue covering them.
@kimmerin I tested out the script and experienced the same issue you did: redirects and Moved Permanently errors after some number of files had downloaded successfully. I isolated an example link that resulted in a 301 and tried to navigate to and download the PDF as a user would from the website. When trying to download the PDF, the website spits out the same download link that wget was trying to fetch and returns a blank page. Therefore, I would conclude it is either a temporary service outage on the EPA's part or an IDS/IPS system kicking in.
@hkuchampudi No, it's some kind of block. I tried again today and I've got two systems that go out to the internet via different providers (i.e. arrive at the site with different IPs). While the one that stopped working yesterday kept receiving empty 301s, the other system happily downloaded (again) 90 files and received empty 301s afterwards. So I assume some kind of blacklist to prevent the very thing we're trying to do here.
@kimmerin Thanks for the additional information! Hmmm... that's perplexing. I will take a look at it to see if there's anything I can do to circumvent the issue. However, for the time being, we may need some more help so we can chunk and download links.
@JeremiahCurtis I have mined the metadata and have posted information on how to download the data in issue #360. Please take a look if you are interested in helping out.
For what it's worth, I changed the download script a bit to check the existence of a file before trying to download it. That way the script can be restarted without needing to change the start-line setting within the script:
```shell
awk 'FNR>=281 && FNR<=75973' ./directLinks.txt | while read -r link; do
  link=$(echo "$link" | tr -d '\r')          # strip Windows carriage returns up front
  FILENAME=$(echo "$link" | sed 's/^.*\///') # basename of the URL
  echo "$FILENAME"
  if [ ! -e "$FILENAME" ]; then              # skip files we already have
    echo "$link"
    wget -t 10 -T 10 -U "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0" -nv "$link"
    sleep 60
  fi
done
```
BTW, the `sleep 60` in the script was my attempt to keep traffic to the server down, to check whether that prevents the IP-based block. It doesn't: after 100 downloads (the previous count of 90 was because I had done some "dry runs" with single files), the IP gets blocked from further downloads.
Just a thought, but doing one file at a time is actually harder on the server, as you're opening one connection per file. I'd dump the list to xargs or parallel, or feed wget a URL list directly. You can still wait 60 seconds between requests (`-w 60`) or, better and more IDS-friendly, use `--random-wait`. aria2 is a bit better in that regard.
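For instance, the per-file loop could be replaced by one wget session over a pre-built list. `-w`, `--random-wait`, and `-i` are real wget options; the two-line stand-in for `directLinks.txt` and the `chunk.txt` name are illustrative only, and the actual fetch is commented out so the sketch runs offline:

```shell
# Hypothetical two-line stand-in for directLinks.txt (CRLF endings, like the real file)
printf 'http://nepis.epa.gov/a.pdf\r\nhttp://nepis.epa.gov/b.pdf\r\n' > directLinks.txt

# Select a line range and strip carriage returns, producing a clean URL list
awk 'FNR>=1 && FNR<=2' directLinks.txt | tr -d '\r' > chunk.txt
cat chunk.txt

# One wget process for the whole chunk, with a randomized ~60 s delay per file:
# wget -t 10 -T 10 -nv -U "Mozilla/5.0" -w 60 --random-wait -i chunk.txt
```

With `-i chunk.txt` wget handles retries and pacing itself, so the shell loop (and its one-process-per-file overhead) disappears.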
## EPA Publication List
I have mined the metadata for the EPA Publication List and have hosted the direct download links to the PDFs in my repository. I need help mining the documents themselves as I do not have the space to download them.
## Downloading the Documents
You can execute the following command (after downloading the `directLinks.txt` file), replacing the placeholders with the appropriate values, to download files in bulk:
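As a sketch of what such a command could look like, modeled on the invocation quoted earlier in the thread: `START_LINE`/`END_LINE` are assumed placeholder names, the sample `directLinks.txt` contents are made up, and the wget call is commented out so the sketch runs offline:

```shell
# Hypothetical three-line stand-in for directLinks.txt (CRLF endings)
printf 'http://nepis.epa.gov/a.pdf\r\nhttp://nepis.epa.gov/b.pdf\r\nhttp://nepis.epa.gov/c.pdf\r\n' > directLinks.txt

# START_LINE and END_LINE are the placeholders to replace with your chunk bounds
START_LINE=1
END_LINE=2
awk -v s="$START_LINE" -v e="$END_LINE" 'FNR>=s && FNR<=e' ./directLinks.txt |
while read -r link; do
  link=$(printf '%s' "$link" | tr -d '\r')   # strip Windows carriage returns
  echo "$link"
  # actual fetch, commented out here:
  # wget -t 10 -T 10 -U "Mozilla/5.0" -a wgetout.txt -nv "$link"
done > selected.txt
```

Each downloader would claim a non-overlapping `START_LINE`/`END_LINE` range so the 75,973 links can be split across volunteers.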
## Download Information