c-w / gutenberg

A simple interface to the Project Gutenberg corpus.
Apache License 2.0

Download URI for {} not supported #81

Closed anjabeth closed 7 years ago

anjabeth commented 7 years ago

This error was mentioned in Issue #23 , which was closed, but I'm running into the same error for what seems to be a different reason, so I thought I would create a new one.

I'm having trouble with the same error as in #23 for The Iliad here: https://www.gutenberg.org/ebooks/6130

  File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 72, in load_etext
    download_uri = _format_download_uri(etextno)
  File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 56, in _format_download_uri
    raise ValueError('download URI for {0} not supported'.format(etextno))
ValueError: download URI for 2000 not supported

But this text doesn't seem to be super recent - it was uploaded in 2004. To make sure that the error didn't happen for all books, I tried a couple of others. http://www.gutenberg.org/ebooks/16452, another version of The Iliad, has the same URI error, as does Welsh Fairy Tales (http://www.gutenberg.org/ebooks/9368). Books with bibrec numbers 7250, 5000, 3000, 2000, and 1300 all have the same error.

On The Origin of Species (http://www.gutenberg.org/ebooks/1228) works just fine, but The First Part of Henry the Sixth (http://www.gutenberg.org/ebooks/1100), which is chronologically before On The Origin of Species, raises the URI error as well. That makes it seem unlikely to me that it's an issue with recency of upload (as in #23).

Any idea what the issue might be?

MasterOdin commented 7 years ago

Looks like the mirror that we use (http://www.gutenberg.lib.md.us/) is now returning a "Forbidden" error. Not sure if this is something that will resolve itself soon, or if we should move to another mirror immediately.

For right now, you can either wait to see if the mirror we use comes back online or edit this line to use an HTTP mirror from this list.

@c-w: I defer to you on which you think would be better. My vote would be to move to aleph.gutenberg.org, a mirror that Project Gutenberg itself maintains, which should (hopefully) have fewer issues than third-party ones.

anjabeth commented 7 years ago

Thanks! I was getting that "forbidden" error, but figured it was just because I didn't have permission. I'll try another mirror.

c-w commented 7 years ago

@MasterOdin: Sounds good. Let's move to another mirror.

Some comments:

  1. We should also make the mirror base-url configurable via an environment variable so that users can fix future mirror problems without having to wait for an update of the library.

  2. Is there a way in which we can catch mirror-related issues and re-throw a more descriptive error, e.g. instructing the user to update their mirror base-url?

c-w commented 7 years ago

I'll look into this before the end of the week.

MasterOdin commented 7 years ago

I think I've already got this ready for review, and I'll push it tomorrow.

  1. I added a check for the GUTENBERG_MIRROR environment variable on that line; if it is not set, the code falls back to the aleph.gutenberg.org mirror.

  2. On the first use of text._format_download_uri, the mirror link is checked, and if the response is not OK, an exception is thrown.

Sample output using the current (broken) mirror:

python3 -m gutenberg.acquire.text 2701 moby-raw.txt
INFO:rdflib:RDFLib Version: 4.2.2
usage: text.py [-h] etextno outfile
text.py: error: Could not reach gutenberg mirror. Try setting a different mirror (https://www.gutenberg.org/MIRRORS.ALL) for GUTENBERG_MIRROR environment variable.
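
For readers following along, the two changes described above could look roughly like the sketch below. This is not the code from the PR: the names get_mirror, check_mirror and MirrorUnreachableError are hypothetical, the http:// scheme on the default mirror is an assumption, and only the GUTENBERG_MIRROR variable name and the error text are taken from this thread.

```python
import os
import urllib.error
import urllib.request

# Assumed default: the mirror proposed in this thread (scheme is a guess).
DEFAULT_MIRROR = 'http://aleph.gutenberg.org'

# Message text copied from the sample output above.
MIRROR_HELP = (
    'Could not reach gutenberg mirror. Try setting a different mirror '
    '(https://www.gutenberg.org/MIRRORS.ALL) for GUTENBERG_MIRROR '
    'environment variable.'
)


class MirrorUnreachableError(Exception):
    """Hypothetical exception for a mirror that does not respond OK."""


def get_mirror(environ=os.environ):
    """Return the mirror base URL, preferring GUTENBERG_MIRROR if it is set."""
    return environ.get('GUTENBERG_MIRROR', DEFAULT_MIRROR).rstrip('/')


def check_mirror(mirror, urlopen=urllib.request.urlopen):
    """Raise MirrorUnreachableError unless the mirror answers with a 2xx status."""
    try:
        with urlopen(mirror, timeout=10) as response:
            status = response.getcode()
    except OSError as error:
        # URLError and HTTPError (e.g. 403 Forbidden) are OSError subclasses.
        raise MirrorUnreachableError(MIRROR_HELP) from error
    if not 200 <= status < 300:
        raise MirrorUnreachableError(MIRROR_HELP)
```

With something like this in place, a user could work around a broken mirror by exporting GUTENBERG_MIRROR before running the library instead of editing the installed source. The sketch checks the mirror on every call; running the check only on first use, as described above, would need a cached flag on top of it.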

I'll open a PR tomorrow for this.

c-w commented 7 years ago

Awesome!