arrrlo / Google-Images-Search

[PYTHON] Search for image using Google Custom Search API and resize & crop afterwards
MIT License
176 stars 34 forks

Search randomly fails with Http error. #149

Closed DragonflyRobotics closed 1 year ago

DragonflyRobotics commented 2 years ago

I am trying to download 200 images of a given object. Here is the configuration of the search header:

_search_params = {
    'q': keyword,
    'num': quantity,
    # 'fileType': 'jpg',
    # 'rights': 'cc_nonderived',
    # 'safe': 'medium',
    'imgType': 'photo',
    'imgSize': 'imgSizeUndefined',
    'imgDominantColor': 'imgDominantColorUndefined',
    'imgColorType': 'imgColorTypeUndefined',
}

Here is the error it throws:

googleapiclient.errors.HttpError: <HttpError 400 when requesting https://customsearch.googleapis.com/customsearch/v1?cx=***&q=door&searchType=image&num=10&start=201&imgType=photo&imgSize=imgSizeUndefined&safe=off&imgDominantColor=imgDominantColorUndefined&imgColorType=imgColorTypeUndefined&key=***&alt=json returned "Request contains an invalid argument.". Details: "Request contains an invalid argument.">

I am not sure what is wrong and I need this to work reliably. Is there a way I can just catch the error and move on? It downloads like 145 images and then just chokes. The thing is that it chokes after downloading exactly 145 images.

Note: I intentionally censored the CX ID and the API key. They are replaced with the correct ones in the real code.

DragonflyRobotics commented 2 years ago

I tried a different search term and it exited after downloading 117 images.

arrrlo commented 2 years ago

Hi @DragonflyRobotics

This is a well-known limitation of the Google search API that applies when the sum of the start and num query parameters is greater than 100: https://developers.google.com/custom-search/v1/reference/rest/v1/cse/list

Frankly, I don't know how to tackle this except by raising a friendly exception or warning or something similar.

DragonflyRobotics commented 2 years ago

Somehow, I was able to download 100 images easily. It choked after 110 or 120 images.

arrrlo commented 2 years ago

Yes, that limit is a pain. Will investigate this further.

DragonflyRobotics commented 2 years ago

I will also try researching and assisting with this issue. I found your repo incredibly useful in my project.

DragonflyRobotics commented 2 years ago

I have been messing around with GIS some more. I found that it doesn't stop at exactly 100. Furthermore, it downloads more images for some keywords and fewer for others. I think it might not have to do with the Google download cap.

arrrlo commented 2 years ago

Hi @DragonflyRobotics

Not all images out there are valid and good to download. A lot of them are plain unreachable, returning 4xx or higher error codes. That is why some keywords yield more images and some fewer: this lib validates each image's availability prior to downloading.

There is nothing more to this lib. If it wasn't for this Google API limit, this lib would download thousands of images without stopping.

And it stops with "Request contains an invalid argument." error by Google, using the same arguments as before the error.

I've tested it again now with num=200, and it looks like the start + num > 100 limit doesn't apply at all. The API goes beyond the 100 limit just fine. But once the start argument surpasses 200, you get the "Request contains an invalid argument." error, and the invalid part is the start argument being bigger than 200.

There is no other explanation. Nothing else changes from request to request.
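
To make the arithmetic concrete, here is a minimal sketch. It assumes start is 1-based, that each page asks for 10 results (the num=10 in the failing URL), and that a request is accepted only while its last result index (start + num - 1) stays at or below 200. That last rule is inferred from the behaviour described in this thread, not taken from Google's documentation, and the helper names are made up for illustration.

```python
# Assumption (inferred from this thread, not from Google's docs): a request is
# accepted only while its last result index, start + num - 1, is at most 200.
MAX_RESULT_INDEX = 200
PAGE_SIZE = 10  # matches num=10 in the failing request URL


def page_starts(pages_needed, page_size=PAGE_SIZE):
    """Return the 1-based start value of each consecutive page."""
    return [1 + i * page_size for i in range(pages_needed)]


def first_rejected_start(starts, page_size=PAGE_SIZE, limit=MAX_RESULT_INDEX):
    """Return the first start whose page would run past the limit, or None."""
    for start in starts:
        if start + page_size - 1 > limit:
            return start
    return None


starts = page_starts(21)             # 21 pages of 10 results each
print(starts[-2], starts[-1])        # 191 201
print(first_rejected_start(starts))  # 201
```

Under these assumptions, 20 pages cover results 1 to 200. If some of those results are unreachable and get skipped, a 21st page is needed to reach the requested quantity, and that page is the start=201 request from the error message.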

karencfisher commented 1 year ago

If there is a hard limit in the Google API of the start argument being <= 200, maybe simply return when that limit is exceeded before making the new request? It is kind of a downer of course, when you just get what you can, I know. It's better though than having it throwing an exception.

I am looking to search through batches of different images, so I would rather not have the process crash out (though I do plan to handle the exception in my code and move on to the next query in the queue I guess.)
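
A rough sketch of that "handle the exception and move on" loop, not tied to this library's API: search_fn is a hypothetical callable standing in for whatever performs one search, and fake_search only simulates a failing query. In real code, expected_errors would presumably be the googleapiclient.errors.HttpError from the traceback in this issue.

```python
def run_queries(queries, search_fn, expected_errors=(Exception,)):
    """Run search_fn(query) for each query; record failures and keep going."""
    results, failures = {}, {}
    for query in queries:
        try:
            results[query] = search_fn(query)
        except expected_errors as exc:
            # Remember why this query failed, then continue with the next one.
            failures[query] = exc
    return results, failures


def fake_search(query):
    """Stand-in for a real search; fails for one query to simulate the limit."""
    if query == "door":
        raise ValueError("Request contains an invalid argument.")
    return [f"{query}-{i}.jpg" for i in range(3)]


done, failed = run_queries(["cat", "door", "dog"], fake_search, (ValueError,))
print(sorted(done))  # ['cat', 'dog']
print(list(failed))  # ['door']
```

Note that this simple version drops whatever a failing query had collected before the error, so it only keeps the batch running.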

DragonflyRobotics commented 1 year ago

I think that is a good idea. We can simply programmatically run until the <200 flag is reached. Then we can just stop the search instance, make a new one, and continue downloading.

arrrlo commented 1 year ago

The problem here is that you simply cannot get more than 200 different images with one search query. When you reach start + num > 200, it's game over. Use a different query term. That's not my rule, it's Google's.

And I don't think a silent fail in that case is a good idea. Everyone should be aware of this limit and handle it for themselves.

The problem is that if the last query has parameters like start=193 and num=10, which goes beyond the 200 limit, it will fail before getting any images. So my idea is, when that happens, to correct the num parameter so that it doesn't go beyond 200 when summed with the start param, and throw an exception afterwards. In that case you are aware of the limitation and have your images as well.

And your code should look like this:

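from google_images_search.exception import GoogleLimit

try:
    gis.search(...)
except GoogleLimit:
    # The 200-result limit was reached; the images fetched so far are kept.
    pass

for image in gis.results():
    # Process each image that was fetched before the limit was hit.
    pass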
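
For illustration only, the "correct num, then throw" idea could be sketched as below. This is not the library's implementation: ResultLimitReached, clamp_num, fetch_page and fake_fetch are hypothetical names, and the code assumes a 1-based start with the limit applying to the last result index (start + num - 1 <= 200).

```python
class ResultLimitReached(Exception):
    """Hypothetical stand-in for the GoogleLimit exception proposed above."""


def clamp_num(start, num, limit=200):
    """Shrink num so the page ends at `limit`; return (num, was_clamped)."""
    allowed = max(0, limit - start + 1)  # results left from a 1-based start
    return min(num, allowed), num > allowed


def fetch_page(start, num, fetch_fn, sink, limit=200):
    """Fetch what still fits under the limit, then signal that it was hit."""
    clamped, hit_limit = clamp_num(start, num, limit)
    if clamped > 0:
        sink.extend(fetch_fn(start, clamped))
    if hit_limit:
        raise ResultLimitReached(f"only results up to {limit} are available")


def fake_fetch(start, num):
    """Stand-in for the real API call; returns result indices, not images."""
    return list(range(start, start + num))


collected = []
try:
    fetch_page(193, 10, fake_fetch, collected)
except ResultLimitReached:
    pass
print(collected)  # [193, 194, 195, 196, 197, 198, 199, 200]
```

The caller still sees the exception, so the limit is not hidden, but the results that fit under it have already been stored in collected.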