deepanprabhu / duckduckgo-images-api

DuckDuckGo Image Search Results - Programmatically download Image Search Results
The Unlicense
82 stars 24 forks

I’ve implemented max_n as well as checked to make sure URLs are unique #12

Open prairie-guy opened 4 years ago

prairie-guy commented 4 years ago

I really appreciate the hard work you have done getting images from the DuckDuckGo search engine. To understand it better, I rewrote it in my own style. I implemented max_n and added a check for duplicate URLs. (It turns out that of 650 images, 20-30 might be duplicates.) I also wanted the output in a format that other code could use to download the images, which is why I stripped out the logging. I also fixed the code so that it can either be imported into Python or used at the command line. Rather than open a pull request, I am posting it here so you can decide whether you want to consider it:


### image_search_ddg.py
### C. Bryan Daniels
### 9/1/2020
### Adapted from https://github.com/deepanprabhu/duckduckgo-images-api
###

import requests, re, json, time, sys

headers = {'authority':'duckduckgo.com','accept':'application/json,text/javascript,*/*; q=0.01','sec-fetch-dest':'empty',
        'x-requested-with':'XMLHttpRequest',
        'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/80.0.3987.163 Safari/537.36',
        'sec-fetch-site':'same-origin','sec-fetch-mode':'cors','referer':'https://duckduckgo.com/','accept-language':'en-US,en;q=0.9'}

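def image_search_ddg(keywords, max_n=100, max_retries=3):
    """Search DuckDuckGo for 'keywords' and return up to 'max_n' unique image URLs.

    'max_retries' (not in the original post) bounds consecutive failed requests.
    """
    url = 'https://duckduckgo.com/'
    # The landing page embeds a 'vqd' token that the image endpoint requires.
    res = requests.post(url, data={'q': keywords})
    searchObj = re.search(r'vqd=([\d-]+)\&', res.text)
    if not searchObj:
        # Report on stderr so that stdout only ever carries URLs.
        print('Token Parsing Failed !', file=sys.stderr)
        return []
    params = (('l','us-en'),('o','json'),('q',keywords),('vqd',searchObj.group(1)),('f',',,,'),('p','1'),('v7exp','a'))
    requestUrl = url + 'i.js'
    urls, seen, failures = [], set(), 0
    while len(urls) < max_n:
        try:
            res = requests.get(requestUrl, headers=headers, params=params)
            data = json.loads(res.text)
        except (requests.RequestException, ValueError):
            # Wait and retry, but give up instead of looping forever.
            failures += 1
            if failures > max_retries: break
            time.sleep(1)
            continue
        failures = 0
        for obj in data.get('results', []):
            image = obj.get('image')
            # Skip duplicates while collecting, so max_n counts unique URLs only.
            if image and image not in seen:
                seen.add(image)
                urls.append(image)
                if len(urls) >= max_n: break
        if 'next' not in data: break
        requestUrl = url + data['next']
    return urls

def print_uniq(urls):
    """Print each URL once, one per line, in first-seen order."""
    for url in dict.fromkeys(urls):
        print(url)

if __name__ == "__main__":
    if len(sys.argv) == 2: print_uniq(image_search_ddg(sys.argv[1]))
    elif len(sys.argv) == 3: print_uniq(image_search_ddg(sys.argv[1], int(sys.argv[2])))
    else: print("usage: python image_search_ddg.py keywords [max_n]")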
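
Since the script prints one URL per line, another program can read that output and fetch the images. Here is a minimal sketch of such a consumer. It is not part of the original repository: the names `plan_downloads` and `download_all` and the default `images` directory are assumptions made for illustration. It uses only the standard library.

```python
import os
import sys
import urllib.request
from urllib.parse import urlparse


def plan_downloads(lines, dest_dir="images"):
    """Map each non-empty URL line to a unique local file path (hypothetical helper)."""
    plan, used = [], set()
    for line in lines:
        url = line.strip()
        if not url:
            continue
        # Use the last path segment as the file name, with a fallback name.
        name = os.path.basename(urlparse(url).path) or "image"
        stem, ext = os.path.splitext(name)
        candidate, n = name, 1
        # Different hosts often reuse names such as "1.jpg"; add a counter on collision.
        while candidate in used:
            candidate = f"{stem}_{n}{ext}"
            n += 1
        used.add(candidate)
        plan.append((url, os.path.join(dest_dir, candidate)))
    return plan


def download_all(plan):
    """Fetch every planned URL, reporting and skipping the ones that fail."""
    for url, path in plan:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        try:
            urllib.request.urlretrieve(url, path)
        except (OSError, ValueError) as err:
            print(f"skipped {url}: {err}", file=sys.stderr)
```

To use it in a pipeline, a small wrapper script could call `download_all(plan_downloads(sys.stdin))` and take the search script's output on standard input. Planning is kept separate from downloading so that the file naming can be checked without touching the network.
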
deepanprabhu commented 4 years ago

Thank you @prairie-guy