NikolaiT / GoogleScraper

A Python module to scrape several search engines (like Google, Yandex, Bing, Duckduckgo, ...). Including asynchronous networking support.
https://scrapeulous.com/
Apache License 2.0
2.6k stars 734 forks

Issue #33: which selector_class to use should be determined #37

Closed leadscloud closed 9 years ago

leadscloud commented 9 years ago

Regarding issue #33:

selector_dict = {
    'results': {
        'us_ip': {
            'container': '#b_results',
            'result_container': '.b_algo',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > .b_attribution > p::text',
            'title': 'h2::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#b_results',
            'result_container': '.b_algo',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > p::text',
            'title': 'h2::text',
            'visible_link': 'cite::text'
        }
    },
    'ads_main': {
        'us_ip': {
            'container': '#b_results .b_ad',
            'result_container': '.sb_add',
            'link': 'h2 > a::attr(href)',
            'snippet': '.sb_addesc::text',
            'title': 'h2 > a::text',
            'visible_link': 'cite::text'
        },
        'de_ip': {
            'container': '#b_results .b_ad',
            'result_container': '.sb_add',
            'link': 'h2 > a::attr(href)',
            'snippet': '.b_caption > p::text',
            'title': 'h2 > a::text',
            'visible_link': 'cite::text'
        }
    }
}
for result_type, selector_class in selector_dict.items():
    for selector_specific, selectors in selector_class.items():

Because results has two entries, us_ip and de_ip, serp_result ends up with doubled results: 10 items are parsed with 'snippet': '.b_caption > p::text' and another 10 with 'snippet': '.sb_addesc::text'. But in China, 'snippet': '.sb_addesc::text' yields no snippet. The result is:

{'ads_main': [{'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': None,
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': None,
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://3298057.r.msn.com/?ld=d3OD_VLpTKI-kMJ0dF_CITFjVUCUxN0FalWnZMSIt3e7v5R19iISY77aojSiFe6ICKt_Glrsm9zefr15xonxtlypbOSfJY40JpRjtqLly5PaXmtjTwPO6DXFoVcJ-f0Vxl_fWD8pHGWWYwGRaiOxFoPF2_8AU&u=ipasshortcut.com%2f%3fid%3d5999%26tid%3dpro',
               'snippet': None,
               'title': 'Direct Sales Marketing | breakthroughmastermind.com',
               'visible_link': 'http://breakthroughmastermind.com'},
              {'link': 'http://2482071.r.msn.com/?ld=d3FaCavjiziXjBQh-qAZr74zVUCUzVXXesnDZKVdzwgfz3UKNj8WBli3-mU_uKnqAfyu2GPpArwAvi3NkoBAmE0U5pjoej_X8YS9efyHgNzo4KvJH1c-YGf3xzSoD-JiCkfYWYxU1Dv6Y1PYCZLVk-vQjJPp4&u=list.qoo10.sg%2fgmkt.inc%2fCategory%2fGroup.aspx%3fg%3d10%26jaehuid%3d2000149996',
               'snippet': None,
               'title': 'Best e-Ticket Deals | Qoo10.sg',
               'visible_link': 'www.Qoo10.sg'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': None,
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': None,
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': 'Over 60 Million Visitors.',
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': 'Bring 1,000 premium targeted visitors to your website for $7.95',
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'},
              {'link': 'http://3298057.r.msn.com/?ld=d3OD_VLpTKI-kMJ0dF_CITFjVUCUxN0FalWnZMSIt3e7v5R19iISY77aojSiFe6ICKt_Glrsm9zefr15xonxtlypbOSfJY40JpRjtqLly5PaXmtjTwPO6DXFoVcJ-f0Vxl_fWD8pHGWWYwGRaiOxFoPF2_8AU&u=ipasshortcut.com%2f%3fid%3d5999%26tid%3dpro',
               'snippet': 'Discover How To Make Your First $3,000 A Month With This Proven System!',
               'title': 'Direct Sales Marketing | breakthroughmastermind.com',
               'visible_link': 'http://breakthroughmastermind.com'},
              {'link': 'http://2482071.r.msn.com/?ld=d3FaCavjiziXjBQh-qAZr74zVUCUzVXXesnDZKVdzwgfz3UKNj8WBli3-mU_uKnqAfyu2GPpArwAvi3NkoBAmE0U5pjoej_X8YS9efyHgNzo4KvJH1c-YGf3xzSoD-JiCkfYWYxU1Dv6Y1PYCZLVk-vQjJPp4&u=list.qoo10.sg%2fgmkt.inc%2fCategory%2fGroup.aspx%3fg%3d10%26jaehuid%3d2000149996',
               'snippet': 'USS, SEA Aquarium, Batam & a lot more awesome deals!',
               'title': 'Best e-Ticket Deals | Qoo10.sg',
               'visible_link': 'www.Qoo10.sg'},
              {'link': 'http://2413684.r.msn.com/?ld=d3ruRnTwsPmIaUls4aKL--NjVUCUwrVYiq1RZFM9IFMBK7NWB-VE_xchEIW6-kApI8yQTwbqgY9lCh4N2avp9OGntqJvaeKM425XlnNiZn6iFU6Fageo0NS1hMQrKO8AQ0Q3N0SI8hPQRdQPYO5FdEJgA10_g&u=http%3a%2f%2findex.about.com%2fslp%3f%26q%3dbest%2bseo%2btools%26sid%3d9b3473ca-1503-47a7-9e0b-2a013d5accd7-0-ab_msb%26kwid%3dbest%2520seo%2520tools%26cid%3d3906103690',
               'snippet': 'Over 60 Million Visitors.',
               'title': 'Best Seo Tools - Best Seo Tools Search Now!',
               'visible_link': 'About.com/Best Seo Tools'},
              {'link': 'http://45020106.r.msn.com/?ld=d3fjxPO_IBMPmvR0k1vrVO0zVUCUwdwFN31ryLdmieEW8NCMrtoRo9BZC_Rt6QNsMHdBqwNkwm2xTRf-bD-B9TZcEmXwmbbIYYkCU6q2Se1zsPlvS6j7PRSDszHqscGsegkzRkFCAxF1mAqvMiPFSs1ON2Eao&u=surferdudehits.com',
               'snippet': 'Bring 1,000 premium targeted visitors to your website for $7.95',
               'title': 'Premium Traffic for $7.95 | surferdudehits.com',
               'visible_link': 'surferdudehits.com'}],
 'num_results': '',
 'results': [{'link': 'http://best-seo-tools.net/',
              'snippet': None,
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},
             {'link': 'http://www.best-5.com/seo-tools/',
              'snippet': None,
              'title': '2014 Best SEO Tools | Best 5 SEO Tool Reviews',
              'visible_link': 'www.best-5.com/seo-tools'},
             {'link': 'http://seo-tools-review.toptenreviews.com/',
              'snippet': None,
              'title': 'SEO Tools Review 2014 | Best SEO Keyword Tools',
              'visible_link': 'seo-tools-review.toptenreviews.com'},
             {'link': 'http://www.bestseotools.net/',
              'snippet': None,
              'title': 'www.bestseotools.net',
              'visible_link': 'www.bestseotools.net'},
             {'link': 'http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html',
              'snippet': None,
              'title': 'Best SEO Tools - SEO & Inbound Marketing Blog …',
              'visible_link': 'www.iblogzone.com/2012/02/best-seo-tools-for-2012.html'},
             {'link': 'http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/',
              'snippet': None,
              'title': 'The Best SEO Tools: What, How, and Why - …',
              'visible_link': 'www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842'},
             {'link': 'http://moz.com/blog/100-free-seo-tools',
              'snippet': None,
              'title': '100 Free SEO Tools & Resources for Every …',
              'visible_link': 'moz.com/blog/100-free-seo-tools'},
             {'link': 'http://bestseotools.com/',
              'snippet': None,
              'title': 'Best SEO Tools of 2014',
              'visible_link': 'bestseotools.com'},
             {'link': 'http://www.socialseo.com/the-top-15-free-seo-tools.html',
              'snippet': None,
              'title': 'The Best 15 Free SEO Tools Online - Top SEO …',
              'visible_link': 'www.socialseo.com/the-top-15-free-seo-tools.html'},
             {'link': 'http://www.link-assistant.com/',
              'snippet': None,
              'title': 'Link-Assistant.Com - Official Site',
              'visible_link': 'www.link-assistant.com'},
             {'link': 'http://best-seo-tools.net/',
              'snippet': 'SEO Company : Spider view This tool ... Site Ranking. Website Cloaking Check This tool lets you check a list of urls for googlebot cheaters : SEO Company ...',
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},
             {'link': 'http://www.best-5.com/seo-tools/',
              'snippet': 'Looking for SEO tools? Our reviews of the best Search Engine Optimization Tools will help you choose the program that is best for you. Make the right choice',
              'title': '2014 Best SEO Tools | Best 5 SEO Tool Reviews',
              'visible_link': 'www.best-5.com/seo-tools'},
             {'link': 'http://seo-tools-review.toptenreviews.com/',
              'snippet': 'Looking for the best SEO tools? Read expert reviews and compare features of the best, cheapest and sometimes free SEO tools.',
              'title': 'SEO Tools Review 2014 | Best SEO Keyword Tools',
              'visible_link': 'seo-tools-review.toptenreviews.com'},
             {'link': 'http://www.bestseotools.net/',
              'snippet': 'Dominio registrato con Totalhosting.it Potresti essere interessato anche a : Power by ; Copyright 2014 Phonia Srl-P.I.02050680442-All Rights Reserved',
              'title': 'www.bestseotools.net',
              'visible_link': 'www.bestseotools.net'},
             {'link': 'http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html',
              'snippet': 'SEO Tools are designed to help make our SEO efforts a bit easier and less tedious. While there are many out there, here are some SEO tools to get you started.',
              'title': 'Best SEO Tools - SEO & Inbound Marketing Blog …',
              'visible_link': 'www.iblogzone.com/2012/02/best-seo-tools-for-2012.html'},
             {'link': 'http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/',
              'snippet': "Power-charge your SEO with the industry's finest SEO tools. Rankings, backlinks, competitors, reports, analytics - you name it - all in one place.",
              'title': 'The Best SEO Tools: What, How, and Why - …',
              'visible_link': 'www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842'},
             {'link': 'http://moz.com/blog/100-free-seo-tools',
              'snippet': 'At Moz, we love using premium SEO Tools. Paid tools are essential when you need advanced features, increased limits, historical features, or professional support. For ...',
              'title': '100 Free SEO Tools & Resources for Every …',
              'visible_link': 'moz.com/blog/100-free-seo-tools'},
             {'link': 'http://bestseotools.com/',
              'snippet': 'Comprehensive List of the Best SEO Tools of 2014 - Updated Monthly. Find the Best and Top Rated SEO Tools',
              'title': 'Best SEO Tools of 2014',
              'visible_link': 'bestseotools.com'},
             {'link': 'http://www.socialseo.com/the-top-15-free-seo-tools.html',
              'snippet': 'The Top 15 Free SEO Tools Posted September 13th, 2007 by Brian Gilley. We are building out a more comprehensive list of SEO and social media tools that you might …',
              'title': 'The Best 15 Free SEO Tools Online - Top SEO …',
              'visible_link': 'www.socialseo.com/the-top-15-free-seo-tools.html'},
             {'link': 'http://www.link-assistant.com/',
              'snippet': 'Get all SEO tools in one pack - download free edition of SEO PowerSuite and get top 10 rankings for your site on Google and other search engines!',
              'title': 'Link-Assistant.Com - Official Site',
              'visible_link': 'www.link-assistant.com'}]}

--- 10 results with no snippet ---
http://best-seo-tools.net/
http://www.best-5.com/seo-tools/
http://seo-tools-review.toptenreviews.com/
http://www.bestseotools.net/
http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html
http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/
http://moz.com/blog/100-free-seo-tools
http://bestseotools.com/
http://www.socialseo.com/the-top-15-free-seo-tools.html
http://www.link-assistant.com/

--- the same 10 results repeated, with snippet ---
http://best-seo-tools.net/
http://www.best-5.com/seo-tools/
http://seo-tools-review.toptenreviews.com/
http://www.bestseotools.net/
http://www.iblogzone.com/2012/02/best-seo-tools-for-2012.html
http://www.searchenginejournal.com/the-best-seo-tools-what-how-and-why/60842/
http://moz.com/blog/100-free-seo-tools
http://bestseotools.com/
http://www.socialseo.com/the-top-15-free-seo-tools.html
http://www.link-assistant.com/
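
One way to read the issue title is that the parser should pick a single selector set per page instead of keeping the output of all of them. The following is only a rough sketch of that idea, assuming the results of each selector set have already been parsed into lists of dicts. The helper names count_missing and pick_best_selector_set are hypothetical and are not part of GoogleScraper.

```python
def count_missing(results):
    """Count the fields that came back as None across all parsed results."""
    return sum(1 for result in results for value in result.values() if value is None)


def pick_best_selector_set(parsed_by_set):
    """Return the name of the selector set with the most complete output.

    parsed_by_set maps a selector-set name (e.g. 'us_ip') to the list of
    result dicts parsed with that set. Empty outputs are ranked last, then
    sets are ordered by how many None fields they produced.
    """
    def score(name):
        results = parsed_by_set[name]
        return (len(results) == 0, count_missing(results))

    return min(parsed_by_set, key=score)


if __name__ == '__main__':
    parsed = {
        'us_ip': [{'link': 'http://best-seo-tools.net/', 'snippet': None}],
        'de_ip': [{'link': 'http://best-seo-tools.net/', 'snippet': 'SEO Company ...'}],
    }
    print(pick_best_selector_set(parsed))
```

With this approach only the output of the chosen set would be kept, so the same page would not contribute two copies of each result.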

NikolaiT commented 9 years ago

Fixed this now. Many thanks, your help is really appreciated!

Fixed in parsing.py around line 180, in the innermost for loop:

# only add the parsed item if we haven't done so beforehand
# there are multiple selectors (to differentiate between HTML layouts for distinct request parameters)
# that might produce duplicate parsing data
if key in serp_result and serp_result[key] != value:
    serp_result[key] = value
NikolaiT commented 9 years ago

Actually, the above code is bullshit. I fixed it differently (an MD5 hash digest of the scraped data to detect duplicates). What do you think about it? (See the last commit; it should work now.)
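
For readers who have not seen the commit, hash-based duplicate detection could look roughly like the sketch below. This is not the code from the commit. The helper names result_digest and drop_exact_duplicates are made up for illustration, and the canonical form (repr of the sorted items) is an assumption.

```python
import hashlib


def result_digest(serp_result):
    """Return an MD5 hex digest over the sorted key/value pairs of one result."""
    # Keys are unique strings, so sorting the items never compares the values.
    canonical = repr(sorted(serp_result.items()))
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()


def drop_exact_duplicates(results):
    """Keep the first occurrence of each result whose fields are all identical."""
    seen = set()
    unique = []
    for result in results:
        digest = result_digest(result)
        if digest not in seen:
            seen.add(digest)
            unique.append(result)
    return unique
```

A digest like this only matches results whose fields are all equal. Two results that share a link but differ in a single field, such as the snippet, hash differently and are both kept.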

leadscloud commented 9 years ago

The fix does not take effect. I added some not very good code to solve it; your code only handles results that are entirely duplicated:

# Reject the result if any parsed field is None.
have_null_item = any(v is None for v in serp_result.values())

# Append only complete, non-empty results whose link has not been seen yet.
if not have_null_item and serp_result:
    link = serp_result['link']
    if not any(e['link'] == link for e in search_results[result_type]):
        search_results[result_type].append(serp_result)

The precondition is that selector_dict must contain a correct selector, so that the right data can be parsed.

Your code only handles fully duplicated parsing results, but these results are not true duplicates, because only part of each one is repeated.

{'link': 'http://best-seo-tools.net/',
              'snippet': None,
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},

Only the snippet differs. I need the result with the snippet text, not the result whose snippet is None.

{'link': 'http://best-seo-tools.net/',
              'snippet': 'SEO Company : Spider view This tool ... Site Ranking. Website Cloaking Check This tool lets you check a list of urls for googlebot cheaters : SEO Company ...',
              'title': 'BEST SEO TOOLS',
              'visible_link': 'best-seo-tools.net'},

I think every item in search_results[result_type] must have a link, and it must not be None; otherwise the item is not needed. Besides, each link is unique per page.
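
A variation on the same idea, given here only as a sketch: treat the link as the key, skip results without a link, and fill in fields that were None from a later copy of the same result instead of discarding incomplete results. The helper name merge_by_link is hypothetical and not part of GoogleScraper.

```python
def merge_by_link(results):
    """Collapse results that share a link, filling None fields from later copies."""
    merged = {}  # link -> result dict; dicts keep insertion order in Python 3.7+
    for result in results:
        link = result.get('link')
        if link is None:
            continue  # a result without a link is treated as unusable
        if link not in merged:
            merged[link] = dict(result)
            continue
        kept = merged[link]
        for key, value in result.items():
            # Only overwrite fields that are still missing.
            if kept.get(key) is None and value is not None:
                kept[key] = value
    return list(merged.values())


if __name__ == '__main__':
    parsed = [
        {'link': 'http://best-seo-tools.net/', 'snippet': None, 'title': 'BEST SEO TOOLS'},
        {'link': 'http://best-seo-tools.net/', 'snippet': 'SEO Company ...', 'title': 'BEST SEO TOOLS'},
    ]
    print(merge_by_link(parsed))
```

Unlike rejecting every result that contains a None, this keeps a result whose snippet is missing under every selector set, as long as it has a link.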

NikolaiT commented 9 years ago

Will look into it in a few hours. Edit:

Guess I am getting your point now. Allowing the parts of the results to be None was a conscious decision of mine, I just wasn't sure that it was a good one.

As it stands, your code rejects a result if any item is None; this means it will throw away the parsed result if one of

But what if a search engine suddenly decides that a result need not consist of all the above listed elements? For example, I tested with some keywords and got this result:

'results': [{'link': 'http://www.youtube.com/watch%3Fv%3DZ5vZISMNk_I',
==>              'snippet': None,
              'title': 'What If Your Mom Was Your GPS? - YouTube',
              'visible_link': 'www.youtube.com/watch?v=Z5vZISMNk_I'},
             {'link': 'http://www.youtube.com/watch%3Fv%3Dx3BBkrG55AE',
==>              'snippet': None,
              'title': 'If you are fighting with your parents, please '
                       'watch this. - YouTube',
              'visible_link': 'www.youtube.com/watch?v=x3BBkrG55AE'},
             {'link': 'http://www.quibblo.com/quiz/3fHIrW3/Does-Your-Mom-Hate-You',
              'snippet': 'Do you think your mom hates you or loves '
                         'you........? Take this ... She calls me fat \n'
                         'all the time if i make her mad which i dont mean '
                         'too......it makes me want to cry\xa0...',
              'title': 'Quiz: Does Your Mom Hate You?',
              'visible_link': 'www.quibblo.com/quiz/3fHIrW3/Does-Your-Mom-Hate-You'},
             {'link': 'http://www.amazon.de/What-Your-Made-Raisin-Buns/dp/1897174039',
              'snippet': 'Kommentar: Versand aus den USA. Lieferungszeit '
                         'ca. 2-3 Wochen. Wir bieten \n'
                         'Kundenservice auf Deutsch! Geringe '
                         'Abnutzungserscheinungen und minimale\xa0...',
              'title': 'What If Your Mom Made Raisin Buns?: Amazon.de: '
                       'Catherine ...',
              'visible_link': 'www.amazon.de/What-Your-Made-Raisin-Buns/.../1897174039'},
             {'link': 'http://teenshealth.org/teen/your_mind/Parents/texting_mom.html',
              'snippet': "What can I do if I've asked my mom to stop "
                         'texting and driving and made it clear ... \n'
                         "You've probably already made comments to your "
                         'mom in the car about her\xa0...',
              'title': "What If a Parent Won't Stop Texting While Driving? "
                       '- TeensHealth',
              'visible_link': 'teenshealth.org/teen/your_mind/Parents/texting_mom.html'},
             {'link': 'http://teenshealth.org/teen/your_mind/Parents/talk_depression.html',
              'snippet': "If you're like most people, you probably wish "
                         'your parent would start the \n'
                         "conversation. Sometimes a parent will ask what's "
                         'wrong. Much of the time, \n'
                         "though, it's up\xa0...",
              'title': 'Talking to Parents About Depression - TeensHealth',
              'visible_link': 'teenshealth.org/teen/your_mind/Parents/talk_depression.html'},
             {'link': 'http://www.creativebookpublishing.ca/en/index.cfm%3Fpid%3D58%26CatID%3D55%26InvID%3D321',
              'snippet': 'What if Your Mom Made Raisin Buns? ... if Your '
                         'Mom Made Raisin Buns? is the \n'
                         'story of a little boy who gets a bit too '
                         "involved in his mom's raisin bun baking.",
              'title': 'What if Your Mom Made Raisin Buns? - Creative Book '
                       'Publishing',
              'visible_link': 'www.creativebookpublishing.ca/en/index.cfm?pid=58...55...'},
             {'link': 'http://www.buzzfeed.com/daves4/minaj-texting',
              'snippet': "17 Oct 2014 ... It's a question we've all been "
                         'asked at one point in our life or another: what '
                         'would \n'
                         'happen if you only texted your mom using lyrics '
                         'to Nicki\xa0...',
              'title': 'What Happens If You Text Your Mom Using Only The '
                       'Lyrics To ...',
              'visible_link': 'www.buzzfeed.com/daves4/minaj-texting'},
             {'link': 'https://what-if.xkcd.com/107/',
              'snippet': "Since you didn't specify where in New Jersey "
                         "your mother lives, I'm going to \n"
                         "assume she's in Hackensack, because that's where "
                         "Miss Teschmacher's mother\n"
                         '\xa0...',
              'title': 'Letter to Mom - What If? - xkcd',
              'visible_link': 'https://what-if.xkcd.com/107/'},
             {'link': 'http://kidshealth.org/kid/grow/girlstuff/period_school.html',
              'snippet': "If you haven't had your period yet, talk to "
                         'someone who can help you get your \n'
                         'supplies together. This might be your mom, an '
                         'older female relative, or \n'
                         'whomever\xa0...',
              'title': 'Getting Your Period at School - KidsHealth',
              'visible_link': 'kidshealth.org/kid/grow/girlstuff/period_school.html'},
             {'link': 'http://www.wikihow.com/Deal-With-an-Alcoholic-Parent',
              'snippet': 'You can encourage seeking therapy for the '
                         "depression but don't be discouraged \n"
                         'or surprised if your parent refuses to entertain '
                         "this idea\x96it's fairly confronting as\xa0...",
              'title': 'How to Deal With an Alcoholic Parent: 11 Steps '
                       '(with Pictures)',
              'visible_link': 'www.wikihow.com/Deal-With-an-Alcoholic-Parent'},
             {'link': 'http://www.wikihow.com/Deal-With-Your-Parents-Shouting-at-You',
              'snippet': 'Before yelling is necessary, change your '
                         'behavior so your parents... ... Be sincere\n'
                         ", even if you don't think you did anything "
                         'wrong. Deal With Your Parents\xa0...',
              'title': 'How to Deal With Your Parents Shouting at You: 9 '
                       'Steps',
              'visible_link': 'www.wikihow.com/Deal-With-Your-Parents-Shouting-at-You'},
             {'link': 'http://kidshealth.org/teen/your_mind/families/coping_alcoholic.html',
              'snippet': "If you're like most teens, your life is probably "
                         'filled with emotional ups and downs, \n'
                         "regardless of what's happening at home. Add a "
                         'parent with a drinking problem\xa0...',
              'title': 'Coping With an Alcoholic Parent - KidsHealth',
              'visible_link': 'kidshealth.org/teen/your_mind/families/coping_alcoholic.html'},
             {'link': 'http://www.projecthopeful.org/2014/07/16/mom-t-rex',
              'snippet': '16 Jul 2014 ... What if your mother was a '
                         'Tyrannosaurus Rex? You desperately need your '
                         'mom \n'
                         'to keep you safe. You turn to her when you are '
                         'afraid, you rely\xa0...',
              'title': 'What if your mom was a T-Rex? - Project Hopeful',
              'visible_link': 'www.projecthopeful.org/2014/07/16/mom-t-rex'},
             {'link': 'http://rhrealitycheck.org/article/2008/05/09/what-if-your-mother-had-aborted-you/',
              'snippet': '9 May 2008 ... Far too much is made of a '
                         "mother's obligations to her children and far too "
                         'little of \n'
                         "a child's love for her mother. If fetuses could "
                         'love, I think they\xa0...',
              'title': '"What If Your Mother Had Aborted You?" - RH '
                       'Reality Check',
              'visible_link': 'rhrealitycheck.org/article/.../what-if-your-mother-had-aborted-you/'},
             {'link': 'http://www.teenvogue.com/advice/family-advice/2013-11/fighting-with-your-mom',
              'snippet': '13 Nov 2013 ... Moms, right? Sometimes they feel '
                         'like your best friend, sometimes you wonder if \n'
                         'you were adopted by an alien whose sole mission '
                         'is to ruin\xa0...',
              'title': "Fighting with Your Mom Is Inevitable, So Here's "
                       'What to Do When It ...',
              'visible_link': 'www.teenvogue.com/advice/family-advice/2013.../fighting-with-your-mom'},
             {'link': 'http://www.quora.com/What-if-your-mom-was-dad-and-vice-versa',
              'snippet': 'Answer 1 of 2: Then I\'d have to say, "Dad, '
                         "where's my bag , where's my blue shirt \n"
                         ', where is that folder ...?" and "Mom, where is '
                         'dad?".',
              'title': 'What if your mom was dad and vice versa? - Quora',
              'visible_link': 'www.quora.com/What-if-your-mom-was-dad-and-vice-versa'},
             {'link': 'http://www.entrepreneurs-journey.com/12015/what-happens-when-your-mother-dies/',
              'snippet': "22 Apr 2013 ... Mum didn't push me to do things "
                         "if I didn't want to \x96 and a very shy child "
                         '... you \n'
                         'were coming from and what you had been through '
                         'in your life.',
              'title': 'What Happens When Your Mother Dies - '
                       'Entrepreneurs-Journey.com',
              'visible_link': 'www.entrepreneurs-journey.com/.../what-happens-when-your-mother-dies/'},
             {'link': 'http://love.allwomenstalk.com/things-to-do-if-your-parents-dont-approve-of-your-relationship',
              'snippet': "If your parents don't approve of a relationship "
                         "that you're in, it can make things \n"
                         'really difficult between you and your '
                         '#boyfriend! One of the first things that most\xa0'
                         '...',
              'title': "7 Things to do if Your Parent's Don't Approve of "
                       'Your Relationship ...',
              'visible_link': 'love.allwomenstalk.com/things-to-do-if-your-parents-dont-approve-of-your- '
                              'relationship'},
             {'link': 'http://www.slate.com/articles/double_x/doublex/2013/06/rick_perry_says_wendy_davis_should_be_pro_life_because_her_mother_didn.html',
              'snippet': '28 Jun 2013 ... Texas Gov. Rick Perry is a '
                         'plainspoken man, but on Thursday he waded into '
                         'an \n'
                         'ageless existential debate. Speaking to the '
                         'National Right to\xa0...',
              'title': 'What if Your Mother Had Aborted You? - Slate',
              'visible_link': 'www.slate.com/.../rick_perry_says_wendy_davis_should_be_pro_life_ '
                              'because_her_mother_didn.html'},
             {'link': 'https://studentaid.ed.gov/sites/default/files/fafsa-parent.pdf',
              'snippet': "Aid (FAFSASM), and you're supposed to put "
                         'information about your parents on \n'
                         'the application. But what if your parents are '
                         'divorced? Remarried? What if you\xa0...',
              'title': 'Who Is My \x93Parent\x94 When I Fill Out the '
                       'FAFSASM - Federal Student Aid',
              'visible_link': 'https://studentaid.ed.gov/sites/default/files/fafsa-parent.pdf'},
             {'link': 'http://www.themotherco.com/2013/02/when-parents-yell-at-children/',
              'snippet': '7 Feb 2013 ... If kids feel parents have their '
                         'best interest at heart (and paying ... Your '
                         'child might \n'
                         "say \x93Don't worry, Mom, I'll be ready to take "
                         'my bath then.',
              'title': 'What Happens When Parents Yell at Children « '
                       'TheMotherCompany',
              'visible_link': 'www.themotherco.com/.../when-parents-yell-at-children/'},
             {'link': 'http://www.popsugar.com/moms/How-Tell-Your-Daughter-Mean-Girl-34598598',
              'snippet': "30 Nov 2014 ... There's a mean girl in just "
                         'about every school, clique, band, soccer team, \n'
                         'religious education class, or carpool. This type '
                         'of bullying is scary for\xa0...',
              'title': 'How to Tell If Your Daughter Is the Mean Girl | '
                       'POPSUGAR Moms',
              'visible_link': 'www.popsugar.com/moms/How-Tell-Your-Daughter-Mean-Girl-34598598'},
             {'link': 'http://wol.jw.org/en/wol/d/r1/lp-e/1102008123',
              'snippet': 'MILLIONS of youths endure the daily turmoil of '
                         "living with a parent who's hooked \n"
                         'on drugs or alcohol. If one of your parents is '
                         'enslaved to such an addiction,\xa0...',
              'title': '23 What if My Parent Is Addicted to Drugs or '
                       'Alcohol? - Watchtower ...',
              'visible_link': 'wol.jw.org/en/wol/d/r1/lp-e/1102008123'},
             {'link': 'http://www.gotoquiz.com/does_your_mother_love_you',
              'snippet': 'This quiz is for people if they really want to '
                         'know if your mom has love for you. \n'
                         'How do you know if your mother loves? Will love '
                         'is a complex word and mother\xa0...',
              'title': 'Does your mother love you? - GoToQuiz.com',
              'visible_link': 'www.gotoquiz.com/does_your_mother_love_you'},
             {'link': 'http://www.safekidsbc.ca/teens_report.htm',
              'snippet': 'If the worker does thinks that it is not abuse '
                         'or neglect, but there are problems \n'
                         'that need to be fixed, he may telephone your '
                         'parents or go out to meet the family\n'
                         '\xa0...',
              'title': 'What happens when I report child abuse? - Child '
                       'Abuse Prevention',
              'visible_link': 'www.safekidsbc.ca/teens_report.htm'},
             {'link': 'http://community.babycenter.com/post/a53546203/what_if_your_dh_wanted_you_to_co_your_mom',
              'snippet': '15 Nov 2014 ... What would you guys say if your '
                         'DH wanted to CO your mom because of some \n'
                         'dumb reason(the only thing my mom has done that '
                         'he could be\xa0...',
              'title': 'What if your DH wanted you to CO your mom? - '
                       'BabyCenter',
              'visible_link': 'community.babycenter.com/.../what_if_your_dh_wanted_you_to_co_your_ '
                              'mom'},
             {'link': 'http://web.mit.edu/adorai/www/seuss-technical-writing.html',
              'snippet': 'If your cursor finds a menu item followed by a '
                         'dash, And the double-clicking ... \n'
                         'want to RAM your ROM. Quickly turn off the '
                         'computer and be sure to tell your \n'
                         'mom!',
              'title': 'What if Dr Seuss Did Technical Writing? - MIT',
              'visible_link': 'web.mit.edu/adorai/www/seuss-technical-writing.html'},
             {'link': 'http://money.cnn.com/2014/06/19/pf/inherited-debt-adult-children/',
              'snippet': '19 Jun 2014 ... debt adult children If your '
                         'parents die before paying off their debts, you '
                         'may worry \n'
                         'creditors will come after you. Usually they '
                         "can't, but not\xa0...",
              'title': "Can you inherit your dead parent's debts? - Jun. "
                       '19, 2014',
              'visible_link': 'money.cnn.com/.../inherited-debt-adult-children/'},
             {'link': 'https://www.tumblr.com/tagged/what-if-your-mom-was-kony',
              'snippet': 'Post anything (from anywhere!), customize '
                         'everything, and find and follow what \n'
                         'you love. Create your own Tumblr blog today.',
              'title': 'what if your mom was kony | Tumblr',
              'visible_link': 'https://www.tumblr.com/tagged/what-if-your-mom-was-kony'},
             {'link': 'http://www.integrativecanceranswers.com/what-if-your-mom-had-cancer-should-you-be-worried/',
              'snippet': 'If Your Mother Had Cancer, Are You At Risk? What '
                         'is The Role of Family History? \n'
                         'Should You Get Tested? There is a link between '
                         'mothers, daughters and\xa0...',
              'title': 'What If Your Mom Had Cancer? | Should You Be '
                       'Worried ...',
              'visible_link': 'www.integrativecanceranswers.com/what-if-your-mom-had-cancer-should- '
                              'you-be-worried/'},
             {'link': 'https://twitter.com/menshumor/status/513017236382941184',
              'snippet': '19 Sep 2014 ... What if your mom sent you a '
                         'photo like this...? '
                         'https://cards.twitter.com/cards/\n'
                         '5vss79/4nzf \x85'
                         ' 0 replies 25 retweets 56 favorites. Reply. '
                         'Retweet\xa0...',
              'title': 'Men\'s Humor on Twitter: "What if your mom sent '
                       'you a photo like this ...',
              'visible_link': 'https://twitter.com/menshumor/status/513017236382941184'},
             {'link': 'http://thoughtcatalog.com/amanda-charest/2013/10/9-stages-you-will-go-through-when-you-find-out-your-dad-is-cheating/',
              'snippet': '3 Oct 2013 ... You tell your mom. In an e-mail, '
                         'of course. Because, what if you actually just \n'
                         "made this all up and it's really all just a big "
                         'misunderstanding?',
              'title': '9 Stages You Will Go Through When You Find Out '
                       'Your Dad Is ...',
              'visible_link': 'thoughtcatalog.com/.../9-stages-you-will-go-through-when-you-find-out- '
                              'your-dad-is-cheating/'},
             {'link': 'http://www.cnet.com/news/mom-creates-app-so-that-kids-cant-ignore-her-calls/',
              'snippet': '17 Aug 2014 ... When they do, you can unlock '
                         'their phone if you choose to do so. ... After '
                         'all, it \n'
                         "simply isn't good for your image to have to call "
                         'your mom back\xa0...',
              'title': "Mom creates app so kids can't ignore her calls - "
                       'CNET',
              'visible_link': 'www.cnet.com/.../mom-creates-app-so-that-kids-cant-ignore-her-calls/'},
             {'link': 'http://www.parents.com/pregnancy/signs/symptoms/signs-you-may-be-pregnant/',
              'snippet': "Wondering if you've got a baby on board? Pay "
                         'close attention to your body! And if \n'
                         'you spot a few of the following symptoms -- and '
                         'your period is MIA -- it may be\xa0...',
              'title': '13 Signs of Pregnancy - Parents.com',
              'visible_link': 'www.parents.com/pregnancy/signs/.../signs-you-may-be-pregnant/'},
             {'link': 'http://abovetheinfluence.com/when-a-parent-uses/',
              'snippet': "But if you are worried about your parent's "
                         'drinking or drug use, he or she might \n'
                         'have a disease \x96 alcoholism or a drug '
                         'addiction. These illnesses can cause a\xa0...',
              'title': 'When a Parent Uses - Above the Influence',
              'visible_link': 'abovetheinfluence.com/when-a-parent-uses/'},
             {'link': 'http://writers.ns.ca/library-database/fiction-children/what-if-your-mom-made-raisin-buns.html',
              'snippet': 'What If Your Mom Made Raisin Buns? Genre: '
                         'Fiction, Children. Publisher: \n'
                         'Tuckamore ... Your shopping cart is empty. Home '
                         '· Blog · Library Database · \n'
                         'About\xa0...',
              'title': "What If Your Mom Made Raisin Buns? | Writers' "
                       'Federation of Nova ...',
              'visible_link': 'writers.ns.ca/library.../what-if-your-mom-made-raisin-buns.html'},
             {'link': 'http://freethoughtblogs.com/amilliongods/2013/07/03/rick-perry-what-if-your-mother-aborted-you/',
              'snippet': '3 Jul 2013 ... Why are you proud that your '
                         'parents had to work to death to keep you in '
                         'spuds? \n'
                         "Surely if you were a foetus with no mind of it's "
                         'own and had no\xa0...',
              'title': 'Rick Perry \x96 What if Your Mother Aborted You - '
                       'Freethought Blogs',
              'visible_link': 'freethoughtblogs.com/.../rick-perry-what-if-your-mother-aborted-you/'},
             {'link': 'https://answers.yahoo.com/question/index%3Fqid%3D20141012064233AAEMRiV',
              'snippet': 'Then your probably fat.',
              'title': 'What if your mom calls you fat? - Yahoo Answers',
              'visible_link': 'https://answers.yahoo.com/question/index?qid...'},
             {'link': 'http://www.collegehumor.com/post/6985876/if-your-mom-wrote-drug-psas',
              'snippet': '18 Aug 2014 ... Designed by Rebecca Caplan, '
                         'images courtesy of Shutterstock.com View "If '
                         'Your \n'
                         'Mom Wrote Drug PSAs" and more funny posts on\xa0'
                         '...',
              'title': 'If Your Mom Wrote Drug PSAs - CollegeHumor Post',
              'visible_link': 'www.collegehumor.com/.../if-your-mom-wrote-drug-psas'},
             {'link': 'http://textfiles.com/uploads/kill_parents.txt',
              'snippet': 'Below are some nice simple and creative ways to '
                         'kill your parents, but first if you \n'
                         'are a little bit unsure whether to kill your '
                         'parents or not or you just need a good\xa0...',
              'title': 'How To Kill Your Parents: The Complete Guide. - '
                       'Textfiles',
              'visible_link': 'textfiles.com/uploads/kill_parents.txt'},
             {'link': 'http://www.agingcare.com/Articles/Ten-Reasons-Why-Your-Aging-Parent-May-Not-Be-Eating-Properly-And-What-You-Can-Do-About-It-133239.htm',
              'snippet': 'Proper nutrition is vital to your parent for '
                         'maintaining health, retaining and \n'
                         'building bone mass and, importantly, to enable '
                         'medications to work effectively in \n'
                         'the\xa0...',
              'title': "What to Do if Your Elderly Parent Won't Eat - "
                       'AgingCare.com',
              'visible_link': 'www.agingcare.com/.../Ten-Reasons-Why-Your-Aging-Parent-May-Not-Be- '
                              'Eating-Properly-And-What-You-Can-Do-Abou...'},
             {'link': 'http://kidsspace.torontopubliclibrary.ca/genBook85523_17343.html',
              'snippet': 'There are many things to do with a raisin bun '
                         'besides eat it.',
              'title': 'What If Your Mom Made Raisin Buns?: KidsSpace: '
                       'Toronto Public ...',
              'visible_link': 'kidsspace.torontopubliclibrary.ca/genBook85523_17343.html'},
             {'link': 'http://psychcentral.com/blog/archives/2013/06/19/teenage-pregnancy-10-tips-for-telling-your-parents/',
              'snippet': "19 Jun 2013 ... I'm assuming many things about "
                         'the relationship you have with your parents. '
                         'You \n'
                         'may be closer to one than the other, but if you '
                         'do want to tell\xa0...',
              'title': 'Teenage Pregnancy: 10 Tips for Telling Your '
                       'Parents - Psych Central',
              'visible_link': 'psychcentral.com/.../teenage-pregnancy-10-tips-for-telling-your-parents/'},
             {'link': 'http://www.workingmother.com/blogs/college-mom/what-if-your-child039s-college-major-won039t-get-them-job',
              'snippet': '20 Nov 2014 ... I am a working mom who has one '
                         'child in college and another in high ... If '
                         'your \n'
                         "college student doesn't have a plan but insists "
                         'on majoring in\xa0...',
              'title': "What If Your Child's College Major Won't Get Them "
                       '... - Working Mother',
              'visible_link': 'www.workingmother.com/...mom/what-if-your-child039s-college-major- '
                              'won039t-get-them-job'},
             {'link': 'http://www.cancer.org/cancer/cancercauses/geneticsandcancer/heredity-and-cancer',
              'snippet': '25 Jun 2014 ... You have 2 copies of most genes '
                         '\x96 one from each parent. ... For example, if '
                         'both \n'
                         "relatives are your mother's brothers it means "
                         'more than if one\xa0...',
              'title': 'Family Cancer Syndromes - American Cancer Society',
              'visible_link': 'www.cancer.org/cancer/cancercauses/.../heredity-and-cancer'},
             {'link': 'http://www.cosmopolitan.com/sex-love/advice/a2731/What-If-His-Parents-Dont-Like-You/',
              'snippet': "14 Nov 2008 ... Meeting your boyfriend's "
                         'parents? Cosmo has some tips for dealing with '
                         'his \n'
                         'family.',
              'title': "What If His Parents Don't Like You? - Cosmopolitan",
              'visible_link': 'www.cosmopolitan.com/sex.../What-If-His-Parents-Dont-Like-You/'},
             {'link': 'http://rcspirituality.org/ask-a-priest-what-if-my-mom-wouldnt-want-me-to-go-to-confession/',
              'snippet': "In this way you aren't \x93talking about "
                         'Catholicism\x94 with your mom; rather, you are \n'
                         'simply living your faith. If your mom finds out '
                         'and confronts you, then you could\xa0...',
              'title': "\x93Ask a Priest: What if my mom wouldn't want me "
                       'to go to confession ...',
              'visible_link': 'rcspirituality.org/ask-a-priest-what-if-my-mom-wouldnt-want-me-to-go-to- '
                              'confession/'},
             {'link': 'http://www.circleofmoms.com/question/what-do-your-teen-sneaking-out-1701629',
              'snippet': "Generally, moms set curfews for their teens' "
                         "safety and well-being, so it isn't just ... \n"
                         'If your teenager wants to be treated like an '
                         'adult, then they need to learn how\xa0...',
              'title': 'What to do if your teen is sneaking out - Circle '
                       'of Moms',
              'visible_link': 'www.circleofmoms.com/.../what-do-your-teen-sneaking-out-1701629'},
             {'link': 'http://www.sodahead.com/living/what-if-your-mom-was-a-godless-liberal-pro-choice-baby-aborter-who-had-abortions-every-time-she-be/question-4563615/',
              'snippet': '29 Oct 2014 ... Then you probably wouldnt be '
                         'reading or seeing this reply, Ummm great \n'
                         'question, She was and she is She just wasnt '
                         'committed to her politics\xa0...',
              'title': 'What if your mom was a Godless, liberal, '
                       'pro-choice baby-aborter ...',
              'visible_link': 'www.sodahead.com/...if-your-mom.../question-4563615/'}]}

Especially this type of result item is important:

             {'link': 'http://www.youtube.com/watch%3Fv%3Dx3BBkrG55AE',
              'snippet': None,
              'title': 'If you are fighting with your parents, please '
                       'watch this. - YouTube',
              'visible_link': 'www.youtube.com/watch?v=x3BBkrG55AE'},

As you can see, the snippet is None because it is a YouTube link, although the link is unique. This means such results are thrown away even though they might be interesting.

I have another idea: what do you think about discarding a result only if its link is None or its link is a duplicate? Then we won't miss results like the one above.

# only add items that have not None links.
# Avoid duplicates. Detect them by the link.
# If statement below: Lazy evaluation. The more probable case first.
if serp_result['link'] is not None and \
         not [e for e in self.search_results[result_type] if e['link'] == serp_result['link']]:
    self.search_results[result_type].append(serp_result)
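
The same check can be sketched outside the parser class. This is only an illustration, assuming results are plain dicts with a 'link' key; the function name add_result and the seen_links set are hypothetical and not part of GoogleScraper. Tracking seen links in a set avoids rescanning the whole result list for every new item.

```python
def add_result(results, seen_links, serp_result):
    """Append serp_result unless its link is None or was already seen."""
    link = serp_result.get('link')
    if link is None or link in seen_links:
        return False
    seen_links.add(link)
    results.append(serp_result)
    return True


results, seen = [], set()
# A result with a None snippet but a unique link is kept.
add_result(results, seen, {'link': 'http://www.youtube.com/watch%3Fv%3Dx3BBkrG55AE', 'snippet': None})
# The same link again is dropped as a duplicate.
add_result(results, seen, {'link': 'http://www.youtube.com/watch%3Fv%3Dx3BBkrG55AE', 'snippet': 'dup'})
# A result without a link is dropped.
add_result(results, seen, {'link': None, 'snippet': 'no link'})
print(len(results))
```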

Happy to hear from you.

NikolaiT commented 9 years ago

Bump, so you will notice the above edit.

leadscloud commented 9 years ago

Neither approach is a perfect solution to the problem. Your solution is better than mine, but it may happen that all 10 results have a null snippet.

leadscloud commented 9 years ago

Maybe try this code. I tested it and it works:

for result_type, selector_class in selector_dict.items():

            self.search_results[result_type] = []

            # start modify
            results = {}
            for selector_specific, selectors in selector_class.items():

                results = self.dom.xpath(
                    css_to_xpath('{container} {result_container}'.format(**selectors))
                )
                # if the current selector yields no results, try the next one
                if not results:
                    continue
                else:
                    break

            for index, result in enumerate(results):
                serp_result = {}

                for selector_specific, selectors in selector_class.items():
                    to_extract = set(selectors.keys()) - {'container', 'result_container'}
                    selectors_to_use = {key: selectors[key] for key in to_extract if key in selectors.keys()}

                    for key, selector in selectors_to_use.items():
                        value = None
                        if selector.endswith('::text'):
                            try:
                                value = result.xpath(css_to_xpath(selector.split('::')[0]))[0].text_content()
                            except IndexError as e:
                                pass
                        else:
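                            # re.search returns None when the selector has no ::attr(...) suffix
                            match = re.search(r'::attr\((?P<attr>.*)\)$', selector)
                            attr = match.group('attr') if match else None
                            if attr: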
                                try:
                                    value = result.xpath(css_to_xpath(selector.split('::')[0]))[0].get(attr)
                                except IndexError as e:
                                    pass
                            else:
                                try:
                                    value = result.xpath(css_to_xpath(selector))[0].text_content()
                                except IndexError as e:
                                    pass
                        if value is not None or key not in serp_result.keys():
                            serp_result[key] = value
                if serp_result:
                    self.search_results[result_type].append(serp_result)
            # end modify

NikolaiT commented 9 years ago

That looks good. Will test it :)

# if the current selector yields no results, try the next one
if not results:
    continue
else:
    break

This logic means the following inner loop never executes, whatever the variable results contains. Each iteration will either break out of the current loop or skip to the next iteration.

Proof:


for i in range(100):
    val = (i % 2) == 1 # alternates between False and True
    if not val:
        continue
    else:
        break
    # will never execute
    print('will never print this')

Therefore I will not use the above code.

Many greetings

leadscloud commented 9 years ago

This code only fetches the results; nothing else needs to execute inside that loop.

for selector_specific, selectors in selector_class.items():

    results = self.dom.xpath(
        css_to_xpath('{container} {result_container}'.format(**selectors))
    )
    # if the current selector yields no results, try the next one
    if not results:
        continue
    else:
        break
    # nothing follows here, so no further code needs to run
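
In other words, the loop only picks the first selector set that matches anything. A standalone sketch of that behaviour, where first_matching, fake_dom and the lambda are hypothetical stand-ins for the real self.dom.xpath(css_to_xpath(...)) call:

```python
def first_matching(selector_class, query):
    """Return (name, results) for the first selector set that matches anything."""
    results = []
    chosen = None
    for name, selectors in selector_class.items():
        results = query(selectors)
        if not results:
            continue  # nothing matched, try the next selector set
        chosen = name
        break  # keep these results and stop trying further selector sets
    return chosen, results


# Hypothetical page on which only the '.b_algo' container matches.
fake_dom = {'.b_algo': ['r1', 'r2']}
selector_class = {
    'us_ip': {'result_container': '.sb_add'},
    'de_ip': {'result_container': '.b_algo'},
}
chosen, results = first_matching(
    selector_class,
    lambda selectors: fake_dom.get(selectors['result_container'], []),
)
print(chosen, results)
```

The loop that extracts fields then runs once over results, after the selection loop has finished, so each result is added only once.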
NikolaiT commented 9 years ago

Yes, if results is None, then the loop:

for index, result in enumerate(results):

will never run. So there is no need for continue. And when results is something valid, the loop will break altogether. I don't understand that.

NikolaiT commented 9 years ago

Doubled results are solved. Closing this issue.