arshaw / scrapemark

Super-convenient web scraping in Python

Problem with nested loop #18

Open phoebebright opened 12 years ago

phoebebright commented 12 years ago

This could be a user error, but I have tried every permutation I can think of without success.
I'm using the version of scrapemark.py updated on Aug 11, 2011.

Here is an example. If I pull the nested part out and manually split it by <br>, then scrapemark processes each line correctly, but the nested version only finds the first match.

from scrapemark import scrape

src = ''' <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n\n\n\n\t\n\n\t\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDaily Naps\n\n\n\n\n\n\t\t\n\n

\t\n\n\n\n\t
\t\n\n\t\t\n\ufeff\t\tDailyNaps logo\n\ufeff\t\t
\n\t\t\t\t\n\t\t
\n\t
\n\n\t
\n\t\t\n\ufeff\t\tHome\n\n\t\tHorse Racing Results\n\n\t\tPrevious Results\n\n\t\tPrevious Racecards\n\t\t\n\t\tDailynaps Software\n\n\t\tCheck Horse Odds\n\n\t\tBetting Links\n\n\t\tContact\n\t\t\n\t\tAffiliates\t\t\n\t\t\n\n\t\t
\n\n\t
\n\n\n\n\t
\n\n\t\t
\n\n\t\t\t
\n\n\t\t\t\t
\n\n\t\t\t\t\t
\n\n\t\t\t\t\t\t\n\n\n\n

UK Horse Racing Results

\n\n


Sunday, 10 June 2012


\n

    <div style=\'font-weight:bold\'>Curragh</div>\n
    <b>2:20 : </b>2 Gale Force Ten (J P O\'Brien, 7-2 ); 5 Leitir Mor (K J Manning, 11-10 fav); 4 Hard Yards (C D Hayes, 16-1 ); 8 ran. 6 Newberry Hill (F M Berry, 11-4 2nd-fav); <br>
    <b>2:50 : </b>10 Alsium (C D Hayes, 7-1 ); 3 Cape Of Approval (W Lordan, 2-1 fav); 4 Flying Doha (W J Lee, 7-2 2nd-fav); 15 ran.<br>
    <b>3:20 : </b>7 Kateeva (L F Roche, 14-1 ); 5 Battleroftheboyne (B A Curtis, 12-1 ); 10 Erins Gal (R P Cleary, 20-1 ); 12 ran. 2 Lake George (R P Whelan, 5-1 joint-fav);  3 Allegra Tak (P J Smullen, 5-1 joint-fav); <br>
    <b>3:50 : </b>4 Sharestan (N G McCullagh, 8-11 fav); 2 Defining Year (S Foley, 8-1 ); 7 ran. 7 Learn (C O\'Donoghue, 3-1 2nd-fav); <br>

    <br></p>\n<br><br></font></div>\n\n\n\n\n<!--- middle (main content) column end -->\n\t\t\t\t\t\t<hr class="hide">\n\t\t\t\t\t</div>\n\t\t\t\t</div>\n\t\t\t\t<div id="leftColumn">\n\t\t\t\t\t<div class="inside">\n\t\t\t\t\t\t<!--- left column begin -->\n\t\t\t\t\t\t<div class="vnav">\n<center>\n<iframe allowtransparency="true" src="http://media.paddypower.com/ad.aspx?bid=3079&pid=10060697" \nwidth="120" height="600" marginwidth="0" marginheight="0" hspace="0" vspace="0" \nframeborder="0" scrolling="no"></iframe>\n</center>\n\n\n\t\t\t\t\t\t\t<br />\n\t\t\t\t\t\t\t<br />\n\ufeff<center>\t\t\t\t\t\t\n\n<p></p>\n<p></p>\n\n<a href="http://media.paddypower.com/redirect.aspx?pid=10060697&bid=4403">\n<img src="http://media.paddypower.com/renderimage.aspx?pid=10060697&bid=4403" border=0></img ></a>\n\n<p></p>\n<p></p>\n\n</center>\n<br />\n<br />\n\t\t\t\t\t\t</div>\n\t\t\t\t\t\t<!--- left column end -->\n\t\t\t\t\t\t<hr class="hide">\n\t\t\t\t\t</div>\n\t\t\t\t</div>\n\t\t\t\t<div class="clear"></div>\n\t\t\t</div>\n\t\t\t<div id="rightColumn">\n\t\t\t\t<div class="inside">\n\t\t\t\t\t<!--- right column begin -->\n<p></p>\n<center>\n<iframe src="http://serve.williamhill.com/promoLoadDisplay?member=jpowell79&campaign=DEFAULT&channel=DEFAULT&zone=1471696800&lp=0" style="height:600px;width:120px;" frameborder="0" scrolling="no" MARGINWIDTH="0" MARGINHEIGHT="0" ></iframe>\n</center>\n<p></p>   \n\t\t\t\t   <br />\n\t\t\t\t   <br />\n\n<center>\t\t\t\t\t\t\n\n<p></p>\n<p></p>\n\n<a href="http://media.paddypower.com/redirect.aspx?pid=10060697&bid=3519">\n<img src="http://media.paddypower.com/renderimage.aspx?pid=10060697&bid=3519" border=0></img ></a>\n\n<p></p>\n<p></p>\n\n</center><br />\n<br />\n\t\t\t\t\t<!--- right column end -->\n\t\t\t\t\t<hr class="hide">\n\t\t\t\t</div>\t\t\t\t\n\t\t\t</div>\n\t\t\t<div class="clear"></div>\n\t\t</div>\n\t</div>\t\t\t\n\t<div id="footer" class="inside">\n\t\t<!-- footer begin -->\n\ufeff\t\t<a href="index.php">Home</a>\n\n\t\t<a 
href="results.php">Horse Racing Results</a>\n\n\t\t<a href="previous-results.php">Previous Results</a>\n\n\t\t<a href="previous-racecards.php">Previous Racecards</a>\n\t\t\n\t\t<a href="strategy.php">Dailynaps Software</a>\n\n\t\t<a href="free_bets.php">Check Horse Odds</a>\n\n\t\t<a href="links.php">Betting Links</a>\n\n\t\t<a href="contact.php">Contact</a>\n\t\t\n\t\t<a href="affiliate.php">Affiliates</a>\t\t<!-- footer end -->\n\t\t<hr class="hide">\n\t</div>\n</div>\n</body>\n</html>\n\n
    '''

THIS ONLY RETURNS THE FIRST MATCH

results = scrape("""

UK Horse Racing Results

       <div>
       <b>{{date}}</b>
       {*
       <div>{{ [course] }}</div>
         {*  <b>{{h}}:{{m}} :</b> {{first}}; {{second}}; {{third}}; {{n}} ran <br /> *}
       *}
        </div>
    """,
html = src)

print results

THIS WORKS

results = scrape("""

UK Horse Racing Results

       <div>
       <b>{{date}}</b>
       {*
       <div>{{ [course] }}</div>
        {{ results|html }}
       *}
        </div>
    """,
html = src)

src = results["results"].replace("\n", "")

x = src.split("<br>")
for item in x:
    r = scrape("""
            <b>{{h}}:{{m}} :</b> {{first}}; {{second}}; {{third}}; {{n}} ran
            """,
        item)
    print r

print results
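
For comparison, the manual split step can also be reproduced without scrapemark. The following is only a minimal sketch using the standard library `re` module, assuming every race line has the `<b>h:mm : </b>first; second; third; N ran` shape shown in the sample above. The names `RACE_RE` and `parse_races` are made up for illustration and are not part of scrapemark.

```python
import re

# Hypothetical pattern (not part of scrapemark) for one race line, e.g.
# "<b>2:20 : </b>2 Gale Force Ten (...); 5 Leitir Mor (...); 4 Hard Yards (...); 8 ran."
RACE_RE = re.compile(
    r"<b>\s*(?P<h>\d{1,2}):(?P<m>\d{2})\s*:\s*</b>\s*"
    r"(?P<first>[^;]+);\s*(?P<second>[^;]+);\s*(?P<third>[^;]+);\s*"
    r"(?P<n>\d+)\s+ran"
)


def parse_races(fragment):
    """Split an HTML fragment on <br> and parse each piece as a race line."""
    races = []
    for item in re.split(r"<br\s*/?>", fragment):
        match = RACE_RE.search(item)
        if match:  # skip empty pieces and anything that is not a race line
            races.append({k: v.strip() for k, v in match.groupdict().items()})
    return races


# Two lines copied from the sample page above.
sample = (
    "<b>2:20 : </b>2 Gale Force Ten (J P O'Brien, 7-2 ); "
    "5 Leitir Mor (K J Manning, 11-10 fav); "
    "4 Hard Yards (C D Hayes, 16-1 ); 8 ran. "
    "6 Newberry Hill (F M Berry, 11-4 2nd-fav); <br>"
    "<b>2:50 : </b>10 Alsium (C D Hayes, 7-1 ); "
    "3 Cape Of Approval (W Lordan, 2-1 fav); "
    "4 Flying Doha (W J Lee, 7-2 2nd-fav); 15 ran.<br>"
)

races = parse_races(sample)
for race in races:
    print(race)
```

This yields one dict per race line with the same keys as the template (`h`, `m`, `first`, `second`, `third`, `n`), which is the output I would expect the nested `{* ... *}` loop to produce.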

--------- RESULTS -----

{'third': u'4 Hard Yards (C D Hayes, 16-1 )', 'h': u'2', 'm': u'20', 'n': u'8', 'course': [u'Curragh'], 'second': u'5 Leitir Mor (K J Manning, 11-10 fav)', 'date': u'Sunday, 10 June 2012', 'first': u"2 Gale Force Ten (J P O'Brien, 7-2 )"}
{'third': u'4 Hard Yards (C D Hayes, 16-1 )', 'h': u'2', 'm': u'20', 'n': u'8', 'second': u'5 Leitir Mor (K J Manning, 11-10 fav)', 'first': u"2 Gale Force Ten (J P O'Brien, 7-2 )"}
{'third': u'4 Flying Doha (W J Lee, 7-2 2nd-fav)', 'h': u'2', 'm': u'50', 'n': u'15', 'second': u'3 Cape Of Approval (W Lordan, 2-1 fav)', 'first': u'10 Alsium (C D Hayes, 7-1 )'}
{'third': u'10 Erins Gal (R P Cleary, 20-1 )', 'h': u'3', 'm': u'20', 'n': u'12', 'second': u'5 Battleroftheboyne (B A Curtis, 12-1 )', 'first': u'7 Kateeva (L F Roche, 14-1 )'}
None
None
None
None
None
{'date': u'Sunday, 10 June 2012', 'course': [u'Curragh'], 'results': "\n\n 2:20 : 2 Gale Force Ten (J P O'Brien, 7-2 ); 5 Leitir Mor (K J Manning, 11-10 fav); 4 Hard Yards (C D Hayes, 16-1 ); 8 ran. 6 Newberry Hill (F M Berry, 11-4 2nd-fav);
\n 2:50 : 10 Alsium (C D Hayes, 7-1 ); 3 Cape Of Approval (W Lordan, 2-1 fav); 4 Flying Doha (W J Lee, 7-2 2nd-fav); 15 ran.
\n 3:20 : 7 Kateeva (L F Roche, 14-1 ); 5 Battleroftheboyne (B A Curtis, 12-1 ); 10 Erins Gal (R P Cleary, 20-1 ); 12 ran. 2 Lake George (R P Whelan, 5-1 joint-fav); 3 Allegra Tak (P J Smullen, 5-1 joint-fav);
\n 3:50 : 4 Sharestan (N G McCullagh, 8-11 fav); 2 Defining Year (S Foley, 8-1 ); 7 ran. 7 Learn (C O'Donoghue, 3-1 2nd-fav);
\n\n

\n

"}