arshaw / scrapemark

Super-convenient web scraping in Python

Nested loops are broken in scrapemark 0.9 #6

Open arshaw opened 13 years ago

arshaw commented 13 years ago

Reported by adb...@gmail.com, Dec 26, 2009

What steps will reproduce the problem?

Run the nested loops example from http://arshaw.com/scrapemark/docs/examples/

What is the expected output? What do you see instead?

Expected output is: {'days': [{'number': 1, 'points': [5.6, 24.5]}, {'number': 2, 'points': [1.1, 12.8]}, {'number': 3, 'points': [2.4, 5.67]}]}

Instead, you get: {'days': [{'points': [5.6], 'number': 1}, {'points': [24.5]}, {'points': [1.1], 'number': 2}, {'points': [12.8]}, {'points': [2.4], 'number': 3}, {'points': [5.67], 'number': 0}]}

What version of the product are you using? On what operating system?

v0.9

Please provide any additional information below.

This is a regression from scrapemark.py r2, which works fine.

mtaran commented 13 years ago

scrapemark would be the absolute best template-based HTML scraper if it weren't for this bug! I really hope you have a chance to fix it soon. I tried my hand at it, but changing _merge_captures alone didn't seem to be enough: it looks like it gets called both when the master and slave dicts should be fully merged and when they shouldn't.

I also tried converting your examples into doctest-compatible docstrings, like so:

'''

Scrape some text:

>>> scrape("""
...    <title>:: {{ page_title }}</title>
...    """,
...    html)
{'page_title': u'The Page Title'}

Scrape some text (quick version):

>>> scrape("""
...    <title>:: {{ }}</title>
...    """,
...    html)
u'The Page Title'

Loop over certain divs, scrape a list:

>>> scrape("""
...    <body>
...    {*
...        <div class='section' id='{{ [section_ids] }}' />
...    *}
...    </body>
...    """,
...    html)
{'section_ids': [u'content', u'footer']}

Scrape text before a certain element:

>>> scrape("""
...    <div id='content'>
...    {{ before_table }}
...    <table />
...    </div>
...    """,
...    html)
{'before_table': u'Look at these data points'}

Scrape a column from a table (as a list of ints):

>>> scrape("""
...    <table>
...    <tr />
...    {*
...        <tr>
...        <td>{{ [day_numbers]|int }}</td>
...        </tr>
...    *}
...    </table>
...    """,
...    html)
{'day_numbers': [1, 2, 3]}

Scrape the entire table with nested loops and dot-notation:

>>> scrape("""
...    <table>
...    <tr />
...    {*
...        <tr>
...        <td>{{ [days].number|int }}</td>
...        {*
...            <td>{{ [days].[points]|float }}</td>
...        *}
...        </tr>
...    *}
...    </table>
...    """,
...    html)
{'days': [{'number': 1, 'points': [1.0, 1.5]},
          {'number': 2, 'points': [2.0, 2.5]},
          {'number': 3, 'points': [3.0, 3.5]}]}

'''

which would hopefully make it easier to do regression tests...
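
For what it's worth, here is a rough sketch of how docstrings like that could be run as a regression check. It is Python 3 and uses only the standard doctest module. `scrape_stub` and `run_regression` are made-up names, and the stub returns a canned result instead of calling scrapemark, so treat it as an illustration of the wiring only. The NORMALIZE_WHITESPACE flag lets the expected dict wrap across several lines, as in the last example above.

```python
import doctest


def scrape_stub(pattern, html):
    """Hypothetical stand-in for scrapemark's scrape(); ignores its inputs.

    >>> scrape_stub('{* <td>{{ [days].[points]|float }}</td> *}', '<table />')
    {'days': [{'number': 1, 'points': [1.0, 1.5]},
              {'number': 2, 'points': [2.0, 2.5]}]}
    """
    # Canned result, only so the doctest has something to compare against.
    return {'days': [{'number': 1, 'points': [1.0, 1.5]},
                     {'number': 2, 'points': [2.0, 2.5]}]}


def run_regression(func):
    """Run the doctest examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # The globals dict exposes func to the examples under its own name.
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(
        verbose=False, optionflags=doctest.NORMALIZE_WHITESPACE)
    results = runner.run(test)
    return results.failed, results.attempted


if __name__ == '__main__':
    failed, attempted = run_regression(scrape_stub)
    print('%d of %d examples failed' % (failed, attempted))
```

Swapping the stub for the real scrape function (and a fixed html fixture) should, in principle, turn the nested-loop example into a failing test until the bug is fixed.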

Anyways, I'd be really happy if you could get this fixed sometime :D

Tell me if there's anything I could do to help!