Open arshaw opened 13 years ago
scrapemark would be the absolute best template-based HTML scraper if it weren't for this bug! I really hope you have a chance to fix it soon. I tried my hand at it, but just changing _merge_captures didn't seem to be enough: it looks like it gets called both when the master and slave dicts should be fully merged and when they shouldn't.
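To illustrate the distinction, here is a minimal sketch of the two behaviours. It is not scrapemark's actual code: `merge_into_current` and `append_as_new_item` are hypothetical names, and the sample values are taken from the report's output. A full merge folds an inner-loop capture into the current item, while appending it as a new item gives the split shape seen in the report.

```python
def merge_into_current(master, slave):
    """Full merge: fold the slave's captures into the master dict in place.

    Hypothetical helper for illustration only, not scrapemark's _merge_captures.
    """
    for key, value in slave.items():
        if isinstance(value, list) and isinstance(master.get(key), list):
            master[key].extend(value)   # list captures accumulate
        else:
            master[key] = value         # scalar captures overwrite
    return master


def append_as_new_item(items, slave):
    """No merge: the slave's captures become a separate list item."""
    items.append(dict(slave))
    return items


day = {'number': 1, 'points': [5.6]}   # captures from the outer loop so far
inner = {'points': [24.5]}             # capture from the next inner-loop pass

# Work on a copy so `day` itself is left untouched.
merged = merge_into_current(dict(day, points=list(day['points'])), inner)
split = append_as_new_item([day], inner)

print(merged)  # one complete record for the day
print(split)   # the record is split in two, as in the reported output
```

If the same function is used in both situations, it presumably needs a way to tell which of the two behaviours the caller wants.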
I also tried modifying the examples you had into doctest-compatible docstrings, like so: '''
Scrape some text:
>>> scrape("""
... <title>:: {{ page_title }}</title>
... """,
... html)
{'page_title': u'The Page Title'}

Scrape some text (quick version):
>>> scrape("""
... <title>:: {{ }}</title>
... """,
... html)
u'The Page Title'

Loop over certain divs, scrape a list:
>>> scrape("""
... <body>
... {*
... <div class='section' id='{{ [section_ids] }}' />
... *}
... </body>
... """,
... html)
{'section_ids': [u'content', u'footer']}

Scrape text before a certain element:
>>> scrape("""
... <div id='content'>
... {{ before_table }}
... <table />
... </div>
... """,
... html)
{'before_table': u'Look at these data points'}

Scrape a column from a table (as a list of ints):
>>> scrape("""
... <table>
... <tr />
... {*
... <tr>
... <td>{{ [day_numbers]|int }}</td>
... </tr>
... *}
... </table>
... """,
... html)
{'day_numbers': [1, 2, 3]}

Scrape the entire table with nested loops and dot-notation:
>>> scrape("""
... <table>
... <tr />
... {*
... <tr>
... <td>{{ [days].number|int }}</td>
... {*
... <td>{{ [days].[points]|float }}</td>
... *}
... </tr>
... *}
... </table>
... """,
... html)
{'days': [{'number': 1, 'points': [1.0, 1.5]},
{'number': 2, 'points': [2.0, 2.5]},
{'number': 3, 'points': [3.0, 3.5]}]}
'''
which would hopefully make it easier to do regression tests...
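For what it's worth, here is a rough sketch of how examples like these could be run as a regression test using the standard `doctest` module. `scrape_title` is a hypothetical stand-in so the snippet runs without scrapemark installed, and the expected values use Python 3 reprs (no `u` prefix). Two details matter: prose must be separated from the preceding expected output by a blank line, and a multi-line expected value only matches when `NORMALIZE_WHITESPACE` is enabled.

```python
import doctest
import re

HTML = "<html><head><title>:: The Page Title</title></head></html>"


def scrape_title(html):
    """Hypothetical stand-in for a scrape() call; it is not part of scrapemark."""
    match = re.search(r"<title>::\s*(.*?)</title>", html)
    return {'page_title': match.group(1)} if match else None


EXAMPLES = '''
Scrape some text:

>>> scrape_title(html)
{'page_title': 'The Page Title'}

A multi-line expected value (needs NORMALIZE_WHITESPACE):

>>> [scrape_title(html),
...  scrape_title(html)]
[{'page_title': 'The Page Title'},
 {'page_title': 'The Page Title'}]
'''


def run_examples():
    """Parse EXAMPLES as a doctest and return the runner's (failed, attempted) result."""
    globs = {'scrape_title': scrape_title, 'html': HTML}
    test = doctest.DocTestParser().get_doctest(
        EXAMPLES, globs, 'scrapemark_examples', None, 0)
    runner = doctest.DocTestRunner(
        verbose=False, optionflags=doctest.NORMALIZE_WHITESPACE)
    return runner.run(test)


results = run_examples()
print(results.failed, results.attempted)
```

With the real library, `globs` would instead hold `scrape` and the sample `html` string from the docs.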
Anyways, I'd be really happy if you could get this fixed sometime :D
Tell me if there's anything I could do to help!
Reported by adb...@gmail.com, Dec 26, 2009
What steps will reproduce the problem?
Run the nested loops example from http://arshaw.com/scrapemark/docs/examples/
What is the expected output? What do you see instead?
Expected output is: {'days': [{'number': 1, 'points': [5.6, 24.5]}, {'number': 2, 'points': [1.1, 12.8]}, {'number': 3, 'points': [2.4, 5.67]}]}
Instead, you get: {'days': [{'points': [5.6], 'number': 1}, {'points': [24.5]}, {'points': [1.1], 'number': 2}, {'points': [12.8]}, {'points': [2.4], 'number': 3}, {'points': [5.67], 'number': 0}]}
What version of the product are you using? On what operating system?
v0.9
Please provide any additional information below.
This is a regression from scrapemark.py r2, which works fine.
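The difference between the two outputs can be checked mechanically. The sketch below is only an illustration: `malformed_days` is a hypothetical helper, and the two dicts are copied from the expected and actual output above. It flags every `days` entry that lacks a `number` or does not carry the full list of points.

```python
# Output copied from the report above.
EXPECTED = {'days': [{'number': 1, 'points': [5.6, 24.5]},
                     {'number': 2, 'points': [1.1, 12.8]},
                     {'number': 3, 'points': [2.4, 5.67]}]}

ACTUAL_V09 = {'days': [{'points': [5.6], 'number': 1}, {'points': [24.5]},
                       {'points': [1.1], 'number': 2}, {'points': [12.8]},
                       {'points': [2.4], 'number': 3},
                       {'points': [5.67], 'number': 0}]}


def malformed_days(result, points_per_day=2):
    """Return the indexes of 'days' entries that are missing 'number'
    or whose 'points' list does not have points_per_day values.

    Hypothetical helper; points_per_day=2 matches the example table.
    """
    bad = []
    for index, day in enumerate(result.get('days', [])):
        if 'number' not in day or len(day.get('points', [])) != points_per_day:
            bad.append(index)
    return bad


print(malformed_days(EXPECTED))    # no malformed entries
print(malformed_days(ACTUAL_V09))  # every entry holds a single point
```

A check like this passes on the expected output and flags all six entries of the v0.9 output, since each one carries only a single point.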