failing test "Test extracting the PDF url"

rejozenger commented 11 years ago

When running "newspeak test newspeak" on a fresh install:

DEBUG Updating kst-31981-1 : Tweede Kamer der Staten-Generaal
DEBUG Attempt matching by entry link https://zoek.officielebekendmakingen.nl/kst-31981-1.html
DEBUG Creating new entry https://zoek.officielebekendmakingen.nl/kst-31981-1.html
DEBUG Fetching https://zoek.officielebekendmakingen.nl/kst-31981-1.html
DEBUG Parsing HTML for https://zoek.officielebekendmakingen.nl/kst-31981-1.html
DEBUG Resolving XPath id('main-column')
DEBUG Extracted summary for kst-31981-1 : Tweede Kamer der Staten-Generaal from https://zoek.officielebekendmakingen.nl/kst-31981-1.html
DEBUG 0 enclosures added to entry kst-31981-1 : Tweede Kamer der Staten-Generaal
.
======================================================================
FAIL: Test extracting the PDF url from a Rijksoverheid announcement.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/www/tinalo/www/src/newspeak/newspeak/tests.py", line 527, in test_extract_pdf_rijksoverheid
    self.assertEquals(result, result_url)
AssertionError: '' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'
    "'' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'" = '%s != %s' % (safe_repr(''), safe_repr('http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'))
    "'' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'" = self._formatMessage("'' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'", "'' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'")
>>  raise self.failureException("'' != 'http://www.rijksoverheid.nl/bestanden/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011/eindrapport-audit-ciot-2011.pdf'")

-------------------- >> begin captured logging << --------------------
newspeak.utils: DEBUG: Fetching http://www.rijksoverheid.nl/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011.html
newspeak.utils: DEBUG: Parsing HTML for http://www.rijksoverheid.nl/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011.html
newspeak.crawler: DEBUG: Resolving XPath id('content-column')/descendant::div[@class='download-chunk']/descendant::a/attribute::href
newspeak.crawler: WARNING: XPath id('content-column')/descendant::div[@class='download-chunk']/descendant::a/attribute::href did not return a value, returning empty string.
--------------------- >> end captured logging << ---------------------

Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
newspeak                           0      0   100%
newspeak.admin                    27      0   100%
newspeak.conf                      0      0   100%
newspeak.conf.default             25     25     0%   6-163
newspeak.conf.newspeak             3      3     0%   4-16
newspeak.conf.urls                 4      0   100%
newspeak.crawler                 186     31    83%   33-37, 56, 119, 163, 251-253, 294-296, 320-322, 326-335, 347, 361, 377, 381, 387, 429-438, 462-463
newspeak.feeds                   101      6    94%   48, 63-64, 69-70, 169
newspeak.management                0      0   100%
newspeak.management.commands       0      0   100%
newspeak.migrations                0      0   100%
newspeak.models                   94      0   100%
newspeak.runner                   18     18     0%   1-45
newspeak.urls                      5      0   100%
newspeak.utils                    62      2    97%   102, 151
------------------------------------------------------------
TOTAL                            525     85    84%
----------------------------------------------------------------------
Ran 25 tests in 935.646s

FAILED (failures=1)
Destroying test database for alias 'default'...

dokterbob commented 11 years ago

This is likely due to a change in the Rijksoverheid's RSS feed. It seems that the XPath expression on line 513 of the tests no longer works; the HTML structure has probably changed.

We need to update the XPath both in the test and in the bof_feeds fixture.

dokterbob commented 11 years ago

I have updated the XPath expression for extracting PDF files for rijksoverheid.nl from: id('content-column')/descendant::div[@class='download-chunk']/descendant::a/attribute::href to id('content')//a[@class='download-chunk pdf']/@href.
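To illustrate why the old expression comes back empty while the new one matches, here is a minimal sketch using only the standard library. The page markup is a hypothetical, heavily simplified stand-in (the real rijksoverheid.nl HTML is not reproduced in this thread), and the two lookups are approximations: ElementTree's XPath subset has no id() function and no attribute step, so they are rewritten as attribute predicates plus .get("href"). The function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup resembling the new page layout (an assumption, not the real page).
NEW_PAGE = """
<html><body>
  <div id="content">
    <a class="download-chunk pdf" href="http://example.org/report.pdf">Download</a>
  </div>
</body></html>
"""


def first_href(root, path):
    """Return the href of the first element matching path, or '' if none match."""
    for element in root.findall(path):
        href = element.get("href")
        if href:
            return href
    return ""


def old_lookup(root):
    # Approximates:
    # id('content-column')/descendant::div[@class='download-chunk']/descendant::a/attribute::href
    return first_href(root, ".//*[@id='content-column']//div[@class='download-chunk']//a")


def new_lookup(root):
    # Approximates: id('content')//a[@class='download-chunk pdf']/@href
    return first_href(root, ".//*[@id='content']//a[@class='download-chunk pdf']")


root = ET.fromstring(NEW_PAGE)
old_result = old_lookup(root)
new_result = new_lookup(root)
print(repr(old_result))
print(repr(new_result))
```

With this sample markup the old lookup yields an empty string, which is the same symptom as the '' in the failing assertion, while the new lookup yields the PDF link.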

The fixture feeds_bof is also updated - but I recommend updating this expression manually for the production environment. Something a lot like the following SQL should suffice (untested):

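-- Single-quoted literals ('' escapes an embedded quote) so this also works on databases that reserve double quotes for identifiers.
UPDATE newspeak_feed
    SET enclosure_xpath = 'id(''content'')//a[@class=''download-chunk pdf'']/@href'
    WHERE enclosure_xpath = 'id(''content-column'')/descendant::div[@class=''download-chunk'']/descendant::a/attribute::href';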

In the original BOF fixture there were 8 occurrences.
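If you would rather not hand-escape the quotes, the same update can be run with bound parameters. The sketch below does this against an in-memory SQLite table that stands in for newspeak_feed; the stand-in schema (only an id and the enclosure_xpath column) and the function name are assumptions for illustration, and the `?` placeholder style is SQLite's, so other database drivers may need `%s` instead. The returned row count can be compared with the number of occurrences you expect.

```python
import sqlite3

OLD_XPATH = (
    "id('content-column')/descendant::div[@class='download-chunk']"
    "/descendant::a/attribute::href"
)
NEW_XPATH = "id('content')//a[@class='download-chunk pdf']/@href"


def migrate_enclosure_xpath(conn):
    """Replace the old XPath with the new one and return the number of rows changed."""
    cursor = conn.execute(
        "UPDATE newspeak_feed SET enclosure_xpath = ? WHERE enclosure_xpath = ?",
        (NEW_XPATH, OLD_XPATH),
    )
    conn.commit()
    return cursor.rowcount


# Stand-in table with only the relevant column (assumed schema, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE newspeak_feed (id INTEGER PRIMARY KEY, enclosure_xpath TEXT)"
)
conn.executemany(
    "INSERT INTO newspeak_feed (enclosure_xpath) VALUES (?)",
    [(OLD_XPATH,), (OLD_XPATH,), ("id('main-column')",)],
)

updated = migrate_enclosure_xpath(conn)
remaining = conn.execute(
    "SELECT COUNT(*) FROM newspeak_feed WHERE enclosure_xpath = ?", (OLD_XPATH,)
).fetchone()[0]
print(updated, remaining)
```

Feeds using any other expression are left untouched, because the WHERE clause matches the old expression exactly.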

It might be wise to watch the logs for:

WARNING XPath id('content-column')/descendant::div[@class='download-chunk']/descendant::a/attribute::href did not return a value, returning empty string.

After applying the above SQL, this warning should be gone.
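A quick way to check this is to count the stale warning in the crawler output. This is only a sketch: the helper name is made up, the sample lines are taken from the captured logging in this issue, and where your production logs actually live is something you will have to fill in yourself.

```python
OLD_XPATH = (
    "id('content-column')/descendant::div[@class='download-chunk']"
    "/descendant::a/attribute::href"
)
# The message text as it appears in the captured logging above.
MARKER = "XPath " + OLD_XPATH + " did not return a value"


def count_stale_warnings(lines):
    """Count WARNING lines that mention the old enclosure XPath."""
    return sum(1 for line in lines if "WARNING" in line and MARKER in line)


sample_log = [
    "newspeak.utils: DEBUG: Fetching http://www.rijksoverheid.nl/documenten-en-publicaties/rapporten/2012/09/25/eindrapport-audit-ciot-2011.html",
    "newspeak.crawler: DEBUG: Resolving XPath " + OLD_XPATH,
    "newspeak.crawler: WARNING: XPath " + OLD_XPATH
    + " did not return a value, returning empty string.",
]

stale = count_stale_warnings(sample_log)
print(stale)
```

Any iterable of lines works, so an open log file can be passed in directly. A count of zero after a full crawl suggests that no feed still uses the old expression.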

bitsoffreedom / newspeak

failing test "Test extracting the PDF url" #34