Code4SA / various-scrapers

Apache License 2.0

Citizen stories still coming in without text? #7

Open aserlich opened 10 years ago

aserlich commented 10 years ago

Looks like we are still getting zero-length text for some Caxton stories. See the example below. Can we rerun the scraper to capture these and then query the database again to check?

> db.mycollection.find({publication:"The Citizen",downloaded_at: {$gt: new Date(2014, 3, 3) } })[1]
{
    "_id" : ObjectId("533c8f2aa8a0b8592c2cf29d"),
    "publication" : "The Citizen",
    "author" : "Kate Henry",
    "url" : "http://citizen.co.za/154023/aces-stun-pirates/",
    "text" : "",
    "title" : "Aces stun Pirates",
    "summary" : "A late strike by Thulasizwe Mbuyane buried Soweto giants Orlando Pirates on Wednesday night, as Mpumalanga Black Aces cemented their place in the top half of the Premiership standings with a 1-0 win at Orlando Stadium.",
    "owner" : "Caxton",
    "published" : ISODate("2014-04-02T21:20:11Z"),
    "downloaded_at" : ISODate("2014-04-03T00:28:58.192Z"),
    "sub_type" : 2
}
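
To see how widespread the empty records are, rather than inspecting a single document, a filter along these lines could be used from Python. This is only a sketch: the field names and the collection name `mycollection` come from the query above, while the helper name `empty_text_filter`, the database name in the usage comment, and the use of pymongo are assumptions. Note that JavaScript months are zero-indexed, so `new Date(2014, 3, 3)` in the shell query means 3 April 2014.

```python
from datetime import datetime


def empty_text_filter(publication, since):
    """Build a MongoDB filter for stories scraped after `since` with no text.

    Hypothetical helper: matching on None also matches documents where
    the `text` field is missing entirely, not just empty strings.
    """
    return {
        "publication": publication,
        "downloaded_at": {"$gt": since},
        "text": {"$in": ["", None]},
    }


query = empty_text_filter("The Citizen", datetime(2014, 4, 3))
print(query)

# Assumed usage with pymongo (the database name is a placeholder):
#   from pymongo import MongoClient
#   collection = MongoClient()["some_database"]["mycollection"]
#   print(collection.count_documents(query))
```

Counting with the `text` condition included separates stories that were stored without text from stories that were never stored at all.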
aserlich commented 10 years ago

Hi Adi,

I looked at today's error logs and this still seems to be a problem. Do we understand what is causing it?

2014-04-04 15:30:57,505 - __main__ - WARNING - Missing text from http://citizen.co.za/154875/vavi-ruling-victory-wasp/ [in scraper.py:26]
2014-04-04 15:30:57,507 - __main__ - WARNING - Missing text from http://citizen.co.za/154874/luthiania-takes-lead-sa/ [in scraper.py:26]
2014-04-04 16:30:26,015 - __main__ - WARNING - Missing text from http://citizen.co.za/154904/anc-sms-application-dismissed/ [in scraper.py:26]
2014-04-04 16:30:26,240 - __main__ - WARNING - Missing text from http://citizen.co.za/154893/swimmers-complete-459km-24-days/ [in scraper.py:26]
2014-04-04 16:30:26,798 - __main__ - WARNING - Missing text from http://citizen.co.za/154904/anc-sms-application-dismissed/ [in scraper.py:26]
2014-04-04 17:29:27,820 - __main__ - WARNING - Missing text from http://citizen.co.za/154927/bccsa-dismisses-complaint-702/ [in scraper.py:26]
2014-04-04 17:29:27,951 - __main__ - WARNING - Missing text from http://citizen.co.za/154922/eff-launches-gauteng-election-campaign/ [in scraper.py:26]
2014-04-04 17:29:29,289 - __main__ - WARNING - Missing text from http://citizen.co.za/154916/nehawu-consult-legal-advisors/ [in scraper.py:26]
2014-04-04 17:29:29,283 - __main__ - WARNING - Missing text from http://citizen.co.za/154918/slain-miner-fleeing-shot-commission/ [in scraper.py:26]
2014-04-04 17:29:29,939 - __main__ - WARNING - Missing text from http://citizen.co.za/154915/kzn-traffic-cop-shot-dead/ [in scraper.py:26]
2014-04-04 18:30:26,419 - __main__ - WARNING - Missing text from http://citizen.co.za/154935/court-rules-krejcir/ [in scraper.py:26]
2014-04-04 18:30:26,421 - __main__ - WARNING - Missing text from http://citizen.co.za/154953/hondas-187-kmh-lawnmower/ [in scraper.py:26]
2014-04-04 18:30:26,768 - __main__ - WARNING - Missing text from http://citizen.co.za/154930/vavi-voice-voiceless-sym/ [in scraper.py:26]
2014-04-04 18:30:27,202 - __main__ - WARNING - Missing text from http://citizen.co.za/154929/nkandla-rpeort-political-hands/ [in scraper.py:26]
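
For anyone wanting to collect the affected URLs from a log like this for re-processing, a minimal sketch follows. The regular expression is inferred from the excerpt above only, and the function name `missing_text_urls` is hypothetical; it is not part of the scraper. Duplicates are dropped because the same URL can be logged more than once (the `anc-sms-application-dismissed` story appears twice above).

```python
import re

# Pattern inferred from the WARNING lines in this log excerpt (an assumption).
MISSING_TEXT = re.compile(r"WARNING - Missing text from (\S+) \[in ")


def missing_text_urls(log_lines):
    """Return URLs flagged as missing text, in first-seen order, without duplicates."""
    seen = set()
    urls = []
    for line in log_lines:
        match = MISSING_TEXT.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            urls.append(match.group(1))
    return urls


sample = [
    "2014-04-04 16:30:26,015 - __main__ - WARNING - Missing text from http://citizen.co.za/154904/anc-sms-application-dismissed/ [in scraper.py:26]",
    "2014-04-04 16:30:26,240 - __main__ - WARNING - Missing text from http://citizen.co.za/154893/swimmers-complete-459km-24-days/ [in scraper.py:26]",
    "2014-04-04 16:30:26,798 - __main__ - WARNING - Missing text from http://citizen.co.za/154904/anc-sms-application-dismissed/ [in scraper.py:26]",
]

for url in missing_text_urls(sample):
    print(url)
```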
adieyal commented 10 years ago

I've made some changes - will monitor the situation. I have also re-processed the old urls.

Adi

aserlich commented 10 years ago

Ok, thanks for the update! Will also keep my eye out.

aserlich commented 10 years ago

Looks like this problem is coming up again, this time for stories that do have text on the page. Any ideas?

2014-04-30 10:29:33,803 - __main__ - WARNING - Missing text from http://boksburgadvertiser.co.za/195724/annual-mrs-south-africa-cansa-gala-dinner-2/ [in scraper.py:30]
2014-04-30 11:29:55,375 - __main__ - WARNING - Missing text from http://mpumalanganews.co.za/172006/ambulances-handed-spead-service-delivery/ [in scraper.py:30]
2014-04-30 14:30:49,553 - __main__ - WARNING - Missing text from http://southcoastsun.co.za/37427/toti-fc-u9-westville-hutchison-park/ [in scraper.py:30]
2014-04-30 15:06:48,708 - __main__ - WARNING - Missing text from http://www.wstandard.mobi/news/read/4372/scubi-forester-revisited [in scraper.py:30]
2014-04-30 15:31:36,209 - __main__ - WARNING - Missing text from http://roodepoortrecord.co.za/2014/04/30/tots-tweens-teens-competition-week-3-9-13-years/ [in scraper.py:30]
2014-04-30 15:31:38,851 - __main__ - WARNING - Missing text from http://roodepoortrecord.co.za/2014/04/30/tots-tweens-teens-competition-week-3-4-8-years/ [in scraper.py:30]
2014-04-30 16:46:05,434 - __main__ - WARNING - Missing text from http://www.tametimes.mobi/news/read/2857/20-years-of-freedom-and-democracy-campaign-support [in scraper.py:30]
2014-04-30 16:46:08,178 - __main__ - WARNING - Missing text from http://www.tametimes.mobi/news/read/2859/growth-for-spur-school-mountain-bike-league [in scraper.py:30]
2014-04-30 16:48:07,172 - __main__ - WARNING - Missing text from http://www.tametimes.mobi/news/read/2645/minister-of-sport-storms-out-of-al-jazeera-studio [in scraper.py:30]
2014-04-30 16:48:10,733 - __main__ - WARNING - Missing text from http://www.tametimes.mobi/news/read/2648/a-hippo-love-story [in scraper.py:30]
2014-04-30 16:50:27,715 - __main__ - WARNING - Missing text from http://www.tametimes.mobi/news/read/2754/the-rand-show-2014-it-s-showtime [in scraper.py:30]
2014-05-01 00:40:47,322 - __main__ - WARNING - Missing text from http://www.vrystaat.mobi/news/read/2042/spotprent-1-mei-2014 [in scraper.py:30]
2014-05-01 01:03:52,820 - root - ERROR - Error accessing url: {u'url': u'http://www.maluti.mobi/news/read/1254/winterwenke-vir-die-tuinier', u'entry': {}, u'scraper': u'naspers_local', u'publication': u'Maluti'} [in /var/www/scrapers/
adieyal commented 10 years ago

Some of them don't actually have bodies but there are a few that do. Looking into it.
