Code4SA / various-scrapers

Apache License 2.0

Isolezwe scraper is not pulling text properly #6

Closed aserlich closed 10 years ago

aserlich commented 10 years ago

I am testing the text scraping and it is not functioning properly. Looking at the Isolezwe stories, we are only grabbing one line of text, and that text is not associated with the story. Do we need to change the subtype on Isolezwe? Is it the same as other Isolezwe papers?

```
> db.mycollection.find({publication:"Isolezwe",downloaded_at: {$gt: new Date(2014, 3, 1) } })[1]
{
    "_id" : ObjectId("533ab090a8a0b802e763c226"),
    "url" : "http://www.iol.co.za/asivumi-ukulunga-isimo-ehostela-lakwamashu-1.1669273",
    "publication" : "Isolezwe",
    "published" : ISODate("2014-04-01T09:08:56Z"),
    "downloaded_at" : ISODate("2014-04-01T14:26:55.399Z"),
    "text" : "I'm a 21 year old woman looking to meet men and women between the ages of 25 and 27.",
    "title" : "Asivumi ukulunga isimo ehostela laKwaMashu",
    "sub_type" : 1,
    "owner" : "IOL",
    "summary" : "<!--PSTYLE=WL Web Lead--><p>Kunezinsolo zokuthi sekubuye iqembu lezinkabi elibulala abantu ebese lihambile ehostela laKwaMashu.</p>"
}
> db.mycollection.find({publication:"Isolezwe",downloaded_at: {$gt: new Date(2014, 3, 1) } })[2]
{
    "_id" : ObjectId("533ab090a8a0b802e8c62c2b"),
    "url" : "http://www.iol.co.za/abenfp-baphume-behlehla-emzumbe-1.1669264",
    "publication" : "Isolezwe",
    "published" : ISODate("2014-04-01T09:03:56Z"),
    "downloaded_at" : ISODate("2014-04-01T14:26:55.577Z"),
    "text" : "I'm a 21 year old woman looking to meet men and women between the ages of 25 and 27.",
    "title" : "AbeNFP baphume behlehla eMzumbe",
    "sub_type" : 1,
    "owner" : "IOL",
    "summary" : "<!--PSTYLE=WL Web Lead--><p>Basinde kukubi abaholi beNFP abebekhankasa esizindeni se-ANC eMzumbe.</p>"
}
```

adieyal commented 10 years ago

Very strange - when I test on my side it comes out perfectly. Will delete those articles and re-run them.
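
Before deleting, the affected records could be identified by looking for article text that repeats across different URLs, which is the symptom in the report above. The following is only a minimal sketch, not the scraper's actual code: `find_shared_text` is a hypothetical helper, and it assumes the documents have been loaded as plain dicts with the `url` and `text` fields shown in the query output.

```python
from collections import defaultdict


def find_shared_text(articles, min_count=2):
    """Return {text: [urls]} for texts scraped from at least `min_count` URLs.

    Hypothetical helper: a body text that is identical across different
    article URLs suggests the scraper picked up a shared page element
    (for example an advert) instead of the story.
    """
    urls_by_text = defaultdict(set)
    for article in articles:
        text = (article.get("text") or "").strip()
        if text:  # skip records with no scraped text at all
            urls_by_text[text].add(article["url"])
    return {
        text: sorted(urls)
        for text, urls in urls_by_text.items()
        if len(urls) >= min_count
    }


# The two records from the report, reduced to the relevant fields.
articles = [
    {
        "url": "http://www.iol.co.za/asivumi-ukulunga-isimo-ehostela-lakwamashu-1.1669273",
        "text": "I'm a 21 year old woman looking to meet men and women between the ages of 25 and 27.",
    },
    {
        "url": "http://www.iol.co.za/abenfp-baphume-behlehla-emzumbe-1.1669264",
        "text": "I'm a 21 year old woman looking to meet men and women between the ages of 25 and 27.",
    },
]

suspect = find_shared_text(articles)
for text, urls in suspect.items():
    print(len(urls), "articles share the text:", text)
```

The URLs returned for each shared text are the candidates for deletion and re-scraping; a legitimate article body should not normally repeat verbatim across stories.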

adieyal commented 10 years ago

Fixed