Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.
341 stars 36 forks source link

Incorrect and maybe dangerous result #15

Closed entertainyou closed 8 years ago

entertainyou commented 8 years ago
describe('decode',function(){
  it('TEST',function(done){
    var article = '<title>title</title>' +
        '<body>' +
        '<p foo="&quot; width=1000 onclick=alert(10) test"> HELLO WORLD</p>' +
        '</body>';

    read({
      html: article,
    }, function(err, art){
      should.not.exist(err);
      console.log('art', art.content);
      done();
    });
  });
});

The output is

art <p foo="" width=1000 onclick=alert(10) test"> HELLO WORLD</p>

makes the html content not safe to use.

The issue seems lies in returning a decoded version of html, why decode it(Also, the decodeEntities option parsed to cheerio is set to false, maybe can set it to true)?

midudev commented 8 years ago

Have you tried to put tidyAttrs property to true? Anyway I think that you're right as it's not a desirable output.

Tjatse commented 8 years ago

It's some historical issues of cheerio, Cause' cheerio can not decode Chinese correctly sometimes, e.g. the value of title attribute will be &#1234;&#5678; with some node vers. so try to use tidyAttrs as @miduga recomended.

entertainyou commented 8 years ago

@Tjatse , use tidyAttrs can work sometimes, but for the my case, the html content is more like(> is encoded in one of p's attribute value)

var read = require('../'),
    chai = require('chai'),
    expect = chai.expect,
    should = chai.should();

describe('decode',function(){
  it('TEST1',function(done){
    var article = '<title>title</title>' +
        '<body>' +
        '<p foo="abc&gt;def"> HELLO WORLD</p>' +
        '</body>';

    read({
      html: article,
      tidyAttrs: true,
    }, function(err, art){
      should.not.exist(err);
      console.log('art', art.content);
      done();
    });
  });
});

when tidyAttrs is false, the result is <p foo="abc>def"> HELLO WORLD</p> when tidyAttrs is true, the result is <p>def"> HELLO WORLD</p>.

entertainyou commented 8 years ago

@Tjatse ,do you have a small case of cheerio fails to decode chinese?

Tjatse commented 8 years ago

cheeriojs/cheerio#565

Tjatse commented 8 years ago

I can simply fixed this by removing the entities module and let cheerio handle the encodings by itself, but unfortunately, this problem occasionally happening in v0.19.0 and I need time to dig deep.

Tjatse commented 8 years ago

I think we must get this fixed somehow immediately, a new option like forceDecode?

entertainyou commented 8 years ago

An option should be great, :+1:

Tjatse commented 8 years ago

I've published an alpha ver. read-art@0.4.6-alpha. The GA will come soon.

entertainyou commented 8 years ago

What does GA mean?

Tjatse commented 8 years ago

General Availability -> 0.4.6