Closed entertainyou closed 8 years ago
Have you tried to put tidyAttrs
property to true? Anyway I think that you're right as it's not a desirable output.
It's some historical issues of cheerio
, Cause' cheerio
can not decode Chinese correctly sometimes, e.g. the value of title
attribute will be Ӓᘮ
with some node vers. so try to use tidyAttrs
as @miduga recomended.
@Tjatse , use tidyAttrs can work sometimes, but for the my case, the html content is more like(> is encoded in one of p's attribute value)
var read = require('../'),
chai = require('chai'),
expect = chai.expect,
should = chai.should();
describe('decode',function(){
it('TEST1',function(done){
var article = '<title>title</title>' +
'<body>' +
'<p foo="abc>def"> HELLO WORLD</p>' +
'</body>';
read({
html: article,
tidyAttrs: true,
}, function(err, art){
should.not.exist(err);
console.log('art', art.content);
done();
});
});
});
when tidyAttrs
is false, the result is <p foo="abc>def"> HELLO WORLD</p>
when tidyAttrs
is true, the result is <p>def"> HELLO WORLD</p>
.
@Tjatse ,do you have a small case of cheerio fails to decode chinese?
cheeriojs/cheerio#565
I can simply fixed this by removing the entities
module and let cheerio
handle the encodings by itself, but unfortunately, this problem occasionally happening in v0.19.0
and I need time to dig deep.
I think we must get this fixed somehow immediately, a new option like forceDecode
?
forceDecode: false
decodes by cheerio
.forceDecode: true
fallback to decode context by entities
but not cheerio
.An option should be great, :+1:
I've published an alpha ver. read-art@0.4.6-alpha
. The GA will come soon.
What does GA mean?
General Availability -> 0.4.6
The output is
makes the html content not safe to use.
The issue seems lies in returning a decoded version of html, why decode it(Also, the decodeEntities option parsed to cheerio is set to false, maybe can set it to true)?