ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0
2.15k stars 221 forks

Fixed side effect from invocation of cleaner in unfluff.lazy #21

Open franza opened 10 years ago

franza commented 10 years ago

I was sure I had checked this for #16, but it seems I missed it.

cleaner mutates the original doc object, so doc needs to be recalculated afterwards. As it stands, every call made after cleaner has been applied suffers from this side effect. Consider the following example:

[fs, unfluff] = ['fs', 'unfluff'].map require

html = fs.readFileSync('test_tags_kexp.html', 'utf8')

doc1 = unfluff.lazy html
doc2 = unfluff.lazy html

console.log 'tags1: ', doc1.tags() # ['Dennis Morton', 'film', 'kusp film review', 'Stand Up Guys']
console.log 'text1: ', doc1.text()

console.log 'text2: ', doc2.text()
console.log 'tags2: ', doc2.tags() # [ ]

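Running this code against the test_tags_kexp.html fixture gives different results for tags(), because cleaner is called inside text(): doc1 reads its tags before cleaner runs, while doc2 reads them afterwards and gets an empty list. So whenever cleaner is called, we need to reload the document. I also did some refactoring along the way.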
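To make the side effect easier to see in isolation, here is a minimal sketch in plain JavaScript. It does not use unfluff's real internals: `parse`, `cleaner`, `lazyShared` and `lazyIsolated` are hypothetical stand-ins, and the "document" is just an object holding a list of strings. It only illustrates the general idea of giving the mutating cleaner its own freshly loaded document, which may differ from how the actual patch does it.

```javascript
// Hypothetical stand-ins; none of these names come from unfluff itself.

// "Parse" the input into a mutable document object.
function parse(html) {
  return { nodes: html.split('|') };
}

const isTag = (node) => node.startsWith('tag:');
const tagName = (node) => node.slice('tag:'.length);

// A cleaner that mutates the document it is given (it drops tag nodes).
function cleaner(doc) {
  doc.nodes = doc.nodes.filter((node) => !isTag(node));
  return doc;
}

// Buggy variant: every getter shares one document, so text() destroys
// the nodes that tags() needs.
function lazyShared(html) {
  const doc = parse(html);
  return {
    tags: () => doc.nodes.filter(isTag).map(tagName),
    text: () => cleaner(doc).nodes.join(' '),
  };
}

// Isolated variant: the cleaner works on a freshly loaded document,
// so the untouched one stays valid for the other getters.
function lazyIsolated(html) {
  const doc = parse(html);
  return {
    tags: () => doc.nodes.filter(isTag).map(tagName),
    text: () => cleaner(parse(html)).nodes.join(' '),
  };
}

const html = 'tag:film|tag:review|Stand Up Guys is a film.';

const shared = lazyShared(html);
shared.text();
console.log('shared tags:', shared.tags());     // []

const isolated = lazyIsolated(html);
isolated.text();
console.log('isolated tags:', isolated.tags()); // [ 'film', 'review' ]
```

The cost of the isolated variant is a second parse whenever text() is called, which is the reload this change introduces.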

ageitgey commented 10 years ago

Thanks for catching this! I'll take a look in detail when I have some time this weekend.

franza commented 10 years ago

Sure. If you have ideas on how we can avoid reloading the document, bring them up.

ageitgey commented 10 years ago

Sorry, I've been lax on reviewing this. Still plan to get to this very soon. Thanks!