I was doing a quick lit review to see if/how the state-of-the-art in web content extraction had changed over the past few years, and came upon a conference paper from last September, Learning Web Content Extraction with DOM Features, that seems interesting, relevant, and performant. There's also code: see learnhtml. Is there any interest in implementing its feature set within dragnet, and evaluating model performance with such features? This could be related to updates proposed in Issue #85.