Open fimad opened 4 months ago
Is this still planned? It would be nice to get better performance by default. However, `fast-tagsoup` seems to be unmaintained.
This is something I'd like to do, but I don't have a concrete timeline. As mentioned in #111, major scalpel development currently happens in bursts when I have large blocks of free time and I'm not sure when that will occur next.
The maintenance status of `fast-tagsoup` is a bit concerning. Looking through the code, I see some TODOs and some commented-out code, which makes me nervous that there may be unhandled edge cases.
Before seriously doing this, I'd want to sit down and understand how `fast-tagsoup` works so that I'd feel comfortable fixing any bugs upstream if necessary.
While debugging some slow scraping recently, I realized that the vast majority of the time was spent in `TagSoup.parseTags`. Naively swapping to the `fast-tagsoup` package results in a ~20x speedup for the case that I was debugging.
One caveat of using the `fast-tagsoup` parser: it only works with strict ByteStrings.
Given this ridiculous speedup, it probably makes sense to try to make this the default HTML parser for scalpel. Doing that would require figuring out how to work this into the existing API, where the scraper class is parameterized by the underlying string type.
Maybe we do a breaking API change and finally do away with all the `StringLike` code once and for all.
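For reference, here is a minimal sketch of what the naive swap described above might look like. It assumes the `fast-tagsoup` package exposes `parseTags :: ByteString -> [Tag ByteString]` from `Text.HTML.TagSoup.Fast` and that `scalpel-core` exports `scrape`, which runs a scraper over an already parsed tag list. The names `headings`, `sampleHtml`, and `scrapeFast` are hypothetical and exist only for illustration; this is not necessarily the snippet from the original case.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.ByteString        (ByteString)
import           Text.HTML.Scalpel.Core (Scraper, scrape, texts)
import qualified Text.HTML.TagSoup.Fast as Fast

-- Hypothetical scraper for illustration: the text of every <h1> element.
-- The string type is fixed to strict ByteString because that is the only
-- input type the fast-tagsoup parser accepts.
headings :: Scraper ByteString [ByteString]
headings = texts "h1"

-- Hypothetical sample input.
sampleHtml :: ByteString
sampleHtml = "<html><body><h1>Hello</h1><h1>World</h1></body></html>"

-- Parse with fast-tagsoup instead of TagSoup.parseTags, then hand the
-- resulting tag list to scalpel's `scrape`.
scrapeFast :: Scraper ByteString a -> ByteString -> Maybe a
scrapeFast scraper = scrape scraper . Fast.parseTags

main :: IO ()
main = print (scrapeFast headings sampleHtml)
```

Because `scrape` already accepts a pre-parsed tag list, this swap needs no changes to scalpel itself. The open question is only how to make it the default while the rest of the API stays polymorphic over the string type.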