autogram-is / spidergram

Structural analysis tools for complex web sites
GNU General Public License v3.0

Cleaning up getPageContent #51

Closed by eaton 1 year ago

eaton commented 1 year ago

As part of the work leading up to 8.0, the HtmlTools section, and the getPageContent code in particular, has become pretty chaotic. This set of changes consolidates getPageText, getPageContent, and the related helpers into a single, simpler function. It also removes the attempt to handle complex conditional branching when locating core page content; if different selectors are necessary for different Resources, handling that on the side of the calling function ends up being much cleaner.
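To illustrate the caller-side approach, here is a rough sketch. It is not Spidergram's actual API: the `Resource` shape, the `ContentOptions` type, the `selectorFor` helper, and the signature of `getPageContent` are all assumptions made up for this example, and the extraction uses a naive tag-name match in place of a real HTML parser.

```typescript
// Hypothetical shapes for illustration only; not Spidergram's real types.
interface Resource {
  url: string;
  body: string;
}

interface ContentOptions {
  // Simplified to a bare tag name such as 'main' or 'article'.
  selector: string;
}

// Stand-in for the single consolidated function: it takes exactly one
// selector and contains no per-site branching logic.
function getPageContent(html: string, options: ContentOptions): string {
  const tag = options.selector;
  const pattern = new RegExp(`<${tag}\\b[^>]*>([\\s\\S]*?)</${tag}>`, 'i');
  const match = pattern.exec(html);
  // Fall back to the whole document when the selector matches nothing.
  const inner = match?.[1] ?? html;
  // Strip the remaining tags and collapse whitespace into plain text.
  return inner.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

// The caller decides which selector applies to which Resource.
function selectorFor(resource: Resource): string {
  return resource.url.includes('/blog/') ? 'article' : 'main';
}

const blogPost: Resource = {
  url: 'https://example.com/blog/post',
  body: '<nav>Menu</nav><article><h1>Title</h1><p>Body text</p></article>',
};

const aboutPage: Resource = {
  url: 'https://example.com/about',
  body: '<header>Site</header><main><p>About us</p></main>',
};

const blogText = getPageContent(blogPost.body, { selector: selectorFor(blogPost) });
const aboutText = getPageContent(aboutPage.body, { selector: selectorFor(aboutPage) });
```

Keeping the branching in `selectorFor` means the extraction function only ever sees one selector, so site-specific rules can change without touching it.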