alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Parse: Allow explicit `a` xpaths #191

Closed simonwoerpel closed 2 years ago

simonwoerpel commented 2 years ago

A one-liner with a huge effect ;) So far, I have not encountered any negative side effects from this in my scrapers.

When using the `include_paths` param in the parse stage, it can be very useful to reference exact XPaths to specific links (`a` tags) on a page, e.g. an `a` tag with a specific CSS class for pagination.

Consider this markup:

<div class="pagination">
<a class="previous" href="/1">previous</a>
<a class="next" href="/2">next</a>
</div>

In the current implementation, we cannot target only the "next" page link: the XPath has to point at the container, `.//div[@class="pagination"]`, so that memorious finds all of its `a` children, which includes the "previous" link as well.

With this improvement, the "next" link can be specified directly via the XPath `.//a[@class="next"]`.
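
To illustrate the difference, here is a minimal sketch. It is not the actual patch or memorious code: it uses the standard library's `xml.etree.ElementTree` as a stand-in for the HTML parser, and `collect_links` and its `include_self` flag are hypothetical names made up for this example. It assumes the parse stage first selects root elements via the include path and then looks for `a` elements below each root, and it wraps the markup above in `<html><body>` so the container is not the document root.

```python
import xml.etree.ElementTree as ET

# The pagination markup from above, wrapped so the div is not the root.
MARKUP = """<html><body>
<div class="pagination">
<a class="previous" href="/1">previous</a>
<a class="next" href="/2">next</a>
</div>
</body></html>"""


def collect_links(doc, include_path, include_self):
    """Hypothetical helper: pick roots via include_path, then gather hrefs.

    include_self=False searches only below each root (".//a").
    include_self=True also considers the root itself if it is an `a` tag.
    """
    hrefs = []
    for root in doc.findall(include_path):
        if include_self:
            # iter() yields the root itself (if it matches) and its descendants.
            candidates = root.iter("a")
        else:
            # ".//a" matches descendants only, never the root itself.
            candidates = root.findall(".//a")
        for element in candidates:
            href = element.get("href")
            if href is not None:
                hrefs.append(href)
    return hrefs


doc = ET.fromstring(MARKUP)

# Pointing at the container picks up both pagination links.
container_links = collect_links(doc, './/div[@class="pagination"]', include_self=False)

# Pointing at the `a` tag itself finds nothing when only descendants are searched.
descendants_only = collect_links(doc, './/a[@class="next"]', include_self=False)

# Including the matched element itself yields just the "next" link.
with_self = collect_links(doc, './/a[@class="next"]', include_self=True)

print(container_links, descendants_only, with_self)
```

In this sketch, the descendants-only lookup is what forces the include path to point at the container, while the self-inclusive lookup lets an include path such as `.//a[@class="next"]` select a single link.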

sunu commented 2 years ago

Thanks for taking the time to describe the use case @simonwoerpel! Looks very useful to me.