CASM-Consulting / springcrawler

Apache License 2.0

[WIP] ACLEDScraper to support source rules #20

Closed Punchwes closed 3 years ago

Punchwes commented 4 years ago

ACLEDScraper now supports user-supplied queries, queries loaded from job.json, and a mixture of the two. To support mixed usage, the scope is fixed to body, and for instances loaded from job.json the root/root's value is added to the corresponding field's list. norconexTagger support is also included in this PR/branch; the norconexTagger branch/PR focuses only on ACLEDTagger, and its ACLEDScraper does not support user-input selectors.
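For reviewers, here is a rough sketch of the mixed-mode idea described above. It is not the code in this branch. It assumes selectors are kept as one list per field, that user-supplied selectors are listed first, and that the root from a job.json instance is appended to the list of the field it belongs to. The class name `RuleMergeSketch`, the method `merge`, the field keys, and the job.json selectors `h1.headline` and `div.article-body` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from springcrawler.
final class RuleMergeSketch {

    // Assumption: every selector is evaluated from the document body,
    // so user rules and job.json rules share the same scope.
    static final String SCOPE = "body";

    // Builds one selector list per field: user-supplied selectors first,
    // then the root selector of each job.json instance for that field.
    static Map<String, List<String>> merge(Map<String, String> userRules,
                                           Map<String, String> jobRoots) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : userRules.entrySet()) {
            merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (Map.Entry<String, String> e : jobRoots.entrySet()) {
            List<String> selectors = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            // Skip a job.json root that duplicates a user-supplied selector.
            if (!selectors.contains(e.getValue())) {
                selectors.add(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("title", "div.siete60 div.SlaBLK22");   // user-supplied rule

        Map<String, String> job = new LinkedHashMap<>();
        job.put("title", "h1.headline");                 // hypothetical job.json root
        job.put("article", "div.article-body");          // hypothetical job.json root

        System.out.println(SCOPE + " -> " + merge(user, job));
    }
}
```

Keeping a list per field means a field can be covered by a user rule, a job.json rule, or both, which is what the mixed case needs.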

Test samples on Imagen del Golfo:

crawlArgs.source = crawlArgs.source.put(Source.SCRAPER_RULE_ARTICLE, "div.siete60 div#contenido");
crawlArgs.source = crawlArgs.source.put(Source.SCRAPER_RULE_TITLE, "div.siete60 div.SlaBLK22");
crawlArgs.source = crawlArgs.source.put(Source.SCRAPER_RULE_DATE, "div.siete60 div.RobBLK12");

Tested three configurations: all rules from user input, all rules loaded from job.json, and a mix of the two.