Closed sritejakv closed 5 years ago
@sritejakv, please post the YAML file that you are using and the URL that you want to crawl.
For the URL http://prototype-gismontgomery.opendata.arcgis.com/datasets, the YAML file is https://github.com/abhihc/Squirrel/blob/enhanced_data_portal_crawling/squirrel.worker/src/test/resources/html_scraper_analyzer/yaml/prototype_gismontgomery.yaml
Here is the list of triples returned by the code in https://github.com/abhihc/Squirrel/blob/enhanced_data_portal_crawling/squirrel.worker/src/main/java/org/dice_research/squirrel/analyzer/impl/html/scraper/HtmlScraper.java:
http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination http://prototype-gismontgomery.opendata.arcgis.com/datasets/1959435a5a81409992b45cc976ac6d1b_0
http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination http://prototype-gismontgomery.opendata.arcgis.com/datasets/d3ba2027c4c8422c83282459c8c29c14_0
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/3f2dc8774b934b17b1c8bf9c2d05d45b_0
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/3f2dc8774b934b17b1c8bf9c2d05d45b_1
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/1959435a5a81409992b45cc976ac6d1b_0
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/d3ba2027c4c8422c83282459c8c29c14_0
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_17
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_39
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_28
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_15
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_2
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_6
http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination "Next"
http://purl.org/dc/terms/title @http://purl.org/dc/terms/title "1-10 of 215 results"
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "jay.mukherjee"
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "melissa.noakes"
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "cgmcgove"
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on March 26, 2014"
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on May 07, 2018"
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on June 18, 2014"
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on August 24, 2015"
@sritejakv
I found some issues in your file:
1 - The indentation was wrong.
2 - You forgot to specify the subject in the YAML file. Check it:
file_descriptor:
  check:
    domain: prototype-gismontgomery.opendata.arcgis.com
    ignore-request: true
  search-result-page:
    regex: datasets?q=
    resources:
      "$uri":
        "http://projekt-opal.de/dataset#link": .card-title a
  detail-page:
    regex: datasets
    resources:
      "$uri":
        "http://purl.org/dc/terms/title": div#main-message
        "http://purl.org/dc/terms/description": span#dataset-description
        "http://purl.org/dc/terms/publisher": span[itemprop="author"]
        "http://purl.org/dc/terms/issued": span[itemprop="datePublished"]
        "http://purl.org/dsnotify/vocab/eventset/sourceDataset": ul#dataset-meta-list li:eq(0) a
        "http://www.w3.org/ns/dcat#accessURL": ul#dataset-meta-list li:eq(1) a
        "http://purl.org/dc/terms/license": ul#dataset-meta-list li:eq(2)
        "http://purl.org/dc/terms/modified": li[itemprop="dateModified"]
        "http://www.w3.org/ns/dcat#downloadURL": .dl-links a
The $uri variable stands for the current URI.
3 - This website loads additional HTML data through AJAX calls, a feature that is not yet available in the scraper. Because of that, it cannot capture the pagination links, which are loaded afterwards, so the pagination is not working. Crawling the first page works fine for me, though.
4 - I removed redundant data from your file.
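To illustrate point 2, here is a minimal sketch of how a `$uri` subject key could be resolved to the page being scraped, so that every extracted value gets a proper subject. It is not Squirrel's actual code: the class `SubjectSketch`, the method `buildTriples`, the `http://example.org/datasets/1` page and the "Example title" value are all hypothetical, and the `values` map stands in for whatever the CSS selectors would have extracted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SubjectSketch {

    /** Placeholder key meaning "the page currently being scraped" (as in the YAML above). */
    static final String URI_PLACEHOLDER = "$uri";

    /**
     * Builds "subject predicate object" strings for one resource block.
     * subjectKey is the key found under "resources" (for example "$uri");
     * values maps each predicate to the text already extracted for it.
     */
    static List<String> buildTriples(String subjectKey, String pageUri, Map<String, String> values) {
        // Assumption: the placeholder resolves to the page URI, any other key is used as-is.
        String subject = URI_PLACEHOLDER.equals(subjectKey) ? pageUri : subjectKey;
        List<String> triples = new ArrayList<>();
        for (Map.Entry<String, String> entry : values.entrySet()) {
            triples.add(subject + " " + entry.getKey() + " \"" + entry.getValue() + "\"");
        }
        return triples;
    }

    public static void main(String[] args) {
        // Made-up extraction result for a made-up detail page.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("http://purl.org/dc/terms/title", "Example title");
        for (String triple : buildTriples("$uri", "http://example.org/datasets/1", values)) {
            System.out.println(triple);
        }
    }
}
```

With the subject key present, the subject comes from the page URI rather than from the predicate.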
@gsjunior86 Did you run the above YAML file with the HtmlScraper.java file linked in the previous comment?
@gsjunior86 Thank you for the comments.
https://github.com/dice-group/Squirrel/blob/f5e798c50af231fd47f73907bb5b6c64251897b3/squirrel.worker/src/main/java/org/dice_research/squirrel/analyzer/impl/html/scraper/HtmlScraper.java#L138
The triples returned by HtmlScraper.java have the same subject and predicate.
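A quick way to confirm this on the dump posted earlier is sketched below. It assumes each triple is printed as `subject @predicate object`, which is the layout visible in that output. The class `TripleDumpCheck` and its methods are hypothetical helpers, not part of Squirrel.

```java
public class TripleDumpCheck {

    /** Splits one printed triple into {subject, predicate, object}. */
    static String[] parse(String line) {
        int at = line.indexOf(" @");
        if (at < 0) {
            throw new IllegalArgumentException("No ' @' separator in: " + line);
        }
        // The predicate runs from after '@' up to the next space.
        int space = line.indexOf(' ', at + 2);
        if (space < 0) {
            throw new IllegalArgumentException("No object in: " + line);
        }
        String subject = line.substring(0, at);
        String predicate = line.substring(at + 2, space);
        String object = line.substring(space + 1);
        return new String[] { subject, predicate, object };
    }

    /** True when the printed subject is identical to the printed predicate. */
    static boolean subjectEqualsPredicate(String line) {
        String[] parts = parse(line);
        return parts[0].equals(parts[1]);
    }

    public static void main(String[] args) {
        // One line copied from the output posted in this issue.
        String line = "http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued \"on March 26, 2014\"";
        System.out.println(subjectEqualsPredicate(line));
    }
}
```

Running the check over every line of the dump would show whether any triple carries a real subject.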