dice-group / Squirrel

Squirrel searches and collects Linked Data

Triples returned are not correct #108

Closed sritejakv closed 5 years ago

sritejakv commented 5 years ago

https://github.com/dice-group/Squirrel/blob/f5e798c50af231fd47f73907bb5b6c64251897b3/squirrel.worker/src/main/java/org/dice_research/squirrel/analyzer/impl/html/scraper/HtmlScraper.java#L138

The triples returned by HtmlScraper.java have the same URI as both subject and predicate.

gsjunior86 commented 5 years ago

@sritejakv, please post the YAML file that you are using and the URL that you want to crawl.

sritejakv commented 5 years ago

For the URL http://prototype-gismontgomery.opendata.arcgis.com/datasets, here is the YAML file: https://github.com/abhihc/Squirrel/blob/enhanced_data_portal_crawling/squirrel.worker/src/test/resources/html_scraper_analyzer/yaml/prototype_gismontgomery.yaml

Here is the list of triples returned by the code in https://github.com/abhihc/Squirrel/blob/enhanced_data_portal_crawling/squirrel.worker/src/main/java/org/dice_research/squirrel/analyzer/impl/html/scraper/HtmlScraper.java:

http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination http://prototype-gismontgomery.opendata.arcgis.com/datasets/1959435a5a81409992b45cc976ac6d1b_0,
http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination http://prototype-gismontgomery.opendata.arcgis.com/datasets/d3ba2027c4c8422c83282459c8c29c14_0,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/3f2dc8774b934b17b1c8bf9c2d05d45b_0,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/3f2dc8774b934b17b1c8bf9c2d05d45b_1,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/1959435a5a81409992b45cc976ac6d1b_0,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/d3ba2027c4c8422c83282459c8c29c14_0,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_17,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_39,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_28,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_15,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_2,
http://projekt-opal.de/dataset#link @http://projekt-opal.de/dataset#link http://prototype-gismontgomery.opendata.arcgis.com/datasets/7fab007eab034c45a53d858859eec341_6,
http://projekt-opal.de/dataset#pagination @http://projekt-opal.de/dataset#pagination "Next",
http://purl.org/dc/terms/title @http://purl.org/dc/terms/title "1-10 of 215 results",
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "jay.mukherjee",
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "melissa.noakes",
http://purl.org/dc/terms/publisher @http://purl.org/dc/terms/publisher "cgmcgove",
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on March 26, 2014",
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on May 07, 2018",
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on June 18, 2014",
http://purl.org/dc/terms/issued @http://purl.org/dc/terms/issued "on August 24, 2015"
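
Every entry in this output repeats the same URI as subject and predicate. A check like the following could flag such entries in a test. It is only a rough sketch: it assumes the `subject @predicate object` text layout shown above, and `TripleCheck`, its methods, and the `example.org` URI are hypothetical names for illustration, not part of Squirrel.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Squirrel: flags printed triples whose
// subject and predicate are the same URI.
final class TripleCheck {

    // Assumes the "subject @predicate object" layout seen in the output above.
    static boolean hasSameSubjectAndPredicate(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length < 3 || !parts[1].startsWith("@")) {
            return false; // not in the expected layout, so nothing to compare
        }
        // Drop the leading '@' from the predicate before comparing.
        return parts[0].equals(parts[1].substring(1));
    }

    // Returns only the lines whose subject equals their predicate.
    static List<String> findSuspicious(List<String> lines) {
        return lines.stream()
                .filter(TripleCheck::hasSameSubjectAndPredicate)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            // Taken from the output above: subject == predicate.
            "http://purl.org/dc/terms/title @http://purl.org/dc/terms/title \"1-10 of 215 results\"",
            // Made-up well-formed triple for contrast.
            "http://example.org/page @http://purl.org/dc/terms/title \"Some title\"");
        findSuspicious(lines).forEach(System.out::println);
    }
}
```

Running `main` prints only the first line, since the second one has distinct subject and predicate.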

gsjunior86 commented 5 years ago

@sritejakv

I found some issues in your file:

1 - The indentation was wrong.

2 - You forgot to specify the subject in the YAML file. Here is a corrected version:

file_descriptor:
 check:
  domain: prototype-gismontgomery.opendata.arcgis.com
  ignore-request: true

 search-result-page:
  regex: datasets?q=
  resources:
   "$uri":
    "http://projekt-opal.de/dataset#link": .card-title a

 detail-page:
  regex: datasets
  resources:
   "$uri":
    "http://purl.org/dc/terms/title": div#main-message
    "http://purl.org/dc/terms/description": span#dataset-description
    "http://purl.org/dc/terms/publisher": span[itemprop="author"]
    "http://purl.org/dc/terms/issued": span[itemprop="datePublished"]
    "http://purl.org/dsnotify/vocab/eventset/sourceDataset": ul#dataset-meta-list li:eq(0) a
    "http://www.w3.org/ns/dcat#accessURL": ul#dataset-meta-list li:eq(1) a
    "http://purl.org/dc/terms/license": ul#dataset-meta-list li:eq(2)
    "http://purl.org/dc/terms/modified": li[itemprop="dateModified"]
    "http://www.w3.org/ns/dcat#downloadURL": .dl-links a

The $uri variable stands for the URI of the page currently being crawled.

3 - This website loads additional HTML through AJAX calls, which the scraper does not support yet. As a result, the scraper cannot capture the pagination links, because they are loaded after the initial page, so pagination does not work. Crawling the first page works fine for me, though.

4 - I removed redundant data from your file.
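
To illustrate point 2, here is a minimal sketch of what a `$uri` resource key is meant to achieve: the placeholder is replaced by the URI of the current page, which then serves as the subject of every triple built from that resource block. This is not Squirrel's actual implementation; `SubjectResolver`, its methods, and the sample publisher value are hypothetical and only show the idea.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Squirrel's implementation.
final class SubjectResolver {

    static final String URI_PLACEHOLDER = "$uri";

    // Replaces the "$uri" resource key with the URI of the page being scraped;
    // any other key is used as the subject unchanged.
    static String resolveSubject(String resourceKey, String currentUri) {
        return URI_PLACEHOLDER.equals(resourceKey) ? currentUri : resourceKey;
    }

    // Builds {subject, predicate, object} triples for one resource entry.
    // `scraped` maps each predicate URI to the text its selector matched.
    static List<String[]> buildTriples(String resourceKey, String currentUri,
                                       Map<String, String> scraped) {
        String subject = resolveSubject(resourceKey, currentUri);
        List<String[]> triples = new ArrayList<>();
        for (Map.Entry<String, String> entry : scraped.entrySet()) {
            triples.add(new String[] {subject, entry.getKey(), entry.getValue()});
        }
        return triples;
    }

    public static void main(String[] args) {
        Map<String, String> scraped = new LinkedHashMap<>();
        // Illustrative value, not real scraper output.
        scraped.put("http://purl.org/dc/terms/publisher", "example-publisher");
        String page = "http://prototype-gismontgomery.opendata.arcgis.com/datasets";
        for (String[] triple : buildTriples("$uri", page, scraped)) {
            System.out.println(String.join(" ", triple));
        }
    }
}
```

With the subject resolved this way, each triple carries the page URI as subject and the configured property as predicate, rather than the same URI in both positions.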

sritejakv commented 5 years ago

@gsjunior86 Did you run the above YAML file with the HtmlScraper.java file linked in my previous comment?

sritejakv commented 5 years ago

@gsjunior86 Thank you for the comments.