Samita53 / ldspider

Automatically exported from code.google.com/p/ldspider

Microdata support #23

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
Hello, 

I've been testing ldspider with different types of pages to check whether it
extracts data correctly. RDF/XML pages seem to be straightforward, but
crawling an HTML page that contains microdata/RDFa markup doesn't seem to yield
any data.

I'm using the ldspider CLI; here is the command:
java -jar ldspider.jar -any23 -c 2 -s seed.txt -o data.txt -a access-log.txt -v file-log.txt

access-log.txt content:
1347913722 1110 127.0.0.1 TCP_MISS/200 1909 GET http://www.guardian.co.uk/robots.txt - NONE/- text/plain
1347913722 0 127.0.0.1 TCP_MISS/499 -1 GET http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared - NONE/- null
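
To make the two log lines above easier to read, here is a minimal Python sketch that splits each entry into fields and lists the URLs that did not come back with status 200. The field order is an assumption inferred only from these two sample lines (the layout resembles a Squid-style access log); the names AccessEntry, parse_access_line and not_downloaded are hypothetical helpers and not part of ldspider.

```python
from typing import List, NamedTuple


class AccessEntry(NamedTuple):
    """One access-log entry. Field meanings are assumed, not documented."""
    timestamp: int
    elapsed: int
    client: str
    result: str
    status: int
    size: int
    method: str
    url: str
    content_type: str


def parse_access_line(line: str) -> AccessEntry:
    # Assumed layout (10 whitespace-separated fields):
    # time elapsed client result/status bytes method url ident hierarchy type
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    result, status = fields[3].split("/", 1)
    return AccessEntry(
        timestamp=int(fields[0]),
        elapsed=int(fields[1]),
        client=fields[2],
        result=result,
        status=int(status),
        size=int(fields[4]),
        method=fields[5],
        url=fields[6],
        content_type=fields[9],
    )


def not_downloaded(lines: List[str]) -> List[str]:
    """Return the URLs of all entries whose status is not 200."""
    entries = map(parse_access_line, lines)
    return [entry.url for entry in entries if entry.status != 200]


# The two entries from access-log.txt, with the wrapped lines joined.
SAMPLE_LOG = [
    "1347913722 1110 127.0.0.1 TCP_MISS/200 1909 GET "
    "http://www.guardian.co.uk/robots.txt - NONE/- text/plain",
    "1347913722 0 127.0.0.1 TCP_MISS/499 -1 GET "
    "http://www.guardian.co.uk/commentisfree/2012/sep/17/"
    "cameron-goes-where-thatcher-never-dared - NONE/- null",
]

if __name__ == "__main__":
    for url in not_downloaded(SAMPLE_LOG):
        print(url)
```

Run against the sample, this reports only the article URL: robots.txt was fetched with status 200, while the article entry carries status 499, a size of -1 and a null content type.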

The data.txt file is empty.

Here is a test page that contains some microdata markup:
http://www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared
Extracting the embedded data with the any23 service does yield some data:
http://any23.org/any23/best/http:/www.guardian.co.uk/commentisfree/2012/sep/17/cameron-goes-where-thatcher-never-dared

Any clues?
Thanks in advance.

Original issue reported on code.google.com by remon.sh...@gmail.com on 17 Sep 2012 at 8:33

GoogleCodeExporter commented 9 years ago
As per Andrias and Tobias in 
https://groups.google.com/forum/?fromgroups=#!topic/ldspider/jOCBkeYkKlU

The quick answer is: set -ctIgnore,
because your URI has been blocked by a FetchFilter that only allows documents
served with an RDF/XML MIME type to be downloaded. -ctIgnore disables
FetchFilters and passes everything that is not RDF/XML to any23. Note that only
a subset of what any23 can do is loaded by default, but RDFa is included; cf.
-any23ext.

The more technical answer is: there is an inconsistency between the FetchFilters,
ldspider's HTTP Accept header, what the loaded parsers claim to do, and what
they are actually able to do.
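
The behaviour described in this comment can be pictured with a small Python sketch. It is only a conceptual model and not ldspider's actual code: should_fetch, choose_parser and the ct_ignore parameter are hypothetical names, and the only facts taken from the explanation are that the default filter admits RDF/XML documents only and that -ctIgnore lets everything else through to any23. The MIME type application/rdf+xml is the registered type for RDF/XML.

```python
from typing import Optional

RDF_XML = "application/rdf+xml"  # registered MIME type for RDF/XML


def media_type(content_type: Optional[str]) -> Optional[str]:
    """Strip parameters such as '; charset=UTF-8' and normalise case."""
    if content_type is None:
        return None
    return content_type.split(";", 1)[0].strip().lower()


def should_fetch(content_type: Optional[str], ct_ignore: bool = False) -> bool:
    """Hypothetical stand-in for a content-type based fetch filter.

    With ct_ignore=False only RDF/XML documents pass. With ct_ignore=True
    the filter is disabled, mirroring what -ctIgnore is described as doing.
    """
    if ct_ignore:
        return True
    return media_type(content_type) == RDF_XML


def choose_parser(content_type: Optional[str]) -> str:
    """RDF/XML goes to the RDF/XML parser, everything else to any23."""
    return "rdfxml" if media_type(content_type) == RDF_XML else "any23"


if __name__ == "__main__":
    html = "text/html; charset=UTF-8"
    print(should_fetch(html))                  # blocked by the default filter
    print(should_fetch(html, ct_ignore=True))  # allowed once the filter is off
    print(choose_parser(html))                 # handed to any23
```

In this model an HTML page with microdata or RDFa never reaches a parser unless the filter is switched off, which matches the empty data.txt in the original report.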

Original comment by remon.sh...@gmail.com on 19 Sep 2012 at 2:24

GoogleCodeExporter commented 9 years ago
Looks like this is settled.

Original comment by tob.kae...@gmail.com on 15 Jul 2014 at 8:55