dice-group / Squirrel

Squirrel searches and collects Linked Data

Invalid triples are generated from Microformat analysers. #98

Closed sritejakv closed 5 years ago

sritejakv commented 5 years ago

For the URI https://data.cambridgeshireinsight.org.uk/dataset/empty-homes, when the crawler is run with only the MicrodataParser, MicroformatMF2JParser, and RDFaSemarglParser analyzers, invalid triples are printed to the log, as shown. The full log file is attached to this issue.

screenshot 2019-01-25 at 2 22 36 pm

log.txt

gsjunior86 commented 5 years ago

@sritejakv, I tested the URL you provided.

I found an error in the MicrodataParser, which was returning an empty collector. The triples that you see are the microformats found by the MicroformatMF2JParser.

I'm working on a fix and will let you know ASAP.

Thanks,

gsjunior86 commented 5 years ago

@sritejakv, I tried several libraries for Microdata and RDFa (any23, semargl, java-rdfa), and all of them use the SAX API to parse HTML documents. The problem is that the parser does not handle unclosed tags, like the tags, so it throws an exception and does not extract the entire document. I tested some online parsers written in PHP, and they seem to handle this problem better.

The solution would be to write an alternative to these libraries.
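
To illustrate the failure mode described above, here is a minimal sketch that uses only the JDK's built-in strict XML SAX parser. It is not the code path of any23, semargl, or java-rdfa. The class name `StrictSaxDemo`, the helper `parses`, and the `<br>` tag are assumptions chosen for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of Squirrel or of any library named in this thread.
final class StrictSaxDemo {

    /** Returns true if the JDK's strict XML SAX parser reads the markup to the end. */
    static boolean parses(String markup) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A well-formedness error aborts the parse; content after it is never read.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: every element is closed.
        System.out.println(parses("<html><body><p>ok</p><br/></body></html>")); // true
        // Valid HTML, but <br> is never closed, so the strict parser gives up.
        System.out.println(parses("<html><body><p>ok</p><br></body></html>"));  // false
    }
}
```

Because the parser stops at the first well-formedness error, any structured data that appears later in the page is never seen by the extractor.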

gsjunior86 commented 5 years ago

@sritejakv, I solved the parsing problem by running an HTML cleaner before parsing. You can use both RDFaAnalyzer and MicrodataAnalyzer now. Note that the URI you used can also return an RDF file, so in this case do not forget to include the RDFAnalyzer as well.
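
The clean-then-parse idea can be sketched with the JDK alone. The comment above does not say which cleaner was used, so the regex-based `selfCloseVoidTags` helper below is a toy stand-in, not Squirrel's implementation; a real HTML cleaner repairs far more than void elements. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a toy stand-in for a real HTML cleaner.
final class CleanThenParseDemo {

    // HTML void elements, which never take a closing tag.
    private static final Pattern VOID_TAG = Pattern.compile(
            "<(area|base|br|col|embed|hr|img|input|link|meta|source|track|wbr)\\b([^<>]*?)/?>",
            Pattern.CASE_INSENSITIVE);

    /** Rewrites void elements such as <br> as self-closing (<br/>). */
    static String selfCloseVoidTags(String html) {
        return VOID_TAG.matcher(html).replaceAll("<$1$2/>");
    }

    /** Returns true if the JDK's strict XML SAX parser reads the markup to the end. */
    static boolean parses(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<html><head><meta charset=\"utf-8\"></head>"
                + "<body><p>a<br>b</p></body></html>";
        String cleaned = selfCloseVoidTags(raw);
        System.out.println(parses(raw));     // false: <meta> and <br> are unclosed
        System.out.println(parses(cleaned)); // true: the cleaned markup is well-formed
    }
}
```

The sketch only shows why cleaning first lets a SAX-based extractor reach the end of the document. It does not handle attribute values containing `>`, unclosed non-void tags, or unescaped entities.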