Closed (sritejakv closed 5 years ago)
@sritejakv, I tested the URL you provided.
I found an error in the MicrodataParser, which was returning an empty collector. The triples that you see are the microformats found by the MicroformatMF2JParser.
I'm working on this fix and will let you know ASAP.
Thanks,
@sritejakv, I tried some solutions for Microdata and RDFa (any23, semargl, java-rdfa), and all of them use the SAX API to parse HTML documents. The problem is that the parser does not deal with unclosed tags, like the tags, so it throws an exception and does not extract the entire document. I tested some online parsers in PHP, and they seem to deal with this problem better.
The solution would be to write an alternative for these libs.
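To make the failure mode concrete, here is a minimal sketch that uses only the JDK's built-in SAX parser, not any23, semargl, or java-rdfa themselves. The class name `SaxUnclosedTagDemo`, the helper `parses`, and the sample markup are made up for illustration; the assumption is that those libraries hit the same kind of well-formedness error when they feed real-world HTML to a strict SAX parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxUnclosedTagDemo {

    // Hypothetical helper: returns true if the JDK's default SAX parser reads
    // the markup to the end, false if it stops with a well-formedness error.
    static boolean parses(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // Raised for markup that is valid HTML but not well-formed XML.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        // Legal HTML: <img> is a void element and needs no end tag.
        String unclosed = "<html><body><img src=\"a.png\"><p>text</p></body></html>";
        // The same content written as well-formed XML.
        String closed = "<html><body><img src=\"a.png\"/><p>text</p></body></html>";

        System.out.println("unclosed <img>: " + parses(unclosed));
        System.out.println("closed <img/>:  " + parses(closed));
    }
}
```

The first document is rejected and the second is accepted. Because the error is fatal, the parser stops at that point, which fits the observation that the rest of the document is never extracted.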
@sritejakv, I solved the parsing problem by running an HTML cleaner before parsing. You can use both RDFaAnalyzer and MicrodataAnalyzer now. But note that the URI you used can return an RDF file. For that case, do not forget to include the RDFAnalyzer as well.
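As a rough illustration of the clean-then-parse idea, the sketch below normalizes markup before handing it to the JDK's SAX parser. It is not the cleaner that was added to the project: `closeVoidElements` is a hypothetical, regex-based stand-in that only self-closes a handful of void elements, whereas a real HTML cleaner also repairs nesting, entities, valueless attributes, and much more. The class name, the `itemprops` helper, and the sample markup are likewise assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CleanThenParseSketch {

    // Toy stand-in for an HTML cleaner (assumption, not the project's code):
    // rewrites a few void elements such as <br> or <meta ...> as <br/> or <meta .../>.
    static String closeVoidElements(String html) {
        return html.replaceAll("(?i)<(br|hr|img|input|link|meta)\\b([^>]*?)/?>", "<$1$2/>");
    }

    // Collects the values of all itemprop attributes seen by a strict SAX parse.
    static List<String> itemprops(String markup) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                String prop = attrs.getValue("itemprop");
                if (prop != null) {
                    found.add(prop);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("markup is still not well-formed", e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample: the <meta> and <br> tags are unclosed, as is common in HTML.
        String raw = "<div itemscope=\"\" itemtype=\"http://schema.org/Dataset\">"
                + "<meta itemprop=\"name\" content=\"Example dataset\"><br>"
                + "<span itemprop=\"description\">Example text</span></div>";

        String cleaned = closeVoidElements(raw);
        System.out.println(cleaned);
        System.out.println(itemprops(cleaned));
    }
}
```

Cleaning as a separate step keeps the existing SAX-based extractors unchanged: they simply receive well-formed input instead of raw HTML.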
For the URI https://data.cambridgeshireinsight.org.uk/dataset/empty-homes, when the crawler is run with only the MicrodataParser, MicroformatMF2JParser, and RDFaSemarglParser analyzers, invalid triples are printed to the log as shown. The full log file is attached to the issue.
log.txt