Open justinccdev opened 6 years ago
Also, https://github.com/RDFLib/rdflib-jsonld may be worth a look.
For reference, SolrIndexer_process_configured_properties() is the method I'm talking about. As you can see, it simply looks at basic properties at the top layer of the JSON-LD, which was okay for a proof of concept but not enough for the long term.
Yeah, the primitive nature of the scan is quite embarrassing now, so I intend to work on this soon, probably before the possible scrapy/frontera port of this crawler.
At the moment, bsbang-crawl does a very hokey top-level crawl of the captured JSON-LD. This captures only a small amount of information, mainly because it was written as a proof of concept, and even crawling a small amount is still useful.
However, this will need to become much more sophisticated in the long term, crawling nested JSON-LD structures to some arbitrary depth. We probably don't want to write this code ourselves (unless it's very easy) but rather use a library such as https://github.com/digitalbazaar/pyld, if it has appropriate facilities.
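To give a rough idea of what crawling to a configurable depth might involve if we did write it by hand, here is a minimal sketch using only the standard library. The names `walk_jsonld`, `flatten`, `EXAMPLE_DOC` and the dotted field naming are all hypothetical and not existing bsbang-crawl code. It also ignores `@context` expansion entirely, which is the part a JSON-LD library such as pyld would take care of.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical sample document, only for illustration.
EXAMPLE_DOC = {
    "@context": "http://schema.org",
    "@type": "DataCatalog",
    "name": "Example catalog",
    "keywords": ["a", "b"],
    "provider": {
        "@type": "Organization",
        "name": "Example org",
        "address": {"@type": "PostalAddress", "addressCountry": "GB"},
    },
}


def walk_jsonld(
    node: Any, max_depth: int = 5, path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Any]]:
    """Yield (path, value) for each scalar whose path has at most max_depth keys."""
    if isinstance(node, dict):
        # Stop descending once the path is already max_depth keys long.
        if len(path) >= max_depth:
            return
        for key, value in node.items():
            yield from walk_jsonld(value, max_depth, path + (key,))
    elif isinstance(node, list):
        # Lists do not add a path component; each item shares the parent key.
        for item in node:
            yield from walk_jsonld(item, max_depth, path)
    else:
        yield path, node


def flatten(doc: Dict[str, Any], max_depth: int = 5) -> Dict[str, List[Any]]:
    """Collect scalars into dotted field names, e.g. 'provider.name'."""
    fields: Dict[str, List[Any]] = {}
    for path, value in walk_jsonld(doc, max_depth):
        fields.setdefault(".".join(path), []).append(value)
    return fields
```

With `max_depth=1` this behaves like the current top-level scan, so nested objects such as `provider` are skipped. Raising the limit picks up fields like `provider.name` and `provider.address.addressCountry`.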
Also need to check that this isn't obviated by Apache Nutch if we switch to that for crawling.