buzzbangorg / bsbang-crawler

Alpha project for crawling bioschemas JSON-LD
Apache License 2.0

Process crawled JSON-LD to multiple levels, possibly using another library #4

Open justinccdev opened 6 years ago

justinccdev commented 6 years ago

At the moment, bsbang-crawl does a very hokey top-level crawl of the captured JSON-LD. This captures only a small amount of information, mainly because it was written as a proof of concept, and even crawling a small amount is still useful.

However, this will need to become much more sophisticated in the long term, crawling nested JSON-LD structures to some arbitrary depth. We probably don't want to write this code ourselves (unless it's very easy) but rather use a library such as https://github.com/digitalbazaar/pyld, if it has appropriate facilities.
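To give a feel for what "arbitrary depth" might involve if we did end up writing it ourselves, here is a minimal stdlib-only sketch. It is not bsbang-crawler code and it does not use pyld. The function name `walk_jsonld`, the dotted-path convention, and the sample document are all assumptions made up for illustration, and it treats the JSON-LD as plain nested JSON without any context expansion.

```python
from typing import Any, Iterator, Tuple


def walk_jsonld(node: Any, path: str = "", max_depth: int = 10) -> Iterator[Tuple[str, Any]]:
    """Yield (dotted_path, scalar_value) pairs from nested JSON-LD-like data.

    Hypothetical helper: max_depth counts how many levels of nested
    objects (dicts) are entered; max_depth=1 sees only top-level scalars.
    """
    if isinstance(node, dict):
        if max_depth <= 0:
            return  # depth budget used up, do not enter this object
        for key, value in node.items():
            child_path = f"{path}.{key}" if path else key
            yield from walk_jsonld(value, child_path, max_depth - 1)
    elif isinstance(node, list):
        # List items share the parent's path and do not consume depth.
        for item in node:
            yield from walk_jsonld(item, path, max_depth)
    else:
        yield path, node


# Made-up sample document, only for demonstration.
doc = {
    "@context": "http://schema.org",
    "@type": "PhysicalEntity",
    "name": "Example sample",
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "organism", "value": "Homo sapiens"},
    ],
}

pairs = list(walk_jsonld(doc))
shallow = list(walk_jsonld(doc, max_depth=1))
```

With the default depth, `pairs` includes the nested `additionalProperty.value` entry, while `shallow` stops at the top-level scalars. A real library would additionally handle `@context` resolution, `@id` references and `@graph`, which this sketch ignores.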

We also need to check that this isn't obviated by Apache Nutch if we switch to that for crawling.

justinccdev commented 6 years ago

Also, https://github.com/RDFLib/rdflib-jsonld may be worth a look

justinccdev commented 6 years ago

For reference, SolrIndexer_process_configured_properties() is the method I'm talking about. As you can see, it simply looks at basic properties at the top layer of the JSON-LD, which was okay for a proof of concept but not enough for the long term.
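As a rough illustration of why a top-layer scan loses information, the sketch below mimics the general idea of looking up configured property names on the top-level object only. It is a hypothetical stand-in, not the actual method: `scan_top_level`, the property list, and the sample document are all invented for this example.

```python
from typing import Any, Dict, Iterable


def scan_top_level(doc: Dict[str, Any], configured: Iterable[str]) -> Dict[str, Any]:
    """Collect configured properties that are plain scalars at the top level.

    Hypothetical stand-in for a top-layer scan: nested objects and lists
    are skipped rather than descended into.
    """
    found: Dict[str, Any] = {}
    for prop in configured:
        value = doc.get(prop)
        if isinstance(value, (str, int, float, bool)):
            found[prop] = value
    return found


# Made-up sample document, only for demonstration.
doc = {
    "@type": "PhysicalEntity",
    "name": "Example sample",
    "additionalProperty": [
        {"@type": "PropertyValue", "name": "organism", "value": "Homo sapiens"},
    ],
}

indexed = scan_top_level(doc, ["name", "url", "additionalProperty"])
```

Here `indexed` ends up holding only `name`: `url` is absent from the document, and the whole `additionalProperty` list is dropped because it is not a scalar, so the organism value never reaches the index.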

justinccdev commented 6 years ago

Yeah, the primitive nature of the scan is really quite embarrassing now, so I intend to work on this soon, probably before the possible scrapy/frontera port of this crawler.