buzzbangorg / bsbang-crawler

Alpha project for crawling bioschemas JSON-LD
Apache License 2.0

Improve schema properties configuration mechanism #3

Open justinccdev opened 6 years ago

justinccdev commented 6 years ago

At the moment, there is a bioschemas.__init__.py.DEFAULT_CONFIG, but there is no easy way for a user to override or replace it (tests have their own mechanism). We need to make this easier to configure without having to change a file under source control. I'm thinking we can keep the Python format but have it as conf/bsbang-conf.py or similar.
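As a rough illustration of the idea, a user config kept in Python format could be loaded from a path and layered over the defaults. This is only a sketch under assumptions: the keys in DEFAULT_CONFIG below, the CONFIG variable name, and the load_config helper are all hypothetical and not taken from the bsbang-crawler source. It uses runpy.run_path because a file name containing a hyphen, such as bsbang-conf.py, cannot be imported with a normal import statement.

```python
import copy
import os
import runpy

# Hypothetical stand-in for bioschemas.DEFAULT_CONFIG; the real keys may differ.
DEFAULT_CONFIG = {
    "schemas_to_parse": ["DataCatalog"],
    "properties": {"DataCatalog": ["description", "keywords"]},
}


def load_config(path=None):
    """Return the defaults, overridden by a user config file if one exists.

    The user file is plain Python and is assumed to define a dict named
    CONFIG (a hypothetical convention). Only top-level keys are replaced;
    nested dicts are not merged.
    """
    config = copy.deepcopy(DEFAULT_CONFIG)
    if path is not None and os.path.exists(path):
        # run_path executes the file and returns its global namespace.
        user_globals = runpy.run_path(path)
        config.update(user_globals.get("CONFIG", {}))
    return config
```

With this shape, a conf/bsbang-conf.py that is not under source control would only need to define the keys it wants to change; whether overrides should replace or deep-merge nested entries is a design choice left open here.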

innovationchef commented 6 years ago

Issue #4 mentions that "bsbang-crawl does a very hokey top-level crawl of the JSON-LD captured. This only captures a very small amount of information, mainly because this was for proof of concept." The current DEFAULT_CONFIG is what you are referring to, right? So do you want a system in which users can supply their own config for the crawl?

justinccdev commented 6 years ago

Yes, allowing users to easily put in or override parts of the default config is what this is about. But currently, no config will allow indexing of the JSON-LD deeper than the very first layer of basic properties (e.g. in a Bioschemas DataCatalog it can grab the description and keywords when properly configured, but will do nothing with the provider, since that's an embedded Person/Organization structure; even deeper, Person itself potentially has lots of complex properties like Person.affiliation).

Handling this relies on some resolution of #4, though that in itself will need some sane way to configure the depth of the search (e.g. that we want DataCatalog.provider but not DataCatalog.provider.affiliation).
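One possible way to express the depth in config, sketched here purely as an assumption and not as what bsbang-crawler does, is to list the wanted properties as dotted paths relative to the top-level type. Anything not listed is simply never visited, so provider.name can be indexed while provider.affiliation is left alone. The resolve and extract helpers and the sample document are hypothetical.

```python
def resolve(node, keys):
    """Follow a list of keys into nested JSON-LD, returning all matches."""
    if isinstance(node, list):
        # JSON-LD values may be lists; follow the path into each item.
        found = []
        for item in node:
            found.extend(resolve(item, keys))
        return found
    if not keys:
        return [node]
    if isinstance(node, dict) and keys[0] in node:
        return resolve(node[keys[0]], keys[1:])
    return []


def extract(jsonld, paths):
    """Map each configured dotted path to the values found for it."""
    result = {}
    for path in paths:
        values = resolve(jsonld, path.split("."))
        if values:
            result[path] = values
    return result


# Hypothetical sample document and path list, for illustration only.
catalog = {
    "@type": "DataCatalog",
    "description": "An example catalog",
    "provider": {
        "@type": "Person",
        "name": "A. Provider",
        "affiliation": {"@type": "Organization", "name": "Example Org"},
    },
}
wanted = ["description", "provider.name"]
print(extract(catalog, wanted))
# {'description': ['An example catalog'], 'provider.name': ['A. Provider']}
```

An explicit path list avoids a separate numeric depth setting, at the cost of having to name every nested property that should be indexed.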