USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars 143 forks source link

Revamp Sparkler Config - Type checking + Validation + *-site.yaml overriding #93

Open thammegowda opened 7 years ago

thammegowda commented 7 years ago

There are two FIXME: in configuration:

First, support loading sparkler-defaults.yaml and sparkler-site.yaml. The common practice is *-default.yaml provides default and recommended values from developers. The *-site.yaml should beused by users to customize or to override the defaults. Right now sparkler-default.yaml is supported and sparkler-site.yaml is not yet supported.

Second, towards the stability of system and fail early when errors are introduced by humans. Right now YAML configuration is transformed to JSON and internally it is treated as JSON. The system wouldn't know ahead of time if config is invalid. There is no type check for values, no schema validation etc. This could introduce errors to system and crash in later stages when the input config is invalid. As you see, except the config module, the rest of system is in staticly typed and strictly enforced. I think it will be nice to fail early when the config is invalid.

thammegowda commented 7 years ago

Reply from @SHASHANK-PRO-05 in IRDS email thread (Posting here for visibility)

Hi All, I am working on default config override and to test the validity of the key I need to setup a datatype(String, Integer whatever it is) and regex(If URLs what type of URLs are acceptable) in YAML(Correct me If I am wrong). With this, I will setup an automated code which will check the regex and datatype whenever the config parameters expand. So my question is should I create a new YAML File for this or use SPARKLER_DEFAULT.yaml? My suggestion is I create a new YAML file and don't pollute default config. This is one way to solve the problem. If you have a better approach let me know I will look into that.

thammegowda commented 7 years ago

Reply from @thammegowda in IRDS mailing list

Hi Shashank, Yes that would be one way to solve problem and a nice approach. But I am not happy with regex and the complexities + errors they will introduce. Perhaps there should be a simpler solution than regex. I and karanjeet had a discussion on this:

  • Define Java POJO classes that match the structure of existing Sparkler config.
  • Use Yaml data bind to deserialize Yaml to Java POJO. Refer [1] section 'Complex graphs'
  • Validations will be taken care by java types. For example the java constructor for URL class will raise exception when url string is invalid.
  • Then basic check for nulls/missing values can be made at a later phase. perhaps there should be not null annotations and frameworks to scan java beans. If we want to modify structure of existing Yaml file to simplify this process, I am +1 to do it. Note: some keys we may not know ahead of time, especially in case of plugins section. One approach would be to assume a generic map in first phase and let the plugin utility validate at second phase. Open to suggestions and feedbacks.
  • Cheers, TG

  • [1] http://yamlbeans.sourceforge.net/

thammegowda commented 7 years ago

Reply from @SHASHANK-PRO-05 in IRDS mailing list

Hi Thamme, I understand your concern with regex. I will look into the solutions by using directions provided by you.

thammegowda commented 7 years ago

Reply from @SHASHANK-PRO-05 in IRDS mailing list:

Hi All, I am working with Hibernate-Validator and I am going to start writing the solution for the Configuration Management. Just wanted to give you heads up on some things 1) I am going to import some dependencies in pom.xml which are hibernate-validator, javax.el, and el-API. 2) I am planning to convert SparklerConfiguration such that it will map all the configs variables to java variables which will have annotations defined on them. 3) Extending of JSONObject in SparklerConfiguration, I am going to keep intact. But, I would recommend using Getters and Setters here. I will write the implementations of getters and setters and in later phases, we can upgrade to that(If this method is chosen).
4) I will run the validator instance during the Constructor Call, which will throw the exception if the keys are not valid. 5) I am making a masking function to override the sparkler-site to sparkler-default config map.

Let me know if anything needs to be altered or added.

Also, let me know some constraints you want to get validated in the configuration. I will work on that with this patch for FIXME in newDefaultConfig(Constants in Sparkler). Also, we should change the names where we are using newDefaultConfig to getSystemConfig(). This name change affects other parts of the code which I am not aware of right now so I am going to write a TODO there and not do that in this patch.

thammegowda commented 7 years ago

@SHASHANK-PRO-05 Sounds good to me. 1) Perfect! It's a great solution for bean validation. 2) 👍 for now. 3) 👍 4) Perfect 5) Awesome. That's very much needed.

I will be excited to see these changes coming! Send me a link to your repository and your branch, I will check them and provide early feedback as quickly as possible

CC @karanjeets Check this out.

sk-s-hub commented 7 years ago

@thammegowda @karanjeets Let me know if you find things that can be done better. https://github.com/SHASHANK-PRO-05/sparkler.git. I am doing development in "dev" branch. Also, I am creating reusable annotations in utils package. Do check that out also