chaoss / grimoirelab-elk

GNU General Public License v3.0

Check whether the enrichment produced the expected result #353

Open canasdiaz opened 6 years ago

canasdiaz commented 6 years ago

Since the number of studies will increase during the next months, we need to design a way for the system to check whether the result obtained is the expected one. A good example is the "areas of code" analysis, where it is quite difficult to know, based on the raw data, whether the data set was correctly produced.

The idea of checking whether the result obtained is "correct" can also be applied to more basic enrichment processes, such as Git, where the relationship between raw and enriched items is 1:1.

From the operations point of view, we should allow our users to know whether the result written to the enriched indexes is the expected one or whether any error was found. This information could be added to the log, or even to a dedicated index so it could be displayed on the dashboard itself.

From my point of view, I would expect a design that covers all data sources and studies, plus a basic implementation for them. Let's imagine an example of how this could be written to the log:
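A hypothetical format could look like the following (the fields and wording are illustrative assumptions, not an existing GrimoireLab log):

```
2018-05-24 19:07:02 CHECK git_enrich: raw=399449 enriched=399449 (1:1) -> OK
2018-05-24 19:07:03 CHECK areas_of_code: read=58168 written=399449 -> OK
```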

alpgarcia commented 6 years ago

For the time being, we have logs like:

2018-05-24 19:06:41,541 SUMMARY: Items total read/total processed/total written: 58168/399449/399449

This is from areas of code, but most gelk processes could share similar information.

From my point of view, I suggest auditing all executions in a database, so that we have at least minimal information about what happened, let's say:

| Task | Start date | End date | Read | Processed | Written | Status | Message |
|---|---|---|---|---|---|---|---|
| git_enrich | 24/05/18 10:00 | 24/05/18 10:30 | 1000 | 5000 | 5000 | OK | |
| areas_of_code | 24/05/18 11:00 | 24/05/18 11:12 | 1000 | 2300 | 2300 | OK | |
| git_enrich | 25/05/18 10:00 | 25/05/18 10:30 | 1000 | 5000 | 5000 | OK | |
| areas_of_code | 25/05/18 11:00 | 25/05/18 11:05 | 100 | 120 | 120 | KO | \<Error stack trace> |

We can add more columns if we find more useful information we would like to have there. This way, we could query that table to know whether a given task is running properly: for instance, whether it is running on a daily basis, whether it is not taking too long, and whether the numbers of items read and written are consistent from one day to another, and even with the number of items we have in the indices.

All those checks can be done manually by an operator if needed, but I think it would be easy to do them automatically.
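As a rough illustration of how such automatic checks could work, here is a minimal sketch using SQLite; the table layout, column names, and threshold are assumptions made up for this example, not part of any existing GrimoireLab component:

```python
import sqlite3

conn = sqlite3.connect("gelk_audit.db")

# Hypothetical audit table mirroring the columns proposed above.
conn.execute("""
    CREATE TABLE IF NOT EXISTS task_audit (
        task TEXT,
        start_date TEXT,
        end_date TEXT,
        items_read INTEGER,
        items_processed INTEGER,
        items_written INTEGER,
        status TEXT,
        message TEXT
    )
""")

def task_looks_healthy(task, max_ratio=2.0):
    """Check the last two executions of a task: the latest status must
    be OK and the written count must be consistent day to day."""
    rows = conn.execute(
        "SELECT status, items_written FROM task_audit "
        "WHERE task = ? ORDER BY start_date DESC LIMIT 2",
        (task,)).fetchall()
    if not rows:
        return False  # the task never ran
    status, written = rows[0]
    if status != "OK":
        return False
    if len(rows) == 2 and rows[1][1] > 0:
        # Flag runs whose output differs too much from the previous one.
        ratio = written / rows[1][1]
        return 1 / max_ratio <= ratio <= max_ratio
    return True
```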

I will implement a service allowing tasks to log when they start and finish. The service would be pretty simple and would listen for those events. Each of those events would include all the info we want to log (dates, number of items read and written, etc.).
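A minimal sketch of what that service could look like, assuming an in-process API (all names here are made up for illustration; this is not existing GrimoireLab code):

```python
from datetime import datetime, timezone

class TaskAuditService:
    """Hypothetical single entry point where every task reports its
    start and finish events in a common format."""

    def __init__(self, backend):
        # The backend decides persistence: a plain log file,
        # a database table, an Elasticsearch index...
        self.backend = backend
        self._running = {}

    def task_started(self, task):
        self._running[task] = datetime.now(timezone.utc)

    def task_finished(self, task, read, processed, written,
                      status="OK", message=""):
        start = self._running.pop(task, None)
        self.backend.write({
            "task": task,
            "start_date": start.isoformat() if start else None,
            "end_date": datetime.now(timezone.utc).isoformat(),
            "read": read,
            "processed": processed,
            "written": written,
            "status": status,
            "message": message,
        })
```

A task would then just call `task_started("git_enrich")` before running and `task_finished("git_enrich", 1000, 5000, 5000)` when done, so the common format is enforced in a single place.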

Just my 2 cents.

canasdiaz commented 6 years ago

@alpgarcia I think your idea would improve the tool, but I would drop those results to the log and use ELK to send them to an Elasticsearch database. In any case, your aim is wider than the check we are discussing here.
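For instance, the summary could be emitted as one JSON document per log line, which the usual ELK pipeline can parse and ship to an Elasticsearch index without any extra service; the function and field names below are illustrative assumptions:

```python
import json
import logging

logger = logging.getLogger("gelk.audit")

def log_summary(task, read, processed, written, status="OK", message=""):
    # One JSON document per line is trivial for Logstash/Filebeat
    # to parse and forward to Elasticsearch.
    logger.info("SUMMARY %s", json.dumps({
        "task": task,
        "read": read,
        "processed": processed,
        "written": written,
        "status": status,
        "message": message,
    }))
```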

How can the new studies check the data they are analyzing? How is this information logged by the orchestration tool (mordred in this case)? I would work on these two questions first.

alpgarcia commented 6 years ago

First things first, maybe you want something else but:

> How can the new studies check the data they are analyzing? How is this information logged by the orchestration tool (mordred in this case)? I would work on these two questions first.

If you look at the log line I pasted above, that's the summary printed after finishing the study, and it tells you everything the study knows about what was done. It can be improved, it can be misleading, or even buggy, but the code for that line comes from: https://github.com/chaoss/grimoirelab-elk/blob/master/grimoire_elk/enriched/ceres_base.py#L107

So you can be pretty sure it is written by all new studies, as they inherit this method. Of course, one could override this method and forget this line, but that would be easy to detect, as you would never get the line you are looking for.
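To make the inheritance point concrete, the pattern is roughly the following (a simplified sketch of the idea, not the actual ceres_base.py code):

```python
import logging

logger = logging.getLogger(__name__)

class CeresBase:
    """Simplified stand-in for the base class in
    grimoire_elk/enriched/ceres_base.py."""

    def _process(self):
        raise NotImplementedError

    def analyze(self):
        read, processed, written = self._process()
        # Every study inheriting analyze() emits this summary for free;
        # only a study overriding analyze() itself could lose it.
        logger.info("SUMMARY: Items total read/total processed/total "
                    "written: %s/%s/%s", read, processed, written)

class AreasOfCode(CeresBase):
    def _process(self):
        # Study-specific enrichment would happen here.
        return 58168, 399449, 399449
```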

About my proposal, I think we are closer than one might think at first glance:

> @alpgarcia I think your idea would improve the tool, but I would drop those results to the log and use ELK to send them to an Elasticsearch database. In any case, your aim is wider than the check we are discussing here.

I proposed a service for logging information, letting the service decide where to write that information down, so it does not matter how persistence is implemented in the data layer used by the service. The important thing (to me) is that this service would guarantee that everybody uses the same format for logging. And I think a simple but functional first implementation of this service would be easy (it could be a web service or not, just a service imported from a common module).

Just think about one single point to control all log messages related to task execution in a common format :D