crossminer / scava

https://eclipse.org/scava/
Eclipse Public License 2.0

Overlapping tasks and dates on the same project affect metrics #178

Open creat89 opened 5 years ago

creat89 commented 5 years ago

The new Scava features that allow updating or changing the metrics to apply, as well as the range of dates over which the analysis runs, seem to be affecting some transient metrics and, in consequence, some historic ones.

The problem resides in the fact that the Mongo collections of historic metrics, and of some transient ones such as severity or emotions, are never dropped; they are only updated with new entries as more days are analyzed. These metrics were conceived under the assumption that the collections were either empty or, if they already held data, that the data came from previously analyzed days. Most importantly, any data to be added was assumed to come from days after the last analyzed day.
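
To illustrate the pattern, here is a minimal sketch (with hypothetical database, collection, and field names, not Scava's actual schema) of what such an append-only daily update looks like with the MongoDB Java driver. Rerunning the same day simply appends a duplicate document:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class AppendOnlyHistoricMetric {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> historic = client
                    .getDatabase("docdokuplm")                 // project DB (illustrative)
                    .getCollection("historic.bugs.emotions");  // historic metric collection (illustrative)

            // Called once per analyzed day: blindly appends a new document.
            // Running the same day twice yields two documents for "20190122".
            historic.insertOne(new Document("date", "20190122")
                    .append("emotion", "__label__anger")
                    .append("numberOfComments", 1));
        }
    }
}
```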

In OSSMeter, this behavior was never problematic, as the platform never ran over the same day twice, and redoing the analysis of a project meant dropping the project's general collection and restarting from the beginning.

However, the fact that users can rerun the same (or almost the same) metrics on the same dates affects the accuracy of historic metrics, because the historic and some transient collections are not dropped and already contain information. For example:

A user decides to run a metric regarding emotions during all of January on project X. After the analysis is done, the user updates the task to also include sentiments over the same period of time. Currently, Scava does not drop the metric collections, so on the second run there is already information regarding emotions. In the end, the analysis of project X will show emotion results that do not match those from the first analysis.

This happens with certain trans metrics which store information that is considered global and atemporal. For example, org.eclipse.scava.metricprovider.trans.committers.CommittersMetricProvider creates a collection that stores the committers as they appear in a project; this collection is never dropped, so on a second run committers will have double the number of commits. Another example is org.eclipse.scava.metricprovider.trans.bugs.emotions.EmotionsTransMetricProvider, which stores the emotions seen in a project globally over time.
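
A hedged sketch of the cumulative case (names again illustrative only): a running counter that is incremented per observed commit and never reset will double after a second run over the same dates.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class CumulativeCommitters {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> committers = client
                    .getDatabase("docdokuplm")
                    .getCollection("trans.committers"); // illustrative collection name

            // For each commit seen on the analyzed day, bump the committer's
            // global counter. Re-analyzing the same day replays the same
            // commits, so "alice" ends up with twice her real commit count.
            committers.updateOne(
                    Filters.eq("committer", "alice"),
                    Updates.inc("commitCount", 1),
                    new UpdateOptions().upsert(true));
        }
    }
}
```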

In my case, I never observed this because I always drop my collections when I want to test something. However, I noticed it while working with Martin: after four runs of the same task, the historic metrics had, in some cases, four entries of the same nature.

This is a serious bug, and the question is, how are we going to deal with it?

mhow2 commented 5 years ago

To narrow down the behavior we are observing, I ran a task twice on the project https://github.com/docdoku/docdoku-plm with a selection of metrics to produce bugs.emotions.comments. The time period is 13/Jan to 26/Jan 2019.

See the resulting JSON below.

We also notice that on the second pass, we get all the days returned in the response.

First pass:

```json
{
  "id": "bugs.emotions.comments",
  "projectId": "docdokuplm",
  "metricId": "org.eclipse.scava.metricprovider.historic.bugs.emotions.EmotionsHistoricMetricProvider",
  "name": "Comment Emotions",
  "description": "The number of comments containing each emotion",
  "type": "LineChart",
  "datatable": [
    { "Date": "20190122", "Emotion": "__label__anger", "Comments": 1 },
    { "Date": "20190123", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190124", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190125", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190126", "Emotion": "__label__anger", "Comments": 0 }
  ],
  "timeSeries": true,
  "ordinal": false,
  "x": "Date",
  "y": "Comments",
  "series": "Emotion"
}
```
Second pass:

```json
{
  "id": "bugs.emotions.comments",
  "projectId": "docdokuplm",
  "metricId": "org.eclipse.scava.metricprovider.historic.bugs.emotions.EmotionsHistoricMetricProvider",
  "name": "Comment Emotions",
  "description": "The number of comments containing each emotion",
  "type": "LineChart",
  "datatable": [
    { "Date": "20190114", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190115", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190116", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190117", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190118", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190119", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190120", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190121", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190122", "Emotion": "__label__anger", "Comments": 1 },
    { "Date": "20190122", "Emotion": "__label__anger", "Comments": 1 },
    { "Date": "20190123", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190123", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190124", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190124", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190125", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190125", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190126", "Emotion": "__label__anger", "Comments": 0 },
    { "Date": "20190126", "Emotion": "__label__anger", "Comments": 0 }
  ],
  "timeSeries": true,
  "ordinal": false,
  "x": "Date",
  "y": "Comments",
  "series": "Emotion"
}
```
creat89 commented 5 years ago

Thinking about this issue, I guess the best approach is to create a database for the project that is directly linked with the task. So, if the task for a specific project is deleted, the DB for that task and project is dropped. If the task is modified, the DB is overwritten (this would have to be done by dropping the DB and creating one with the same name). The name of the DB could be projectID+taskID.

The only consequence of doing this, if I'm not wrong, is that the user would need to sum the different outputs to obtain the actual outcome over both periods. For example, the number of commits would have to be the sum of the results from projectID+taskID1 plus projectID+taskID2.
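
A rough sketch of what this could look like, assuming a hypothetical projectID+taskID naming scheme:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class PerTaskDatabase {
    // Hypothetical naming scheme: one database per (project, task) pair.
    static String dbName(String projectId, String taskId) {
        return projectId + "-" + taskId; // e.g. "docdokuplm-task1"
    }

    // Dropping the DB wipes every metric collection of this task only;
    // other tasks on the same project are untouched. Editing a task would
    // call this before recreating the database under the same name.
    static void resetTaskDatabase(MongoClient client, String projectId, String taskId) {
        client.getDatabase(dbName(projectId, taskId)).drop();
    }

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            resetTaskDatabase(client, "docdokuplm", "task1");
        }
    }
}
```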

Another possible solution would be to modify the metrics, either historic or transient, to always remove the collections after an event trigger (e.g., a change to the task). However, in my opinion, this doesn't solve all the problems, especially when a task has already been run previously. For example, a user executes Task1 at the end of January; at the end of February this user executes Task2. However, Task2 was missing a metric, so it needs to be amended. If amending a task means dropping the tables, this would affect the outcomes of Task2, as part of the information used previously came from Task1.

A third solution, although quite complex I guess, is for the metrics to always store the modification date of a collection entry and check whether the entry already exists. If it does, just update it. However, this means that metrics such as committers or emotions, instead of keeping information with respect to the last day of the analysis, would have to keep a day-by-day history of, for example, how many committers or which emotions existed on each day.
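
A sketch of this upsert-by-date approach (field and collection names are illustrative): keying every entry by its date means a rerun over the same day replaces the entry instead of duplicating it.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class UpsertByDate {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> emotions = client
                    .getDatabase("docdokuplm")
                    .getCollection("trans.bugs.emotions"); // illustrative name

            // One document per (date, emotion): a second run over the same
            // day updates in place rather than appending. The price is that
            // formerly global metrics must now be stored per day.
            emotions.updateOne(
                    Filters.and(Filters.eq("date", "20190122"),
                                Filters.eq("emotion", "__label__anger")),
                    Updates.combine(Updates.set("numberOfComments", 1),
                                    Updates.set("lastModified", "20190601")),
                    new UpdateOptions().upsert(true));
        }
    }
}
```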

blueoly commented 5 years ago

I was testing my metric providers today and I also noticed this behaviour (not only in my providers but in general). Specifically, I noticed that if we run two tasks with overlapping dates, there are multiple entries for the same dates in the historic metric providers. This seems a serious issue. If I am not terribly wrong, historic metric providers cannot have any information or control themselves over the dates for which they compute their results, so the platform should be responsible for dealing with this.

One rough solution would be to drop from the database the collections that correspond to the metric providers that are going to be executed again by a new task, but this would result in potential loss of useful data.

Another, more refined solution would be for the platform to check whether there is already an entry for a given historic metric provider and a given date, and if there is one, not compute this specific metric again. But I do not know how much overhead this solution would introduce in terms of execution time.
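
A minimal sketch of such a platform-side check, with hypothetical names; a single filtered count per provider/date pair should keep the overhead low if the date field is indexed.

```java
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class SkipIfComputed {
    // One indexed lookup: does a result for this date already exist in the
    // provider's historic collection?
    static boolean alreadyComputed(MongoCollection<Document> historic, String date) {
        return historic.countDocuments(Filters.eq("date", date)) > 0;
    }

    static void maybeCompute(MongoCollection<Document> historic, String date) {
        if (alreadyComputed(historic, date)) {
            return; // the platform skips this provider for this date
        }
        // ... otherwise run the historic metric provider for `date` ...
    }
}
```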

creat89 commented 5 years ago

@blueoly, I don't think that not executing a historic metric for a specific day if it already exists is a solution either, because users can change the starting date to an earlier one. For example, if the starting date was B and is then changed to A, the historic metrics between A and B are going to have some data, but the historic metrics from B onward are not going to consider the values from A, as those metrics were not updated. This can be an issue especially for metrics that hold cumulative data.

blueoly commented 5 years ago

I am not sure that the metric providers will not consider the values of the dates that are not updated. The values will be there; I think they can use them. But I do not have thorough knowledge of how the computation of the historic metric providers works under the platform's hood.

ambpro commented 5 years ago

Hello all, with reference to the issue discussions above and after taking a deep look at the overlapping-tasks issue, we plan the following workaround:

  • Prevent analyzing past and current days while running multiple tasks with overlapping analysis ranges. This feature will avoid redundant collections.
  • Block the possibility of changing the start-date field while editing a task.
  • Remove/overwrite the analysis data collections linked to a task when it is deleted/edited, respectively.
  • Prevent rerunning the same metrics twice in two tasks of the same project. Basically, in case we have more than one task per project, the end user will not be able to select/visualize metrics which have already been executed over the same period in another task within the same project (difficulty: high, given the dependencies between the metric providers).

Please let us know if that makes sense to you.

MarcioMateus commented 5 years ago

Hi Amin,

These ideas make a lot of sense.

I don't know how you plan to introduce these changes, but if you go for a step-by-step feature release, then as a first feature, tasks with different metrics should be able to run over the same time range.

mhow2 commented 5 years ago

Makes sense, yes, but as Marcio said, let's see how to proceed.

davidediruscio commented 5 years ago

Hi Amin,

I think the changes you have planned make sense to me as well. Could you please elaborate a bit more on the first bullet point?

Thanks Davide

On Thu, Jun 6, 2019 at 6:42 PM Amin Boudeffa < notifications@github.com> wrote:

Hello all, with reference to the issue discussions above and after taking a deep look at the overlapping-tasks issue, we plan the following workaround:

  • Prevent analyzing past and current days while running multiple tasks with overlapping analysis ranges. This feature will avoid redundant collections.
  • Block the possibility of changing the start-date field while editing a task.
  • Remove/overwrite the analysis data collections linked to a task when it is deleted/edited, respectively.
  • Prevent rerunning the same metrics twice in two tasks of the same project. Basically, in case we have more than one task per project, the end user will not be able to select/visualize metrics which have already been executed over the same period in another task within the same project (difficulty: high, given the dependencies between the metric providers).

Please let us know if that makes sense to you.



ambpro commented 5 years ago

@MarcioMateus

I don't know how you plan to introduce these changes, but if you go for a step-by-step feature release, then as a first feature, tasks with different metrics should be able to run over the same time range.

@mhow2

Makes sense, yes, but as Marcio said, let's see how to proceed.

@davidediruscio

I think the changes you have planned make sense to me as well. Could you please elaborate a bit more on the first bullet point?

First of all, the new abilities of the admin-ui to update metrics and time ranges introduce a problem of analysed-data consistency when tasks have overlapping time ranges. Some historic/transient metric providers seem to be affected by wrong (or redundant) results, given that these metrics don't drop their collections but only update them with new entries.

For this reason, we aim to prevent running a metric before the last execution date of the metric itself. By the way, this feature is almost implemented in the platform after the analysis-process refactoring. FYI, the execution of a metric provider involves a collection called MetricExecutions in the scava-analysis database, which provides monitoring data for the execution of a metric provider within a specific project via the lastExecutionDate field.
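
A sketch of this gate, assuming one MetricExecutions document per project/metric-provider pair, with lastExecutionDate stored as a yyyyMMdd string; the actual Scava document layout may differ.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class LastExecutionGate {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> executions = client
                    .getDatabase("scava-analysis")
                    .getCollection("MetricExecutions");

            // Look up the monitoring document for this project/provider pair.
            // Field names here are assumptions for illustration.
            Document exec = executions.find(Filters.and(
                    Filters.eq("projectId", "docdokuplm"),
                    Filters.eq("metricProviderId",
                            "org.eclipse.scava.metricprovider.historic.bugs.emotions.EmotionsHistoricMetricProvider")))
                    .first();

            String candidateDate = "20190122";
            // yyyyMMdd strings compare lexicographically in date order.
            if (exec != null
                    && candidateDate.compareTo(exec.getString("lastExecutionDate")) <= 0) {
                return; // date already covered by a previous run: skip it
            }
            // ... otherwise execute the metric provider for candidateDate ...
        }
    }
}
```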

To be clearer, let's say that we have two tasks:

Both tasks execute the same metric provider: org.eclipse.scava.metricprovider.historic.bugs.emotions.EmotionsHistoricMetricProvider.

While running a branch (a list of metric dependencies) within a worker, we need to check that it hasn't already been executed for this date (look here). If that is the case, we will prevent the execution of these metrics for this date. As a result of the previous task's execution, we still have MongoDB collection outputs from the first execution, 21/01/2019 ---> 23/01/2019. Then, after rerunning the second task, we will not be able to execute these metrics before the last execution date, 23/01/2019. The new outputs should lie within the 24/01/2019 ---> 26/01/2019 time range. The only way to update the first task's output is to delete the task and recreate a new one.

creat89 commented 5 years ago

Hello @ambpro, I think these points are OK, but I still have some questions.

As a result of the previous task's execution, we still have MongoDB collection outputs from the first execution, 21/01/2019 ---> 23/01/2019. Then, after rerunning the second task, we will not be able to execute these metrics before the last execution date, 23/01/2019. The new outputs should lie within the 24/01/2019 ---> 26/01/2019 time range. The only way to update the first task's output is to delete the task and recreate a new one.

If the user deletes the first task, which covered 21/01/2019-->23/01/2019, is the user able to create a task with an older date, such as 18/01/2019-->23/01/2019? But more importantly, is the user able to rerun metrics that were originally run in the 21/01/2019 ---> 23/01/2019 task? If either is not possible, this would mean that, in order to correct a task over old dates, the user would need to drop the whole project and import it again, wouldn't it?

ambpro commented 5 years ago

Hi @creat89,

If the user deletes the first task, which covered 21/01/2019-->23/01/2019, is the user able to create a task with an older date, such as 18/01/2019-->23/01/2019?

Yes, deleting the first task implies that the linked metric collections will be dropped from the project analysis database. Then, the newly created task should cover the execution of the range 18/01/2019 --> 23/01/2019.

But more importantly, is the user able to rerun metrics that were originally run in the 21/01/2019 ---> 23/01/2019 task?

Yes, totally.

If either is not possible, this would mean that, in order to correct a task over old dates, the user would need to drop the whole project and import it again, wouldn't it?

No need to do that!

creat89 commented 5 years ago

Perfect!

mhow2 commented 5 years ago

@ambpro, I think the commit above could be the reason for #331

ambpro commented 5 years ago

@mhow2,

@ambpro, I think the commit above could be the reason for #331

I just pushed a patch (commit https://github.com/crossminer/scava/commit/ab4e424ea99ea641b04a116e8cf16514d67b2789) to prevent removing the analysis data of metrics shared between tasks in the same project.