Important Change Identification Road Map/Plan

Based on the discussion with @suchthis, @danielballan, and @trinberg in a recent call, here is the plan for identifying and prioritizing important changes.

The task will follow this road map:

1. Classify changes into two categories: "primary" (worth taking a second look at) and "secondary" (less important than primary changes, but possibly still worth a look for meaningful information). This happens after the changes pass through a first filtering (pre-filtering) layer, which tags the indisputably insignificant changes (date/time, social media, contact info, etc.). We therefore end up with three categories: primary, secondary, and insignificant.
2. Improve the classification model based on feedback and results.
3. Assign a numerical priority (score) to each change in the primary and secondary categories, based on features of the change. The two categories are prioritized separately, and priorities are not comparable across them. For example, a high priority in secondary (say 0.9) still ranks below a low priority in primary (say 0.2).
4. Improve the prioritization model based on feedback and results.
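
The ordering rule above (pre-filter first, then rank within each category, with every primary change ahead of every secondary one) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Change` class, the `TIER_ORDER` mapping, the regular expressions, and the function names are all hypothetical and are not part of the existing code base.

```python
import re
from dataclasses import dataclass

# Hypothetical tier ordering: a lower number sorts earlier in the list.
TIER_ORDER = {"primary": 0, "secondary": 1, "insignificant": 2}

# Hypothetical pre-filter patterns for indisputably insignificant content.
INSIGNIFICANT_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),      # dates such as 3/14/2017
    re.compile(r"\b\d{1,2}:\d{2}\b"),                 # times such as 09:30
    re.compile(r"twitter|facebook", re.IGNORECASE),   # social media mentions
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),           # email addresses
]


@dataclass
class Change:
    text: str
    category: str = ""   # "primary", "secondary", or "insignificant"
    score: float = 0.0   # priority within its own category only


def prefilter(change):
    """Tag the change as insignificant if nothing but filtered content remains.

    Returns True when the change was tagged. The change is never removed.
    """
    remainder = change.text
    for pattern in INSIGNIFICANT_PATTERNS:
        remainder = pattern.sub("", remainder)
    if not re.search(r"\w", remainder):
        change.category = "insignificant"
        return True
    return False


def rank(changes):
    """Order by category tier first, then by descending score within a tier."""
    return sorted(changes, key=lambda c: (TIER_ORDER[c.category], -c.score))
```

Sorting on the `(tier, -score)` pair keeps scores local to their category, so a secondary change scored 0.9 still lands after a primary change scored 0.2, and insignificant changes stay in the list at the end instead of being dropped.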

This road map applies to every type of model, i.e. text, source, and any other kind of change produced by the different differs. Each model will have its own road map based on this general plan.

The creation of each type of model will have three basic steps or parts:

1. Dataset creation by extracting the relevant information from the source (for example, the text from the source), followed by any required pre-processing.
2. Model training on the dataset, with performance validated on a test set.
3. Real-time classification/prioritization of new changes by passing them through the trained model.
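
As a rough illustration of these three steps for a text model, here is a minimal standard-library sketch. It assumes the source is HTML and swaps in a tiny naive Bayes classifier purely as a stand-in; the plan does not commit to any particular model, and `TextExtractor`, `extract_text`, `NaiveBayes`, and `accuracy` are hypothetical names.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Step 1: collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    # Pre-processing: lowercase and keep alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Step 2: multinomial naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Step 3: classify a new change with the trained model."""
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Validation: fraction of held-out examples classified correctly."""
    hits = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return hits / len(labels)
```

In use, the training texts would come from `extract_text` applied to each stored change, `fit` would run on the training split, `accuracy` on the held-out test split, and `predict` on each new incoming change.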

The correctly classified data will be added to the dataset and the model will be retrained periodically.
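
One possible shape for this feedback loop is sketched below. It reads "periodically" as "after every N confirmed examples", which is an assumption, as are the `FeedbackLoop` class, the `train_fn` callable, and the `retrain_every` parameter.

```python
class FeedbackLoop:
    """Grow the dataset with confirmed predictions and retrain every N additions."""

    def __init__(self, train_fn, dataset=None, retrain_every=100):
        self.train_fn = train_fn            # hypothetical: list of (text, label) -> model
        self.dataset = list(dataset or [])
        self.retrain_every = retrain_every  # assumed retraining interval
        self.pending = 0                    # confirmed examples since the last retrain
        self.model = train_fn(self.dataset) if self.dataset else None

    def record(self, text, predicted, confirmed):
        """Add the example only if a reviewer confirmed the predicted label."""
        if predicted != confirmed:
            return False
        self.dataset.append((text, confirmed))
        self.pending += 1
        if self.pending >= self.retrain_every:
            self.model = self.train_fn(self.dataset)
            self.pending = 0
        return True
```

A time-based schedule (for example, a nightly job) would fit the plan equally well; counting confirmed examples just keeps the sketch self-contained.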

Each change will only be tagged and classified and none of them will be removed from the list. This is to ensure that any important change which is incorrectly classified by the model isn't deleted or removed.

This issue is to define a clear process to follow for classification and prioritization. New contributors can also use this to add their own prioritization models.

This is open for discussion and suggestions are welcome. :)
