Intevation / intelmq-certbund-contact

IntelMQ expert bots to lookup contact information in a database (part of the intelmq-cb-mailgen solution).
GNU Affero General Public License v3.0
3 stars 2 forks source link

Deal with email tags in RIPE db changes #11

Open bernhardreiter opened 3 years ago

bernhardreiter commented 3 years ago

If an organisation changes its email address, there is a warning when running ripe_diff.py.

However, if this email address has had extra information (like being disabled or tags), there maybe additional steps necessary. This issue should make sure that they are done by either flagging them to the update operator or doing the changes itself (if possible).

considerations

The table email_tag is independent from the manual and automatic tables and has effects on all email addresses when it is attempted to use the address.

So if for an organisation the contact with emailA is replaced by a different contact with emailB, what should happen?

Let us consider the cases. We have two email addresses:

They both maybe used in one or several contacts (both manual or automatic).

Case 1 both have no status info

Nothing to do (solved)

Case 2 emailA has status info, emailB does not

If emailB is to be used for the first time, it could inherit the status info of emailA automatically. (Though it is unclear if in all cases the new email address' operator actually wants this.)

Else it is not clear because the other organisations that already have emailB did not have that status so it cannot be transfered without surprising them. -> Flag for manual decision.

Case 3 emailA does not, emailB has status info

If emailB has status infos, it is already used by another organisation, so it is unclear. -> Flag for manual decison.

Case 4 emailA and emailB both have status info

If they are the same status info -> do nothing (solved).

Else (the info differs) and it is undecided which is the better one for the remaining orgs for emailB and the new one. -> Flag for decision.

-> Flag

As there are only a one subcase (2.1) where the script could automatically do the right thing, a good approach seems to be to flag the situation.

bernhardreiter commented 3 years ago

creating test cases

Using fresh temporary database with postgresql and download ripe data as described in the documentation. Then we import only a part of the ripe data and use Norway as an example (fishshell).

set -l day 2021-03-03
cp -a $day $day-trunc
cd $day-trunc
gzip -d --stdout ../$day/ripe.db.inet6num.gz | head --lines=100000 | gzip > ripe.db.inet6num.gz
gzip -d --stdout ../$day/ripe.db.inetnum.gz | head --lines=100000 | gzip > ripe.db.inetnum.gz

set -l b /home/bern/dev/certbund-contact-git
env PYTHONPATH=$b: python3 $b/intelmq_certbund_contact/ripe/ripe_import.py -v --restrict-to-country NO --conninfo 'host=localhost dbname=contactdb'

# using approximately 3% of the entries
psql -h localhost contactdb -c "select email from contact_automatic TABLESAMPLE BERNOULLI (3);" --tuples-only | \
 head --lines=-1 | cut --characters=2- >sample_emails.txt

# sed 's/\(.*\)/UPDATE contact_automatic SET email=\'x-\1\' WHERE email=\'\1\';/' sample_emails.txt >sample_emails.sql
# psql -h localhost contactdb -f sample_emails.sql

# sed  's/\(.*\)/x-\1/' sample_emails.txt >sample_emails_new.txt

env PYTHONPATH=$b: python3 $b/intelmq_certbund_contact/ripe/ripe_diff.py -v --restrict-to-country NO --conninfo 'host=localhost dbname=contactdb'
bernhardreiter commented 3 years ago

creating test cases

To find 10 email address that have been used in several contacts:

SELECT * FROM (
  SELECT COUNT(email), email FROM contact_automatic GROUP BY email ORDER BY COUNT(email) Desc) AS c 
WHERE count >1 LIMIT 10;

as single command to produce a list

psql  -h localhost contactdb -c "SELECT email
   FROM (SELECT COUNT(email), email FROM contact_automatic 
               GROUP BY email ORDER BY COUNT(email) Desc) AS c
   WHERE count >1 LIMIT 10;
 " --tuples-only | head --lines=-1 | cut --characters=2- > multiple_emails.txt

To change the email address of only one contact of several for one email, the sql command is more involved:

sed 's/\(.*\)/UPDATE contact_automatic SET email=\'x-\1\' WHERE contact_automatic_id IN (SELECT contact_automatic_id FROM contact_automatic WHERE email=\'\1\' LIMIT 1);/'

with the combination of tests/create_email_tag_cases.py , a lot more test cases can be created.

bernhardreiter commented 3 years ago

There is a difficulty: the ripe_import workflow does not look at the differences, it only completely deletes the old contents and inserts the new database completely. So in order to find the organisation with changed email addresses, a ripe_diff run is needed, but ripe_diff does not writing anything to the database.

One implementation strategy could be to let the diff run create the necessary changes in the email tags to be saved in a file and executed after the import run.