Intevation / intelmq-certbund-contact

IntelMQ expert bots to lookup contact information in a database (part of the intelmq-cb-mailgen solution).
GNU Affero General Public License v3.0

Can memory usage be improved? #14

Open bernhardreiter opened 3 years ago

bernhardreiter commented 3 years ago

Running the ripe importer uses quite a bit of memory (~8GB as of today). Can this be reduced?

Analysis

Downloading the ripe files for 2021-02-12 gives us the raw size we want to process:

2021-02-12> gzip -l *

gzip: delegated-ripencc-latest: not in gzip format
         compressed        uncompressed  ratio uncompressed_name
            8206950            78779140  89.6% ripe.db.aut-num
           25903102           507574226  94.9% ripe.db.inet6num
          242928983          3628042984  93.3% ripe.db.inetnum
            5843624            95550239  93.9% ripe.db.organisation
            4413327            77870566  94.3% ripe.db.role
          287295986          4387817155  93.5% (totals)

So there is about 4.4 GB (roughly 4.1 GiB) of uncompressed data to process.
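The total can be cross-checked from the uncompressed column of the listing above. This is only a quick sketch; the sizes are copied from the `gzip -l` output, and the per-file share is added to show which file dominates the input:

```python
# Uncompressed sizes in bytes, copied from the `gzip -l` listing above.
sizes = {
    "ripe.db.aut-num": 78779140,
    "ripe.db.inet6num": 507574226,
    "ripe.db.inetnum": 3628042984,
    "ripe.db.organisation": 95550239,
    "ripe.db.role": 77870566,
}

total = sum(sizes.values())
print(total)                       # should match the "(totals)" row
print(f"{total / 1e9:.2f} GB")     # decimal gigabytes
print(f"{total / 2**30:.2f} GiB")  # binary gibibytes

# Share of each file in the total, largest first.
for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {size / total:6.1%}")
```

ripe.db.inetnum accounts for most of the raw input, so it is a natural first place to look when reading the profile.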

Profiling was done with memory-profiler (https://pypi.org/project/memory-profiler/):

# Debian Buster
apt-get install python3-memory-profiler python3-matplotlib

Decorating a few functions where the memory is likely to be consumed:

--- a/intelmq_certbund_contact/ripe/ripe_data.py
+++ b/intelmq_certbund_contact/ripe/ripe_data.py
@@ -78,2 +78,3 @@ def add_common_args(parser):

+@profile
 def load_ripe_files(options) -> tuple:
@@ -205,2 +206,3 @@ def read_asn_whitelist(filename, verbose=False):

+@profile
 def parse_file(filename, fields, index_field=None, restriction=lambda x: True,
@@ -298,2 +300,3 @@ def parse_file(filename, fields, index_field=None, restriction=lambda x: True,

+@profile
 def build_index(obj_list, index_attribute):
@@ -441,2 +444,3 @@ def split_for_known_orgs(obj_list, organisation_index):

+@profile
 def sanitize_split_and_modify(obj_list, index, whitelist,
@@ -501,2 +505,3 @@ def sanitize_split_and_modify(obj_list, index, whitelist,

+@profile
 def convert_inetnum_to_networks(inetnum_list):
@@ -510,2 +515,3 @@ def convert_inetnum_to_networks(inetnum_list):

+@profile
 def convert_inet6num_to_networks(inet6num_list):
@@ -517,2 +523,3 @@ def convert_inet6num_to_networks(inet6num_list):

+@profile
 def process_inetnum_contacts(key, inet_list, inet_list_u, restrict_country):
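The diff above uses `@profile` without importing it, so the patched module would fail with a NameError when the profiler does not provide that name. A minimal sketch of a fallback, assuming memory-profiler injects `profile` into `builtins` when it runs the script itself (`build_example_index` is a hypothetical stand-in, not a function from ripe_data.py):

```python
import builtins

# Assumption: when the script runs under memory_profiler, a `profile`
# decorator is available via builtins. Otherwise fall back to a no-op.
try:
    profile = builtins.profile
except AttributeError:
    def profile(func):
        """No-op stand-in used when the script runs without the profiler."""
        return func


@profile
def build_example_index(items):
    # Hypothetical stand-in for a decorated function such as build_index().
    return {item: len(item) for item in items}


print(build_example_index(["aut-num", "inetnum"]))
```

An explicit `from memory_profiler import profile` at the top of the module is the documented alternative, at the cost of making the package a hard dependency while the decorators are in place.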

To get a plot, run the import with a country restriction of NO:

env PYTHONPATH=/home/bern/dev/certbund-contact-git: python3-mprof run /home/bern/dev/certbund-contact-git/intelmq_certbund_contact/ripe/ripe_import.py -v --restrict-to-country NO --conninfo 'host=localhost port=5432 dbname=contactdb'
python3-mprof plot -t "ripe_importer memory profile 2021-02-12"

(Plot: mprofile_20210212110015.dat)

Here is the data file for interactive browsing (rename to remove the .txt suffix): mprofile_20210212110015.dat.txt
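For a quick look without plotting, the peak can be read straight from the data file. This is a rough sketch, assuming the file contains sample lines of the form `MEM <MiB> <unix timestamp>` and that all other lines (such as `CMDLINE` or `FUNC` records) can be skipped; `peak_memory` is a hypothetical helper, and the sample values below are made up for illustration:

```python
def peak_memory(lines):
    """Return (peak_mib, timestamp) over the MEM samples, or None if there are none."""
    peak = None
    for line in lines:
        parts = line.split()
        # Assumed sample format: MEM <MiB> <unix timestamp>
        if len(parts) == 3 and parts[0] == "MEM":
            sample = (float(parts[1]), float(parts[2]))
            if peak is None or sample[0] > peak[0]:
                peak = sample
    return peak


# Illustrative samples, not taken from the attached file.
example = """\
CMDLINE python3 ripe_import.py
MEM 10.5 1613124015.10
MEM 250.0 1613124015.20
MEM 120.25 1613124015.30
""".splitlines()

print(peak_memory(example))
```

For the real data, pass an open file instead of the example list, for instance `with open("mprofile_20210212110015.dat") as f: print(peak_memory(f))`.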