MISP / PyMISP

Python library using the MISP Rest API
Other
431 stars 278 forks source link

PyMISP tags() method slow response on MISP ver 2.4.131 with 3k tags #650

Open chrisinmtown opened 3 years ago

chrisinmtown commented 3 years ago

The pymisp doc for the tags() method says simply "Get the list of existing tags". I expected this method to be implemented as roughly "select * from misp.tags;" at the database and return answers quickly. Instead, our version 2.4.131 system requires about 100 seconds to return a list of about 3,300 tags, using the default pythonify=False option.

I checked the database where I see queries like this running:

SELECT COUNT(*) AS `count` FROM `misp`.`attributes` AS `Attribute` 
LEFT JOIN `misp`.`events` AS `Event` ON (`Attribute`.`event_id` = `Event`.`id`) 
LEFT JOIN `misp`.`sharing_groups` AS `SharingGroup` ON (`Attribute`.`sharing_group_id` = `SharingGroup`.`id`) 
LEFT JOIN `misp`.`objects` AS `Object` ON (`Attribute`.`object_id` = `Object`.`id`)  
WHERE `Attribute`.`id` IN (39717, 39730, 39751, ..

Here's a sample Tag from the response list. Interestingly, the database column 'numerical_value' is not returned, but other fields are returned. I cannot find any explanation of fields "count" and "attribute_count" but I bet those are the usages of this tag:

{"id": "1221", "name": "malware_classification:malware-category=\"Botnet\"", "colour": "#22681c", "exportable": true,
  "org_id": "0", "hide_tag": false, "user_id": "0", "count": "0", "attribute_count": "0", "favourite": false}

I think the root cause of this issue is probably documentation, not software. If I'm right in guessing that the tags() method's results contain stats for every tag about its usage in events and attributes, then I'll be glad to submit a PR with an extended docstring for the tags() method.

chrisinmtown commented 3 years ago

Alternately, would you accept a feature request to make the behavior of the tags() method configurable? For example, add parameter usage; if True (default) then do the massive join and return the count/attribute fields, otherwise do not.

Rafiot commented 3 years ago

@mokaddem @iglocska any idea what do do there? Maybe allow to paginate?

chrisinmtown commented 3 years ago

I suggest applying the following patch to explain the tags() method response.

diff --git a/pymisp/api.py b/pymisp/api.py
index 05c3b25..24bbb16 100644
--- a/pymisp/api.py
+++ b/pymisp/api.py
@@ -787,7 +787,9 @@ class PyMISP:
     # ## BEGIN Tags ###

     def tags(self, pythonify: bool = False) -> Union[Dict, List[MISPTag]]:
-        """Get the list of existing tags.
+        """Get the list of existing tags and their usage.
+        In the response, each tag has key 'count' with the number of tagged events
+        and key 'attribute_count' with the number of tagged attributes.

         :param pythonify: Returns a list of PyMISP Objects instead of the plain json output. Warning: it might use a lot of RAM
         """
Rafiot commented 3 years ago

Done for the docstring: https://github.com/MISP/PyMISP/commit/7e84c36406905e8f976571cf0dcf2f3c3e9b5fc3

Thanks!

chrisinmtown commented 3 years ago

The improved docstring addresses my question here, thank you, feel free to close this issue. Alternately I could rewrite this issue as a feature request: tags() should accept pagination parameters.

Rafiot commented 3 years ago

As it doesn't resolves the performance problem, and that we do not have a clean solution for that now, we will keep it open.