[CSV feed] CSV feed floods ingestion with identical data

Lhorus6 commented 1 month ago

Description

CSV feed import seems buggy or not optimized.

In my case, I have an import from the Blocklist.de source, which contains around 30K IPs. For example, at the moment there are 27K entries in the source.

However, just for this small source, I currently find myself with 2.36M bundles in the queue and tons of works.
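
As a rough sanity check on these numbers, the sketch below estimates how many full re-imports of the source would be needed to reach that queue size. It assumes, purely for illustration, that each source entry produces one bundle and that one work is created every 30 minutes (the interval reported further down in this issue). Neither assumption is confirmed by the platform's internals, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate. The one-bundle-per-entry ratio is an assumption.
ENTRIES_IN_SOURCE = 27_000      # entries currently in the Blocklist.de file
BUNDLES_IN_QUEUE = 2_360_000    # bundles observed in the queue
WORK_INTERVAL_MINUTES = 30      # assumed: one work every 30 minutes


def estimated_full_imports(bundles: int, entries: int) -> float:
    """Number of complete passes over the source needed to produce `bundles`."""
    return bundles / entries


def hours_to_accumulate(imports: float, interval_minutes: int) -> float:
    """Wall-clock hours needed for that many imports at the given interval."""
    return imports * interval_minutes / 60


imports = estimated_full_imports(BUNDLES_IN_QUEUE, ENTRIES_IN_SOURCE)
hours = hours_to_accumulate(imports, WORK_INTERVAL_MINUTES)
print(f"~{imports:.0f} full imports, ~{hours:.0f} hours of polling")
```

Under those assumptions, the queue size corresponds to roughly 87 complete re-imports of the same source, which is a bit under two days of half-hourly polling if nothing is consumed in the meantime.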

Environment

OCTI 6.3.4

Reproducible Steps

Steps to create the smallest reproducible scenario:

Create this CSV Mapper:

Create this CSV feed: https://lists.blocklist.de/lists/all.txt

Let it run for several hours, or even 24 - 48 hours, to see how it behaves.

Additional information

It seems to me that the data is only imported if the file hash changes. So does this source update its file every 30 minutes? (I ask because I get a work every 30 minutes.)

This seems unlikely. Perhaps we have a bug in the hash generation that takes metadata as input? Just a guess.

If the file does change continuously, maybe we shouldn't retrieve it every time, but only twice a day?
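
One way to test this guess outside the platform is to fingerprint two successive downloads of the file. The sketch below is not the platform's hash computation. It is a standalone check, with hypothetical function names and made-up sample data, that compares a hash of the raw bytes with a hash of the sorted set of entries. If the raw hash changes but the entry hash does not, only ordering or line endings changed, not the data.

```python
import hashlib


def raw_fingerprint(body: bytes) -> str:
    """SHA-256 of the file bytes exactly as downloaded (no metadata involved)."""
    return hashlib.sha256(body).hexdigest()


def entry_fingerprint(body: bytes) -> str:
    """SHA-256 of the sorted, de-duplicated, non-empty lines of the file."""
    lines = {line.strip() for line in body.decode("utf-8").splitlines() if line.strip()}
    return hashlib.sha256("\n".join(sorted(lines)).encode("utf-8")).hexdigest()


# Made-up sample data: same three entries, different order and line endings.
first_poll = b"192.0.2.1\n198.51.100.7\n203.0.113.9\n"
second_poll = b"203.0.113.9\r\n192.0.2.1\r\n198.51.100.7\r\n"

print(raw_fingerprint(first_poll) == raw_fingerprint(second_poll))      # False
print(entry_fingerprint(first_poll) == entry_fingerprint(second_poll))  # True
```

Running both functions on two real downloads taken 30 minutes apart would show whether the source genuinely publishes new entries each time or merely rewrites the same ones.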

nino-filigran commented 1 month ago

I've started a feed to reproduce, will let you know about the output

richard-julien commented 1 month ago

We compute the hash on the full file. We can't really do much in terms of data control. To prevent too many works, I am currently trying to not create any job if there is something already in the queue.
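
The idea described above can be sketched as a small guard. This is only an illustration under stated assumptions, not the code from the platform. `FeedState`, `pending_in_queue`, and `should_create_job` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedState:
    """Hypothetical per-feed bookkeeping, for illustration only."""
    last_hash: Optional[str] = None  # hash of the last file that was ingested
    pending_in_queue: int = 0        # bundles from this feed not yet consumed


def should_create_job(state: FeedState, new_hash: str) -> bool:
    """Create a job only if the queue is drained and the file hash changed."""
    if state.pending_in_queue > 0:
        # Earlier work is still being processed, so do not pile more on top.
        return False
    return new_hash != state.last_hash


state = FeedState(last_hash="abc", pending_in_queue=1200)
print(should_create_job(state, "def"))  # False: queue is not empty yet

state.pending_in_queue = 0
print(should_create_job(state, "def"))  # True: queue drained and hash changed
print(should_create_job(state, "abc"))  # False: same file as last time
```

With a guard like this, a source that changes on every poll can add at most one file's worth of bundles to the queue at a time, instead of a new batch every 30 minutes.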

nino-filigran commented 1 month ago

I reproduced your issue @Lhorus6. So based on your comment @richard-julien, can I consider it as a "won't fix"?

Lhorus6 commented 1 month ago

Maybe it's not the hash calculation we have to play with, but there's something to be done in any case IMO. Here we're blowing up the ingestion queues.

Julien said he is currently trying to not create any job if there is something already in the queue, in order to prevent too many works, so I guess he is testing possibilities for improvement.

richard-julien commented 1 month ago

Yes. Testing PR opened here https://github.com/OpenCTI-Platform/opencti/pull/8617

Lhorus6 commented 2 weeks ago

Note: I think this issue is resolved by Julien's work. We tested a feed that was problematic and it's all good now.
