Describe the bug
I originally observed this when testing my own service, but I was also able to reproduce it with the YARA service. When updating signatures from a source, the updater service (based on the AL library) sends the new data to Elasticsearch (do_source_update), then notifies another thread to fetch a new signature package from Elasticsearch (do_local_update), which is finally served to the worker services.
When uploading data to Elasticsearch, the updater (through the signature client) makes a synchronous call, but it does not ask Elasticsearch to wait for the shards to be refreshed. By default, Elasticsearch completes the bulk request independently of the refresh. So if Elasticsearch isn't quick enough, or the updater isn't slow enough, the new signatures are not yet visible when the updater asks for a new signature package. As a result, new updates are not downloaded back by the updater and are not exposed to the workers until the next update of the local files (worst case: the next scheduled update, e.g. the next day).
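To illustrate the race outside of Assemblyline, here is a minimal standalone sketch with the elasticsearch Python client (the index name, document and URL are made up for the example; this is not the actual Assemblyline code path):

```python
# Standalone sketch of the visibility race, not Assemblyline code.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local instance

# Bulk-index an updated signature. By default Elasticsearch acknowledges the
# request without waiting for the next index refresh, so the document is
# written but not yet searchable.
helpers.bulk(es, [{
    "_index": "signature",
    "_id": "example_rule",
    "_source": {"name": "example_rule", "last_modified": "2024-01-02T00:00:00Z"},
}])

# Searching right away (roughly what the updater does when it asks for the
# newest signatures) can still return the previous version or nothing,
# because search only sees documents after a refresh.
res = es.search(index="signature",
                body={"query": {"ids": {"values": ["example_rule"]}}})
print(res["hits"]["total"])  # may still reflect the stale state until a refresh
```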
To Reproduce
Steps to reproduce the behavior:
Set up the YARA service with an update source you control, for example with a single signature.
Start the updater and wait until the data is downloaded. Check your data in the signature viewer and the signature file inside the updater container.
Change your source data, e.g. by editing the signature, and trigger an update (with a source hosted on GitHub, I had to wait a little until their caches refreshed).
See the update happening in the logs (e.g. Imported 1/1 signatures from example into Assemblyline), but also a No signature updates available. log shortly after.
Check the data in the signature viewer - it should be the newest version - and compare it with the data in the signature package in the service - it still has the older version (!).
Trigger the update once more and observe the refreshed signature package.
Expected behavior
After a successful update from a source, the data is available for download by the workers immediately or within a short time.
The Elasticsearch bulk API exposes a refresh parameter that can, among other options, ask Elasticsearch to wait for the refresh. By default, it does not wait. I tested this by hardcoding the wait_for value in the bulk() method in datastore/collection.py from the assemblyline_base repo, and it fixed the problem. However, I don't know whether it was intentionally left unset or whether it has side effects elsewhere.
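For reference, here is a minimal sketch of the kind of change I tested. It is a simplified stand-in, not the actual bulk() from datastore/collection.py; the helper name and index layout are assumptions for illustration.

```python
# Simplified sketch of the workaround: pass refresh="wait_for" to the bulk call.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local instance

def bulk_index(index, docs):
    # refresh="wait_for" makes the bulk call return only once the documents
    # are searchable, at the cost of a slower request.
    return helpers.bulk(
        es,
        ({"_index": index, "_id": doc_id, "_source": doc}
         for doc_id, doc in docs.items()),
        refresh="wait_for",
    )

# After this returns, an immediate search will see the new signature.
bulk_index("signature", {"example_rule": {"last_modified": "2024-01-02T00:00:00Z"}})
```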
Screenshots
Environment (please complete the following information if pertinent):
Assemblyline Version: 4.5.0.29, Yara 4.5.10
Elasticsearch still v7
Additional context
During debugging, I confirmed that Elasticsearch is simply returning the old last-modified timestamp; the issue is not in the service itself. This is a race condition, and I'm aware it may be harder to spot with more sources or a different Elasticsearch configuration. I believe the behaviour should generally be consistent, and if the problem is not just that Elasticsearch in my setup is slow, the impact can sometimes be quite significant (e.g. YARA rules being used a day later than expected).