center-for-threat-informed-defense / sightings_ecosystem

Sightings Ecosystem gives cyber defenders visibility into what adversaries actually do in the wild. With your help, we are tracking MITRE ATT&CK® techniques observed to give defenders real data on technique prevalence.
https://ctid.io/sightings-ecosystem
Apache License 2.0
33 stars, 8 forks

option to read jsonl instead of json files #5

Closed: zmallen closed this issue 6 months ago

zmallen commented 2 years ago

When loading sightings data under ./data, the pipeline (https://github.com/center-for-threat-informed-defense/sightings_ecosystem/blob/main/src/pipeline/pipeline.py#L218) assumes that each file is one massive JSON blob.

This can become untenable for large sightings files, as Python can be inefficient at processing a large file as a single JSON document. This is especially true for high-throughput applications that can log millions to tens of millions of records to sightings per day.

I recommend adding the ability to load files as JSONL, that is, newline-delimited JSON with one sighting blob per line. You can then set up sidecars/forwarders (I am using https://vector.dev) to append to a file within the sightings ecosystem.
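To illustrate the request, here is a minimal sketch of a loader that streams a `.jsonl` file one record at a time and falls back to whole-file parsing for `.json`. The function name `iter_sightings` is hypothetical and not part of the existing pipeline, and the assumption that a `.json` file holds either a list of sightings or a single sighting object is mine, not something taken from `pipeline.py`.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_sightings(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield sighting records from a .jsonl or .json file (hypothetical helper)."""
    if path.suffix == ".jsonl":
        # Stream line by line so only one record is held in memory at a time.
        with path.open(encoding="utf-8") as handle:
            for line_number, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines left by forwarders
                try:
                    yield json.loads(line)
                except json.JSONDecodeError as exc:
                    raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    else:
        # Assumed legacy layout: the whole file is one JSON document.
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        if isinstance(data, list):
            yield from data
        else:
            yield data
```

Because the function is a generator, a caller could loop over `iter_sightings(Path("data/sightings.jsonl"))` and process each record as it is read, while a forwarder keeps appending complete lines to the same file between runs. The file path here is only an example.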

mehaase commented 2 years ago

Great idea, thank you! I will include this in the planning for our next iteration.