Azure / azure-diagnostics-tools

Plugins and tools for collecting, processing, managing, and visualizing diagnostics data and configuration

One logstash input for Azure blob is slower than other #214

Open arunp-motorq opened 4 years ago

arunp-motorq commented 4 years ago

Issue with Logstash input for Azure blob

I have one instance of Logstash reading data from blob storage. Although the logs are in the same container, I have two major folder structures for logs from two different processes. The blob structure has two top-level folders, folder1 and folder2.

My Logstash blob config looks like this:

```
azureblob {
  storage_account_name => 'folder1'
  storage_access_key => ''
  container => 'logs'
  id => 'jobs1'
  blob_list_page_size => 150
  file_chunk_size_bytes => 8088608
  registry_create_policy => 'resume'
  path_filters => 'folder1/2020 /*/.csv'
}

azureblob {
  storage_account_name => 'folder2'
  storage_access_key => ''
  container => 'logs'
  id => 'jobs1'
  blob_list_page_size => 150
  file_chunk_size_bytes => 8088608
  registry_create_policy => 'resume'
  path_filters => 'folder2/2020 /*/.csv'
}
```
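While debugging, it may help to sanity-check each `path_filters` glob against a few sample blob names. This is a quick stand-in check using Python's `fnmatch` (the plugin's exact matching semantics may differ, and the sample blob names below are hypothetical); note that the patterns above contain a space before `/*`, which is worth verifying against the real blob paths:

```python
import fnmatch

# Hypothetical blob names shaped like the folder layout described above.
samples = [
    "folder1/2020/07/job.csv",
    "folder2/2020/07/job.csv",
]

# The first pattern is the one from the config (with the embedded space);
# the second is a conventional glob shown for comparison.
for pattern in ["folder1/2020 /*/.csv", "folder1/2020/*/*.csv"]:
    hits = [s for s in samples if fnmatch.fnmatch(s, pattern)]
    print(f"{pattern!r} -> {hits}")
```

If the pattern with the space matches nothing against real blob names, that input would sit idle rather than run slowly, so this mainly rules one cause in or out.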

The heap is around 3 GB and CPU usage is at 70-80%.

I run only one instance of Logstash. The issue is that logs from folder2 are processed much faster than logs from folder1; folder2 is days ahead of folder1. (This is a catch-up scenario: I am reading logs from the start of this month.) How do I debug this?

pinochioze commented 4 years ago

Hi Arun, I think your concern is due to the number of blobs in each folder (you can get this number using the Azure CLI or Microsoft Azure Storage Explorer). The procedure of this plugin is:

  1. Get the list of all the blobs in the container.
  2. Compare that list against the `path_filters` patterns to get the list of matched blobs.
  3. Pick one blob from the list of matched blobs, based on the generation algorithm and the blob's offset.

So many of the blobs in the matched list have to wait for the next loop of the process.
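The loop above can be sketched as follows. This is a simplified Python model with hypothetical names (the real plugin is written in Ruby and its blob-selection logic is more involved); it only illustrates why an input whose filter matches many more blobs takes many more loop iterations to catch up:

```python
import fnmatch

def pick_next_blob(blobs, path_filter, registry):
    """One iteration of the simplified loop: filter the full blob list
    by path_filter, then return the first matched blob that still has
    unread bytes. Every other matched blob waits for a later iteration."""
    matched = [name for name in sorted(blobs) if fnmatch.fnmatch(name, path_filter)]
    for name in matched:
        # registry maps blob name -> committed read offset;
        # offset < size means there is still data to read.
        if registry.get(name, 0) < blobs[name]:
            return name
    return None

# Two "folders" in one container; folder1 holds far more blobs.
blobs = {f"folder1/2020/{i:04d}.csv": 100 for i in range(1000)}
blobs.update({f"folder2/2020/{i:04d}.csv": 100 for i in range(10)})

# Since each input advances roughly one blob per loop, the input whose
# filter matches 1000 blobs needs ~100x more loops than the one matching 10,
# which would show up exactly as folder2 running days ahead of folder1.
print(pick_next_blob(blobs, "folder2/*", {}))
```

Under this model, checking the blob counts per folder (as suggested above) is the right first step: a large imbalance alone explains the lag without any misconfiguration.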