Graylog2 / graylog2-server

Free and open log management
https://www.graylog.org
Other
7.33k stars 1.06k forks source link

Couldn't refresh data adapter #4748

Closed ceveyp closed 5 years ago

ceveyp commented 6 years ago

Expected Behavior

Lookup tables should be updated.

Current Behavior

In the log file I get messages like this:

2018-04-21T16:20:31.544-04:00 ERROR [LookupDataAdapter] Couldn't refresh data adapter lladdr-lookup/5aa40ca39d992d72500b1b83/@7e3d1a53 java.lang.NullPointerException: null

Also when I create a lookup table or insert rows in an existing CSV file, the lookup doesn't work.

Possible Solution

Is it related to my version of Java? I'm just using the current latest version of Java, 1.8.0_162. Is Graylog not compatible with this version of Java?

Context

Your Environment

joschi commented 6 years ago

@ceveyp Please post the complete error message including some context from the logs of your Graylog node and provide the complete configuration of the lladdr-lookup data adapter.

ceveyp commented 6 years ago

Hi. You said to post the context around the error message but actually there really is no context, the log file is just filled with these types of errors messages one after the other.

Here is an example error message:

2018-05-04T22:07:13.195-04:00 ERROR [LookupDataAdapter] Couldn't refresh data adapter lladdr-lookup/5aa40ca39d992d72500b1b83/@753740cb java.lang.NullPointerException: null at org.graylog2.lookup.adapters.CSVFileDataAdapter.doRefresh(CSVFileDataAdapter.java:102) ~[graylog.jar:?] at org.graylog2.plugin.lookup.LookupDataAdapter.refresh(LookupDataAdapter.java:89) ~[graylog.jar:?] at org.graylog2.lookup.LookupDataAdapterRefreshService.lambda$schedule$0(LookupDataAdapterRefreshService.java:142) ~[graylog.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_162] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_162] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_162] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_162] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]

Here is my configuration for lladdr-lookup data adapter: https://imgur.com/a/xhDvKZD https://imgur.com/oJa33kz

Here is what my hosts.csv file looks like: "ipaddr","lladdr","hostname" "127.0.0.1","00:00:00:00:00:00","hostname"

Also one thing I should say, these lookup tables use to work. Maybe I accidentally updated Java. But they don't work anymore. When I create a new lookup table I get the same problem.

dustintennill commented 6 years ago

Tagging on this issue - I encountered this same problem.

The lookup tables have been functional in the past, and I didn't notice an issue until I had to restart all of our Graylog servers over the weekend. It appears most of our lookup tables were not working, and there are messages in the log indicating a problem.

We have a lookup table that's used many times per second, and I was seeing this message in server.log: "WARN [LookupTableService] Lookup table does not exist"

This table did exist, and initial inspection showed no obvious configuration issues.

I did the following:

  1. Stopped graylog-server services.
  2. Cleared server.log.
  3. Checked the CSV files in question to make sure they were present on the node, readable and had valid content.
  4. Rebooted the box.
  5. Confirmed I had the same error messages as you posted from server.log.

I updated java, and that seemed to fix the issue for me:

  1. I applied os updates (including Java) on the node. This took java to jdk-8u171.
  2. Stopped graylog-server service.
  3. Cleared the server.log.
  4. Checked the CSV files in question to make sure they were present on the node, readable and had valid content.
  5. My server.log is clear of issues now.

This fixed 2 of our 5 nodes. I am working down the list to see if that clears all of the issues up.

ceveyp commented 6 years ago

I seriously can't believe someone would use Graylog in a production environment. Honestly I spent about 3 weeks troubleshooting Graylog and I just gave up on it in the end. It is infinitely more work than it's worth.

ceveyp commented 6 years ago

This is open source software in a nutshell: it takes forever to deploy, you have to go right down in to the nitty-gritty technical details, and at the same time the documentation is atrocious, and then after you finally set it up the way you want, you run in to all these weird problems for no fault of your own, with no meaningful way troubleshoot.

dustintennill commented 6 years ago

Updating java didn't fix my issue - just took a few minutes to come back. After looking at my issue for a bit, I have a difference compared to what you posted. I see messages warning me that a particular lookup table doesn't exist.

"WARN [LookupTableService] Lookup table does not exist"

The lookup table does exist in the GUI, and I verified I can see it in mongodb.

> db.lut_tables.find({name:"alexa-top-million"});
{ "_id" : ObjectId("5b3a72a088259303bba683d4"), "title" : "Alexa Top Million", "description" : "", "name" : "alexa-top-million", "cache" : ObjectId("5b3a727588259303bba683a7"), "data_adapter" : ObjectId("5b3a8c7388259303bba69dd7"), "default_single_value" : "", "default_single_value_type" : "NULL", "default_multi_value" : "", "default_multi_value_type" : "NULL" }`

Here is a snip of logs showing what looks odd to me:

2018-07-09T14:16:31.834-04:00 INFO  [LookupTableService] Data Adapter alexa-top-million-csv/5b3a8c7388259303bba69dd7 [@138bb8ac] RUNNING
2018-07-09T14:16:31.834-04:00 INFO  [LookupDataAdapterRefreshService] Adding job for <alexa-top-million-csv/5b3a8c7388259303bba69dd7/@138bb8ac> [interval=60000ms]
2018-07-09T14:16:32.376-04:00 INFO  [LookupTableService] Data Adapter tor-exit-node/5a71f27186b3fb0d9ae0648c [@7182b26a] RUNNING
2018-07-09T14:16:32.377-04:00 INFO  [LookupDataAdapterRefreshService] Adding job for <tor-exit-node/5a71f27186b3fb0d9ae0648c/@7182b26a> [interval=3600000ms]
2018-07-09T14:16:38.208-04:00 INFO  [NetworkListener] Started listener bound to [gl02:12900]
2018-07-09T14:16:38.209-04:00 INFO  [HttpServer] [HttpServer] Started.
2018-07-09T14:16:38.210-04:00 INFO  [JerseyService] Started REST API at <https://gl02:12900/>
2018-07-09T14:16:40.211-04:00 INFO  [NetworkListener] Started listener bound to [gl02:9000]
2018-07-09T14:16:40.212-04:00 INFO  [HttpServer] [HttpServer-1] Started.
2018-07-09T14:16:40.213-04:00 INFO  [JerseyService] Started Web Interface at <https://gl02:9000/>
2018-07-09T14:18:32.543-04:00 WARN  [LookupTableService] Lookup table <alexa-top-million> does not exist

At this point I believe that none of my lookup tables are working.

ceveyp commented 6 years ago

Yeah I reinstalled Graylog, I updated Java, I recreated the lookup table it made no difference.

dustintennill commented 6 years ago

I have more information.

Restating my current issue: Some of my previously working lookup tables have ceased to function. A review of my server.log file complains that tables do not exist, but I can see them in the gui and in mongodb. I renamed a lookup table and updated the pipelines that consume it, but still get errors in the log.

Here are config screenshots for one of the lookup tables with an issue. screen shot 2018-07-10 at 8 20 43 am screen shot 2018-07-10 at 8 20 29 am screen shot 2018-07-10 at 8 19 47 am

All of the logs are full of: 2018-07-10T08:36:31.736-04:00 WARN [LookupTableService] Lookup table does not exist

I attempted a restart of everything to get clean logs before posting. This proved a little challenging - log files were cleared of useful information after 4-5 seconds of actual traffic.

While tailing the logs this time, I caught this message:

2018-07-10T15:20:52.761-04:00 ERROR [ServerBootstrap] Unable to shutdown properly on time. {STOPPING=[LookupTableService [STOPPING]], TERMINATED=[InputSetupService [TERMINATED], OutputSetupService [TERMINATED], StreamCacheService [TERMINATED], PeriodicalsService [TERMINATED], ConfigurationEtagService [TERMINATED], KafkaJournal [TERMINATED], JerseyService [TERMINATED], BufferSynchronizerService [TERMINATED], JournalReader [TERMINATED]]}
2018-07-10T15:20:52.761-04:00 ERROR [ServerBootstrap] Graylog startup failed. Exiting. Exception was:
java.lang.IllegalStateException: Expected to be healthy after starting. The following services are not running: {STARTING=[LookupTableService [STARTING]], STOPPING=[InputSetupService [TERMINATED]]}
at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.checkHealthy(ServiceManager.java:740) ~[graylog.jar:?]
at com.google.common.util.concurrent.ServiceManager$ServiceManagerState.awaitHealthy(ServiceManager.java:553) ~[graylog.jar:?]
at com.google.common.util.concurrent.ServiceManager.awaitHealthy(ServiceManager.java:312) ~[graylog.jar:?]
at org.graylog2.bootstrap.ServerBootstrap.startCommand(ServerBootstrap.java:149) [graylog.jar:?]
at org.graylog2.bootstrap.CmdLineTool.run(CmdLineTool.java:209) [graylog.jar:?]
at org.graylog2.bootstrap.Main.main(Main.java:44) [graylog.jar:?]

I understand now that the LookupTableService is probably not stopping/starting properly.

Here are the logs from gl03 from when a restart is attempted. https://drive.google.com/file/d/1v72CTLzdTMvbK2d5HWJd9OA0ClKV9mL_/view?usp=sharing

kroepke commented 6 years ago

In https://github.com/Graylog2/graylog2-server/issues/4748#issuecomment-386773124 looks like it fails to refresh the change information for that file. There's code that is supposed to avoid anything being null there, but something is.

The code in question hasn't changed recently, so something else must be going on. It very much sounds like Graylog being unable to read the file (or update its change status), which appears to happen during the first load.

Can you try copying the file to a location where you are absolutely sure it is readable (e.g. /tmp) to help debug this?

Thanks!

bernd commented 6 years ago

@dustintennill Thank you for the detailed information.

According to your log file, the "alexa-top-million" data adapter is starting but never goes into the RUNNING state. All other data adapters first go into STARTING and then into RUNNING.

2018-07-10T15:21:56.021-04:00 INFO  [LookupTableService] Data Adapter alexa-top-million/5b3a6ded5541d73670ae23b8 [@5ded348] STARTING

The lookup table code is currently waiting for every data adapter and cache to start. That means if there is one data adapter that doesn't go into the "RUNNING" state, no lookup table gets started. This is another issue that we need to look into.

But first we should try to get your data adapter running or at least figure out what the problem is. Can you please show us the configuration of your "alexa-top-million" (ID 5b3a6ded5541d73670ae23b8) data adapter?

You can get the configuration via curl on the graylog node:

curl -u admin:<admin-password> http://127.0.0.1:9000/api/system/lookup/adapters/5b3a6ded5541d73670ae23b8

Thank you!

dustintennill commented 6 years ago

Here is the output.

The CSV file in question is one million lines.

{"id":"5b3a6ded5541d73670ae23b8","title":"Alexa Top Million","description":"This is for the Alexa Top Million","name":"alexa-top-million","content_pack":null,"config":{"type":"dsvhttp","type":"dsvhttp","url":"http://sysmgmt.eku.edu/graylog/top-1m.csv","separator":",","line_separator":"\\r\\n","quotechar":"\"","ignorechar":"#","key_column":1,"check_presence_only":false,"value_column":2,"refresh_interval":60,"case_insensitive_lookup":true}}

I just realized that this config is for a version of this lookup table we are not using. We had trouble with the DSV setting and decided to go back to local CSV files until we had time to troubleshoot. We created a new data adapter using local CSV files, and that was working fine. However, I did not delete the unused DSV version of the data adapter at that time.

To sum up what the issue was and how it caught us:
We created a DSV lookup table for the Alexa Top Million data, and it didn't work correctly at first. We gave up, and created a local CSV file and new adapater. That worked, and we moved on. This past weekend, we had reason to restart graylog services. We noted things weren't working correctly, but did not connect that back the Alexa Top Million DSV Data Adapter.

I have deleted the unused adapter and am restarting services.

dustintennill commented 6 years ago

Confirmed - removing the DSV table and restarting services resolved the issues we were having.

I don't know if this resolves the issue @ceveyp was having.

bernd commented 6 years ago

@dustintennill Thank you for the update!

I guess either loading the CSV file from the HTTP server or loading the file into heap memory takes some time in your case. We need to make this more robust to avoid having one faulty or slow data adapter block all lookup tables. I will open another issue for that.

Regarding the NPE @ceveyp reported: I will create a small fix to avoid a potential NullPointerException in the CSV data adapter. If that fixes the issue is hard to say without complete logs, though.

bernd commented 5 years ago

Closing this. The NullPointerException has been fixed and will be available in the upcoming 3.0 release. The other issue with the blocking data adapter has been added to #4524. Please watch that issue for progress on that one. Thank you!

ffabreti commented 4 years ago

I had weird errors like org.graylog2.plugin.lookup.LookupDataAdapter - Couldn't refresh data adapter because cvs file was not UTF-8. So, be warned.

MisderGAO commented 3 years ago

Hello everyone,

Are you sure this bug is resolved? I am using graylog 3.3, this problem still exist when I restart a graylog node or add a new graylog node. To make the error disappeared in logs, I have to:

  1. move csv file to another folder
  2. from data adapter interface, click update.
  3. get back the csv file

Could you please check it ? Thanks