dadoonet / fscrawler

Elasticsearch File System Crawler (FS Crawler)
https://fscrawler.readthedocs.io/
Apache License 2.0

Add Elastic Java Agent instrumentation #1244

Open philippkahr opened 3 years ago

philippkahr commented 3 years ago

Is your feature request related to a problem? Please describe.

Many users of this crawler run it as a scheduled task, a Docker container, or a 24x7 service. Currently you have to resort to external mechanisms, e.g. Metricbeat watching the process, to gain insight into uptime, CPU usage, memory, and so on.

Since this is a Java application, it would be a natural fit to attach the Elastic APM Java Agent. That would take care of a lot of the monitoring requests, like #987.

Describe the solution you'd like

Use the Elastic APM Java agent to instrument this cool application and show us all the nitty-gritty details and inner workings of this application.

Describe alternatives you've considered

Running and configuring the Java agent manually. It works, but maybe we can tune the packages used in trace_methods so that the agent does not consume as much CPU for instrumentation.

```shell
FS_JAVA_OPTS="-Delastic.apm.trace_methods=fr.pilato.elasticsearch.*,org.apache.poi.*,org.apache.tika.* -javaagent:/Users/philippkahr/Downloads/fscrawler-es7-2.7/elastic-apm-agent.jar -Delastic.apm.service_name=fscrawler -Delastic.apm.server_url=https://......apm.westeurope.azure.elastic-cloud.com:443 -Delastic.apm.secret_token=..." ./bin/fscrawler campus
```
Screenshots: four screenshots attached (2021-08-31).
SylvainJuge commented 3 years ago

I'd be happy to help here. The first step would be to define what should be instrumented as "transactions" in this application; using the public API annotations could then be a first step.
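To make the annotation idea concrete, here is a minimal sketch of what the public-API approach could look like. The class and method names are hypothetical (not fscrawler's actual internals), and it assumes the `co.elastic.apm:apm-agent-api` dependency is on the classpath; without the agent attached, the annotations are simply no-ops.

```java
import co.elastic.apm.api.CaptureSpan;
import co.elastic.apm.api.CaptureTransaction;

// Hypothetical class for illustration; fscrawler's real crawl entry point
// would be annotated the same way.
public class FsCrawlerJobRunner {

    // One scheduled crawl run becomes one APM transaction.
    @CaptureTransaction(type = "scheduled", value = "crawl")
    public void runOnce() {
        indexFile("/tmp/example.txt");
    }

    // Each file processed becomes a span inside the current transaction,
    // so slow Tika parsing or Elasticsearch indexing shows up per file.
    @CaptureSpan(value = "index-file")
    private void indexFile(String path) {
        // parse with Tika, send the document to Elasticsearch, ...
    }
}
```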

Using a very broad trace_methods is definitely not recommended and should only be kept to define what the transactions are. To create spans for unsupported frameworks (like the Tika framework here), using the profiler to create inferred spans when needed is probably the best option (see https://www.elastic.co/guide/en/apm/agent/java/current/java-method-monitoring.html)
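As a sketch of that recommendation, the broad trace_methods from the earlier command could be replaced with the agent's sampling-profiler-based inferred spans, scoped to the parsing libraries. The paths, server URL, and token below are placeholders:

```shell
# Sketch: inferred spans instead of broad trace_methods (placeholder values).
FS_JAVA_OPTS="-javaagent:/path/to/elastic-apm-agent.jar \
  -Delastic.apm.service_name=fscrawler \
  -Delastic.apm.server_url=https://... \
  -Delastic.apm.secret_token=... \
  -Delastic.apm.profiling_inferred_spans_enabled=true \
  -Delastic.apm.profiling_inferred_spans_included_classes=org.apache.tika.*,org.apache.poi.*" \
  ./bin/fscrawler campus
```

Because inferred spans come from periodic sampling rather than bytecode weaving of every matched method, the CPU overhead should be much lower than a wildcard trace_methods.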

SylvainJuge commented 3 years ago

For example, in the screenshots that you provided, what are the things that are instrumented as transactions in Kibana? Can you provide a screenshot of the Transactions tab @philippkahr?

philippkahr commented 3 years ago

The instrumented transactions are, for example, these:

I don't find them really useful. https://user-images.githubusercontent.com/12175559/131536625-eb407fb8-fc04-4614-ba83-5189959e3df3.mov

From an ops perspective I would be most interested in:

If FSCrawler is crawling files from a remote source, I would like those remotes to appear in the service map, like this.

image

Additionally, I think a breakdown by job is interesting. If I instrument fscrawler, maybe every job should appear as its own individual service, like this? Then I could easily use machine learning or write simple alerts within Rules and Alerting.

image
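Without any code changes, one low-effort way to approximate that per-job breakdown today might be to run one fscrawler process per job and override the agent's service name (and/or attach a global label) per invocation. This is a sketch with placeholder values, assuming the standard `service_name` and `global_labels` agent settings:

```shell
# Sketch: one process per job, each reporting as its own APM service
# (placeholder agent path, URL, and token).
FS_JAVA_OPTS="-javaagent:/path/to/elastic-apm-agent.jar \
  -Delastic.apm.service_name=fscrawler-campus \
  -Delastic.apm.global_labels=job=campus \
  -Delastic.apm.server_url=https://... \
  -Delastic.apm.secret_token=..." \
  ./bin/fscrawler campus
```

Each job then shows up separately in the Services list, which is enough to attach per-job ML jobs or alert rules, at the cost of one JVM per job.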