ganglia / ganglia-web

Ganglia Web Frontend

graph.php giving zero-length output without any indication of error #381

Open mvpel opened 3 months ago

mvpel commented 3 months ago

I'm running Ganglia-Web version 3.7.6-48.el8 on a Red Hat 8 system. The php package is 7.2.24-1, the latest available for RHEL 8. Recently all the graphs in our web interface started showing up broken, and I'm having trouble getting to the bottom of it. For example, the following URL:

http://_hostname_/ganglia/graph.php?r=hour&z=xlarge&c=Desktops&h=adcsn002&jr=&js=&v=99.5&m=cpu_idle&vl=%25&ti=cpu_idle

... delivers zero bytes, and thus a broken image. I've confirmed that the RRDs are updating properly. In /var/lib/ganglia/rrds/Desktops/adcsn002/cpu_idle.rrd, which would be referenced by this HTTP query, doing an rrdtool dump shows valid data for the past-hour timeframe which this one-hour graph call should be collecting:

        <rra>
                <cf>AVERAGE</cf>
                <pdp_per_row>1</pdp_per_row> <!-- 15 seconds -->

                <params>
                <xff>5.0000000000e-01</xff>
                </params>
                <cdp_prep>
                        <ds>
                        <primary_value>9.7486666667e+01</primary_value>
                        <secondary_value>9.8806666667e+01</secondary_value>
                        <value>NaN</value>
                        <unknown_datapoints>0</unknown_datapoints>
                        </ds>
                </cdp_prep>
                <database>
                        <!-- 2024-07-29 09:57:15 MST / 1722272235 --> <row><v>9.9200000000e+01</v></row>
...
                        <!-- 2024-07-30 09:24:00 MST / 1722356640 --> <row><v>9.9300000000e+01</v></row>
                        <!-- 2024-07-30 09:24:15 MST / 1722356655 --> <row><v>9.9593333333e+01</v></row>
                        <!-- 2024-07-30 09:24:30 MST / 1722356670 --> <row><v>9.9406666667e+01</v></row>
                        <!-- 2024-07-30 09:24:45 MST / 1722356685 --> <row><v>9.9500000000e+01</v></row>
                        <!-- 2024-07-30 09:25:00 MST / 1722356700 --> <row><v>9.9453333333e+01</v></row>
...
                        <!-- 2024-07-30 10:24:15 MST / 1722360255 --> <row><v>9.9140000000e+01</v></row>
                        <!-- 2024-07-30 10:24:30 MST / 1722360270 --> <row><v>9.7193333333e+01</v></row>
                        <!-- 2024-07-30 10:24:45 MST / 1722360285 --> <row><v>9.7486666667e+01</v></row>
                </database>
        </rra>

The structure and framework of the overall site appear normal; only the graph images show the "broken image" icon.

The access_log for the Apache 2.4.37-62 server shows a 200 result for graph.php calls while delivering the page, and nothing appears in error_log. The entry below is for the query above: the HTTP result code is 200 (success), and the response size is logged as "-", meaning no body bytes were sent:

10.x.x.x - - [30/Jul/2024:10:22:32 -0700] "GET /ganglia/graph.php?r=hour&z=xlarge&c=Desktops&h=adcsn002&jr=&js=&v=99.5&m=cpu_idle&vl=%25&ti=cpu_idle HTTP/1.1" 200 - "http://hostname/ganglia/graph_all_periods.php?c=Desktops&h=adcsn002&r=hour&z=default&jr=&js=&st=1722360114&v=99.5&m=cpu_idle&vl=%25&ti=cpu_idle&z=large" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36"

Looking over the graph.php code, it seems that an empty result from the RRD query would create a valid image of an empty graph, based on rrdtool_graph_merge_args_from_json(), but I'm not getting any image data at all - zero-size response according to curl. Should I be examining graph_all_periods.php, shown in the access_log output above, as well?

Permissions on the RRD files are fine. As I mentioned previously, gmetad continues to update them as it gathers gmond data, and there are no restrictions on the "apache" user's access to the RRD files. I would expect a problem there to show up in error_log.

I do see a range of "NaN" values in older spans of time over the past few days, which may have come about following some patching work, and I see NaN in the <value> tag of the cdp_prep section in the dump above, but every data point in the r=hour time span shows expected data.

I'd appreciate any input anyone can offer as to how to go about getting to the bottom of this issue. I'm not quite clear on how to track what graph.php is doing internally, or determine if it's not getting the data it expects.

mvpel commented 2 months ago

I was able to drill down to find the root cause. There are a couple of potential pitfalls here.

The tempnam() call that creates the script file used to run the rrdtool command and redirect its output returns an empty string on our system when "/tmp" is used as the base directory. I presume this is some sort of security feature in Apache or PHP (or both) that prevents use of a filesystem that hasn't been explicitly exempted. I haven't gotten to the bottom of this yet. It works fine when I give it the Ganglia-Web base directory specified in the Apache config, so we're using that as a workaround.

Reproducer:

        <?php
                echo '<p>Hello World</p>';
                print_r("<p>Calling tempnam...</p>");
                $tf = tempnam(".", "hello-world.");
                print_r("<p>tempnam with . gave us '$tf'</p>");
                $tf = tempnam("/tmp", "hello-world.");
                print_r("<p>tempnam with /tmp gave us '$tf'</p>");
        ?>

The output is:

Hello World
Calling tempnam...
tempnam with . gave us '/var/www/html/ganglia/hello-world.RZuzM4'
tempnam with /tmp gave us ''

If anyone can offer a suggestion on an Apache or PHP config to allow for /tmp I'd appreciate it. Still digging... This issue also impacts the hardcoded /tmp in the tempnam() call for ganglia-graph-json.
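Until the /tmp restriction is understood, the failure can at least be made visible instead of silently producing a zero-byte graph. The following is a minimal sketch, not ganglia-web code: make_temp_file() and its parameters are hypothetical names, and it assumes PHP 7.1 or later. It tries a list of candidate directories, logs every failure via error_log(), and rejects the case where tempnam() quietly falls back to a different directory.

```php
<?php
// Hypothetical helper (not part of ganglia-web): try each candidate
// directory in order and return the path of a newly created temp file,
// or null if none of the directories worked.
function make_temp_file(array $candidateDirs, string $prefix): ?string
{
    foreach ($candidateDirs as $dir) {
        if (!is_dir($dir) || !is_writable($dir)) {
            error_log("temp dir not usable: $dir");
            continue;
        }
        $path = @tempnam($dir, $prefix);
        if ($path === false || $path === '') {
            // This is the case seen in the reproducer above.
            error_log("tempnam() failed in $dir");
            continue;
        }
        // tempnam() can fall back to the system temp directory, so
        // confirm the file really landed where it was requested.
        if (realpath(dirname($path)) !== realpath($dir)) {
            @unlink($path);
            error_log("tempnam() fell back to " . dirname($path));
            continue;
        }
        return $path;
    }
    return null;
}
```

A caller could pass something like ["/tmp", the Ganglia-Web base directory] and return an HTTP error when the result is null, so the problem shows up in error_log rather than as a broken image.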

However:

There's another issue looming that will impact this approach of running rrdtool via a script in /tmp. In the latest Defense Information Systems Agency (DISA) Security Technical Implementation Guide (STIG) for Red Hat 8, finding V-230513 requires "noexec" on the /tmp filesystem. Other findings apply noexec restrictions to /var/tmp, /dev/shm, and home directories. This means that the tempnam() script file, even if successfully created, cannot be exec()ed on a system configured in compliance with the STIG.
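Whether a given directory is affected can be checked from the mount table. The sketch below is an illustration only: path_is_noexec() is a hypothetical helper, it assumes a Linux /proc/mounts layout (device, mount point, type, options), and it ignores mount points containing escaped spaces. It picks the longest mount point that contains the path and reports whether its options include "noexec".

```php
<?php
// Hypothetical helper: returns true if $path sits on a filesystem mounted
// with "noexec". $mounts is the text of /proc/mounts, passed in as a
// string so the logic can be exercised without touching the real system.
function path_is_noexec(string $path, string $mounts): bool
{
    $best = '';
    $bestOpts = [];
    foreach (explode("\n", trim($mounts)) as $line) {
        $fields = preg_split('/\s+/', trim($line));
        if (count($fields) < 4) {
            continue; // not a well-formed mount entry
        }
        $mountPoint = $fields[1];
        // Match when the path is the mount point itself or lies beneath it.
        $prefix = rtrim($mountPoint, '/') . '/';
        $isUnder = ($path === $mountPoint) || strpos($path . '/', $prefix) === 0;
        // ">=" lets a later entry for the same mount point win (overmounts).
        if ($isUnder && strlen($mountPoint) >= strlen($best)) {
            $best = $mountPoint;
            $bestOpts = explode(',', $fields[3]);
        }
    }
    return in_array('noexec', $bestOpts, true);
}
```

On a live system the call would look like path_is_noexec('/tmp', file_get_contents('/proc/mounts')); a true result means a script written there cannot be executed directly.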

I suspect that, in the face of this, it would be necessary to hand the rrdtool command line directly to shell_exec() or the like, rather than passing it through a shell script with shell-mediated output.
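As a rough illustration of that direction, here is a sketch that uses proc_open() rather than shell_exec(), because it also exposes stderr and the exit code, which would have made the original zero-byte failure much easier to diagnose. run_command() is a hypothetical helper and not how graph.php currently works; the command is passed as a string because the array form of proc_open() is not available on PHP 7.2.

```php
<?php
// Hypothetical helper, not part of ganglia-web: run a command line directly
// and capture stdout, stderr, and the exit code, with no temp script file.
function run_command(string $commandLine): array
{
    $spec = [
        1 => ['pipe', 'w'], // child's stdout
        2 => ['pipe', 'w'], // child's stderr
    ];
    $proc = proc_open($commandLine, $spec, $pipes);
    if (!is_resource($proc)) {
        return ['stdout' => '', 'stderr' => 'proc_open failed', 'exit' => -1];
    }
    // Reading stdout to EOF first is fine for short stderr output; a command
    // that writes a lot to stderr would need stream_select() instead.
    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);
    $exit = proc_close($proc);
    return ['stdout' => $stdout, 'stderr' => $stderr, 'exit' => $exit];
}

// Example with a harmless command. A real call would build the rrdtool
// command line with escapeshellarg() applied to every argument.
$result = run_command('echo ' . escapeshellarg('hello'));
if ($result['exit'] !== 0 || $result['stdout'] === '') {
    error_log('command failed: ' . $result['stderr']);
}
```

With this shape, an empty stdout or a non-zero exit code can be turned into an HTTP error and an error_log entry instead of a 200 response with no body.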