DNS-OARC / ripeatlas

Go bindings for RIPE Atlas API
MIT License

Support for checking error field on traceroute.Reply #29

Open jmeggitt opened 1 year ago

jmeggitt commented 1 year ago

Issue

RIPE Atlas probes occasionally emit hop replies containing an "error" field. This behavior is not documented for any firmware version on https://atlas.ripe.net/docs/apis/result-format/. After some investigation, at least some of these errors can be attributed to the following probe measurement code: https://github.com/RIPE-NCC/ripe-atlas-probe-measurements/blob/master/eperd/traceroute.c#L636-L643 According to git blame, this has been part of the probe behavior for over 10 years and still is.

The ability to detect this field is necessary to determine whether past and current traceroute measurement data is affected.

Affected Measurement Examples

I dumped every measurement containing this field from one of the RIPE Atlas hourly traceroute dump files (traceroute-2022-10-14T0400.bz2). The data is stored as newline-delimited JSON and can be found at https://gist.github.com/jmeggitt/11fba9f7fa539e8a4fdae1e231ec8fa1. The field appears to be extremely rare: it occurred in only 1690 of the 8,912,306 traceroute measurements in that file (0.019%).

Values

As far as I have seen, this field is always a string when it appears. This appears to be consistent with the probe code above.

count   value
      9 "bind failed: Address already in use"
     11 "bind failed: Address not available"
     47 "bind failed: Cannot assign requested address"
    334 "bind failed: Invalid argument"
    804 "sendto failed: Network is unreachable"
    104 "sendto failed: Network unreachable"
    364 "sendto failed: Operation not permitted"
     17 "sendto failed: Permission denied"
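
For reference, here is a minimal Go sketch of decoding such a reply using only the standard library. The `hopReply` type and `decodeReply` function are hypothetical names made up for illustration; they are not the existing `traceroute.Reply` API of this package. The `error` member is modeled as a plain string only because that is the sole form seen in the data above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hopReply is a hypothetical type for this sketch, not the library's
// traceroute.Reply. Only a few members are modeled.
type hopReply struct {
	From  string  `json:"from"`
	Rtt   float64 `json:"rtt"`
	X     string  `json:"x"`
	Error string  `json:"error"` // undocumented; a string in all sampled data
}

// decodeReply unmarshals one entry of a hop's "result" array.
func decodeReply(raw []byte) (hopReply, error) {
	var r hopReply
	err := json.Unmarshal(raw, &r)
	return r, err
}

func main() {
	r, err := decodeReply([]byte(`{"error": "sendto failed: Operation not permitted"}`))
	if err != nil {
		panic(err)
	}
	if r.Error != "" {
		fmt.Println("reply carries an error:", r.Error)
	}
}
```

An empty `Error` string is treated as "no error" here, which assumes probes never emit an empty error message.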

Examples

Here are a few measurements I arbitrarily chose to show what the field looks like in the context of the data.

{
  "af": 6,
  "dst_addr": "2001:500:2::c",
  "dst_name": "2001:500:2::c",
  "endtime": 1665721544,
  "from": "2001:bc8:62c:2545::1",
  "fw": 5020,
  "lts": 11,
  "msm_id": 6011,
  "msm_name": "Traceroute",
  "mver": "2.2.0",
  "paris_id": 2,
  "prb_id": 1000410,
  "proto": "UDP",
  "result": [
    {
      "hop": 1,
      "result": [
        {
          "error": "sendto failed: Operation not permitted"
        }
      ]
    }
  ],
  "size": 40,
  "src_addr": "2001:bc8:62c:2545::1",
  "timestamp": 1665721544,
  "type": "traceroute"
}
{
  "af": 4,
  "dst_addr": "46.101.130.201",
  "dst_name": "46.101.130.201",
  "endtime": 1665722912,
  "from": "170.39.226.151",
  "fw": 5040,
  "group_id": 29556742,
  "lts": 1,
  "msm_id": 29556742,
  "msm_name": "Traceroute",
  "mver": "2.4.1",
  "paris_id": 14,
  "prb_id": 6927,
  "proto": "ICMP",
  "result": [
    {
      "hop": 1,
      "result": [
        {
          "x": "*"
        },
        {
          "x": "*"
        },
        {
          "x": "*"
        }
      ]
    },
    {
      "hop": 2,
      "result": [
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
  ],
  "size": 48,
  "src_addr": "170.39.226.151",
  "timestamp": 1665722900,
  "type": "traceroute"
}
{
  "af": 6,
  "dst_addr": "2a00:74c0:0:2::20",
  "dst_name": "2a00:74c0:0:2::20",
  "endtime": 1665722173,
  "from": "2a05:f6c7:3853:0:eade:27ff:fe69:dd4e",
  "fw": 5070,
  "group_id": 25639804,
  "lts": 38,
  "msm_id": 25639804,
  "msm_name": "Traceroute",
  "mver": "2.6.1",
  "paris_id": 7,
  "prb_id": 22203,
  "proto": "ICMP",
  "result": [
    {"hop":1,"result":[{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.948,"size":96,"ttl":255},{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.828,"size":96,"ttl":255},{"from":"2a05:f6c7:3853:0:1e74:dff:fec3:e2f8","rtt":0.759,"size":96,"ttl":255}]},
    {"hop":2,"result":[{"from":"2a05:f6c0:1::18","rtt":3.658,"size":96,"ttl":63},{"from":"2a05:f6c0:1::18","rtt":4.791,"size":96,"ttl":63},{"from":"2a05:f6c0:1::18","rtt":3.415,"size":96,"ttl":63}]},
    {"hop":3,"result":[{"from":"2a05:f6c0:2:23::1","rtt":5.209,"size":96,"ttl":62},{"from":"2a05:f6c0:2:23::1","rtt":4.848,"size":96,"ttl":62},{"from":"2a05:f6c0:2:23::1","rtt":5.137,"size":96,"ttl":62}]},
    {"hop":4,"result":[{"from":"2001:6c8:81:100::1a1","rtt":4.867,"size":96,"ttl":252},{"from":"2001:6c8:81:100::1a1","rtt":18.54,"size":96,"ttl":252},{"from":"2001:6c8:81:100::1a1","rtt":4.106,"size":96,"ttl":252}]}, 
    {"hop":5,"result":[{"from":"2001:6c8:40::1e","rtt":4.761,"size":96,"ttl":250},{"from":"2001:6c8:40::1e","rtt":4.698,"size":96,"ttl":250},{"from":"2001:6c8:40::1e","rtt":5.275,"size":96,"ttl":250}]},
    {
      "hop": 6,
      "result": [
        {
          "x": "*"
        },
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
  ],
  "size": 48,
  "src_addr": "2a05:f6c7:3853:0:eade:27ff:fe69:dd4e",
  "timestamp": 1665722169,
  "type": "traceroute"
}
jelu commented 1 year ago

This looks more like a bug in the Atlas code to me. Have you reported this to RIPE?

For example:

    {
      "hop": 2,
      "result": [
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }

Should really be:

    {
      "hop": 2,
      "error": "sendto failed: Network is unreachable"
    }

And that would be correct according to the documentation for v4400.

jmeggitt commented 1 year ago

I was generally thinking the same thing. However, one complication is around how to handle an error that occurs in a later reply. If a hop already has one or two valid replies, then it may not make sense to mark the entire hop as having errored.

    {
      "hop": 6,
      "result": [
        {
          "x": "*"
        },
        {
          "error": "sendto failed: Network is unreachable"
        }
      ]
    }
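
To make the distinction concrete, here is a rough Go sketch that counts the error entries in a hop instead of flagging the whole hop. It uses only `encoding/json`; the `hopSummary` type and the `summarizeHop` function are hypothetical names for illustration and are not part of this package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hopSummary is a hypothetical helper type for this sketch.
type hopSummary struct {
	Hop     int
	Replies int      // number of entries in the hop's "result" array
	Errors  []string // "error" strings found among those entries
}

// AllFailed reports whether every entry in the hop was an error.
func (s hopSummary) AllFailed() bool {
	return s.Replies > 0 && len(s.Errors) == s.Replies
}

// summarizeHop decodes one hop object and collects its reply-level errors.
func summarizeHop(raw []byte) (hopSummary, error) {
	var h struct {
		Hop    int `json:"hop"`
		Result []struct {
			Error *string `json:"error"` // nil when the member is absent
		} `json:"result"`
	}
	if err := json.Unmarshal(raw, &h); err != nil {
		return hopSummary{}, err
	}
	s := hopSummary{Hop: h.Hop, Replies: len(h.Result)}
	for _, r := range h.Result {
		if r.Error != nil {
			s.Errors = append(s.Errors, *r.Error)
		}
	}
	return s, nil
}

func main() {
	raw := []byte(`{"hop":6,"result":[{"x":"*"},{"error":"sendto failed: Network is unreachable"}]}`)
	s, err := summarizeHop(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("hop %d: %d of %d entries are errors, all failed: %v\n",
		s.Hop, len(s.Errors), s.Replies, s.AllFailed())
}
```

With this shape, a caller can decide for itself whether a partially failed hop should count as an errored hop.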
jmeggitt commented 1 year ago

I have not yet raised an issue related to this field with RIPE Atlas. As it stands, there is a decent lead time on a fix being created, implemented, and deployed to their probes. Even if they could finish deploying a patch by tomorrow, nearly all of the previous measurement data would still be affected, so it would be helpful to have a way of identifying it until a more permanent solution is implemented.

I am also somewhat unclear on whether this is actually a bug in the probe software or in the API documentation. If we assume it is intentional, then the three cases for Timeout, Error, and Reply would closely match the ping measurement results structure.
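
Under that assumption, telling the three cases apart could be sketched in Go as below. The `replyKind` type, its constants, and `classify` are hypothetical names for illustration. The check order (`error`, then `x`, then `from`) is my own choice and is not taken from the RIPE Atlas documentation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replyKind is a hypothetical enumeration of the three cases.
type replyKind int

const (
	kindUnknown replyKind = iota
	kindTimeout           // entry has an "x" member
	kindError             // entry has an "error" member
	kindReply             // entry has a "from" member
)

func (k replyKind) String() string {
	switch k {
	case kindTimeout:
		return "Timeout"
	case kindError:
		return "Error"
	case kindReply:
		return "Reply"
	}
	return "Unknown"
}

// classify inspects which members are present in one hop result entry.
func classify(raw []byte) (replyKind, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(raw, &fields); err != nil {
		return kindUnknown, err
	}
	if _, ok := fields["error"]; ok {
		return kindError, nil
	}
	if _, ok := fields["x"]; ok {
		return kindTimeout, nil
	}
	if _, ok := fields["from"]; ok {
		return kindReply, nil
	}
	return kindUnknown, nil
}

func main() {
	for _, raw := range []string{
		`{"x": "*"}`,
		`{"error": "sendto failed: Network is unreachable"}`,
		`{"from": "2a05:f6c0:1::18", "rtt": 3.658, "size": 96, "ttl": 63}`,
	} {
		k, err := classify([]byte(raw))
		if err != nil {
			panic(err)
		}
		fmt.Println(k)
	}
}
```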

jelu commented 1 year ago

I'd like to not deviate from the documentation, please try and get them to update that first.

jelu commented 1 year ago

@jmeggitt Any progress on updating the documentation?

jelu commented 1 year ago

@jmeggitt ping?

jmeggitt commented 1 year ago

Sorry about that. I saw your previous ping, but got distracted with other work. I have notified them about the issue, but given their limited resources, I would not expect to see any updates to the documentation anytime soon. I have not been pushing the issue beyond bringing it to their attention, so I imagine it is still in their backlog.

The issue can be found at https://github.com/RIPE-NCC/ripe-atlas-probe-measurements/issues/14. However, if you want to get in contact with them or ask about the status of the issue, you will likely have more success emailing them directly at atlas@ripe.net to create a ticket in their system.