Should the result of merging multiple records be a single record?

stevesong commented 6 months ago

I am wondering whether the result of a merge of a node or span should be a single record, as is currently implemented, with multiple network operators recorded in a single record, or multiple individual records but which would share a common Node ID(s) and Lat/Long. In particular, I am thinking about the ability to disaggregate the data and at what point the node or span ID becomes canonical for a given operator over a given infrastructure.

duncandewhurst commented 5 months ago

I prepared some examples to illustrate the implications of the different approaches (see below). @stevesong, I would need some more information on the use cases that you're thinking about to assess the approaches fully. Still, hopefully, the examples and accompanying notes help to progress the discussion.

Looking back at the scoping work, the overarching goal was to "build a tool to consolidate and deduplicate multiple OFDS datasets", so I think that a 'single-record' output is more in line with the expectations for the tool. I don't see any references to disaggregation in the use cases, but I think it can be achieved even with a single-record output, either by expanding the provenance object, e.g. to include copies of the 'source' features, or by the user retaining copies of the source data and looking up the identifiers referenced in the provenance object.

My main concerns with the multi-record approach are:

Non-unique node IDs within the same network do not conform to the OFDS schema and would require a major rework of the associated tooling.
I think it is conceptually confusing to have an (OFDS) network with multiple representations of the same (real) feature - deduplication seems more intuitive to me.

Regarding canonical node or span identifiers, I think that the only identifiers that we could reasonably expect to be canonical are those assigned by the operators themselves (though we did not see much evidence of that in the supply-side research), rather than any that are assigned by a tool or a regulator and passed back to operators. If preserving the original identifiers and values is a priority, I would be more inclined to investigate providing an output that preserves the source data in full and annotates it to link alternative representations of the same feature, as I've sketched out in the third option below. That type of approach allows full flexibility in what users can do with the output, but the output would require further processing by users to achieve the goal of the tool, so I don't think it is something we should switch to this late in the development.

Those are my initial thoughts - happy to discuss further!

Worked examples

These examples are in the canonical OFDS JSON format, rather than GeoJSON, for readability.

Source data

Two representations of the same network, with differing properties and geometries:

{
  "id": "A",
  "nodes": [
    {
      "id": "1",
      "name": "Manchester",
      "location": {
        "type": "Point",
        "coordinates": [
          0,
          0
        ]
      },
      "physicalInfrastructureProvider": {
        "id": "1",
        "name": "Open Reach"
      },
      "networkProviders": [
        {
          "id": "2",
          "name": "BT"
        }
      ]
    }
  ]
}

{
  "id": "B",
  "nodes": [
    {
      "id": "1",
      "name": "Greater Manchester",
      "location": {
        "type": "Point",
        "coordinates": [
          1,
          1
        ]
      },
      "physicalInfrastructureProvider": {
        "id": "1",
        "name": "Talk Talk"
      },
      "networkProviders": [
        {
          "id": "2",
          "name": "Talk Talk"
        }
      ]
    }
  ]
}

Single 'record' (current approach)

Consolidated with network A as the primary network:

{
  "id": "C", // A new network identifier (actually a UUID)
  "nodes": [
    {
      "id": "1", // A new node identifier to avoid clashes (actually a UUID, but could equally be incremental)
      "name": "Manchester", // The name from the primary network
      "location": { // The location from the primary network
        "type": "Point",
        "coordinates": [
          0,
          0
        ]
      },
      "physicalInfrastructureProvider": { // The physical infrastructure provider from the primary network
        "id": "A-1",
        "name": "Open Reach"
      },
      "networkProviders": [ // The network providers from both networks
        {
          "id": "A-2",
          "name": "BT"
        },
        {
          "id": "B-3",
          "name": "Talk Talk"
        }
      ],
      "provenance": {
        // see https://github.com/Open-Telecoms-Data/ofds_consolidation_tool/blob/main/docs/howto.md#output
      }
    }
  ]
}

Pros:

Output is easy to visualise and is of the same 'type' as the source data, i.e. a network with a single representation of the one (real) node.
Output is intuitive, i.e. it's a network with a single node, reflecting reality

Cons:

Need to inspect provenance object to 'disaggregate' data
Only the node identifiers from the secondary network are preserved in the provenance object
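
To make the single-record behaviour concrete, here is a minimal Python sketch of the merge shown above. It is only an illustration under stated assumptions: the function name `merge_nodes`, the `sourceNodes` key inside `provenance` and the exact ID-prefixing scheme are hypothetical and are not the tool's actual implementation or output format. Keeping full copies of the source nodes follows the suggestion earlier in this comment for making the output easier to disaggregate.

```python
import copy
import uuid


def merge_nodes(primary, secondary, primary_net, secondary_net):
    """Merge two representations of one node, preferring the primary's values."""
    merged = copy.deepcopy(primary)
    merged["id"] = str(uuid.uuid4())  # new identifier to avoid clashes

    def prefixed(org, net):
        # Hypothetical scheme: prefix organisation IDs with the source network ID.
        return {**org, "id": f"{net}-{org['id']}"}

    merged["physicalInfrastructureProvider"] = prefixed(
        primary["physicalInfrastructureProvider"], primary_net
    )
    # Network providers are taken from both networks, primary first.
    merged["networkProviders"] = [
        prefixed(org, primary_net) for org in primary.get("networkProviders", [])
    ] + [
        prefixed(org, secondary_net) for org in secondary.get("networkProviders", [])
    ]
    # Hypothetical provenance shape: full copies of both source nodes.
    merged["provenance"] = {
        "sourceNodes": [
            {"network": primary_net, "node": copy.deepcopy(primary)},
            {"network": secondary_net, "node": copy.deepcopy(secondary)},
        ]
    }
    return merged


node_a = {
    "id": "1",
    "name": "Manchester",
    "location": {"type": "Point", "coordinates": [0, 0]},
    "physicalInfrastructureProvider": {"id": "1", "name": "Open Reach"},
    "networkProviders": [{"id": "2", "name": "BT"}],
}
node_b = {
    "id": "1",
    "name": "Greater Manchester",
    "location": {"type": "Point", "coordinates": [1, 1]},
    "physicalInfrastructureProvider": {"id": "1", "name": "Talk Talk"},
    "networkProviders": [{"id": "2", "name": "Talk Talk"}],
}
merged = merge_nodes(node_a, node_b, "A", "B")
```

With this shape, disaggregating is a lookup in `merged["provenance"]["sourceNodes"]` rather than a join against separately retained source files.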

One network, multiple 'records' (Steve's proposed approach)

Consolidated with network A as the primary network using Steve's proposed approach:

"multiple individual records but which would share a common Node ID(s) and Lat/Long."

{
  "id": "C", // A new network identifier (actually a UUID)
  "nodes": [
    {
      "id": "A-1", // A new node identifier
      "name": "Manchester", // The name from the primary network
      "location": { // The location from the primary network
        "type": "Point",
        "coordinates": [
          0,
          0
        ]
      },
      "physicalInfrastructureProvider": { // From the primary network
        "id": "1",
        "name": "Open Reach"
      },
      "networkProviders": [ // From the primary network
        {
          "id": "2",
          "name": "BT"
        }
      ],
      "provenance": {
        ...
      }
    },
    {
      "id": "A-1", // Same as the above node identifier
      "name": "Greater Manchester", // The name from the secondary network
      "location": { // The location from the *primary* network
        "type": "Point",
        "coordinates": [
          0,
          0
        ]
      },
      "physicalInfrastructureProvider": { // From the secondary network
        "id": "1",
        "name": "Talk Talk"
      },
      "networkProviders": [ // From the secondary network
        {
          "id": "2",
          "name": "Talk Talk"
        }
      ],
      "provenance": {
        ...
      }
    }
  ]
}

Pros:

Data from both source networks is preserved, with the exception of (both) node identifiers and the location from the secondary network

Cons:

Not valid OFDS data (non-unique node identifiers)
Confusing to have multiple representations of one (real) node
Confusing that one representation is from network A whilst the other mixes properties from network B with location from network A
Difficult to visualise (overlapping nodes)
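
The validity concern can be illustrated with a short Python sketch that looks for repeated node identifiers within one network. It checks only the uniqueness requirement discussed above, not the full OFDS schema, and the function name `duplicate_node_ids` is a hypothetical helper rather than part of any existing tooling.

```python
from collections import Counter


def duplicate_node_ids(network):
    """Return the node IDs that occur more than once in a network, sorted."""
    counts = Counter(node["id"] for node in network.get("nodes", []))
    return sorted(node_id for node_id, n in counts.items() if n > 1)


# Trimmed-down versions of the multi-record and single-record outputs.
multi_record = {
    "id": "C",
    "nodes": [
        {"id": "A-1", "name": "Manchester"},
        {"id": "A-1", "name": "Greater Manchester"},
    ],
}
single_record = {"id": "C", "nodes": [{"id": "1", "name": "Manchester"}]}

print(duplicate_node_ids(multi_record))
print(duplicate_node_ids(single_record))
```

Any tooling that indexes nodes by `id` would hit the same collision that this check reports for the multi-record output.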

Multiple networks, linked features (another alternative!)

Annotate the features in the source networks with links to their alternative representations.

{
  "id": "A",
  "nodes": [
    {
      "id": "1",
      "name": "Manchester",
      "location": {
        "type": "Point",
        "coordinates": [
          0,
          0
        ]
      },
      "physicalInfrastructureProvider": {
        "id": "1",
        "name": "Open Reach"
      },
      "networkProviders": [
        {
          "id": "2",
          "name": "BT"
        }
      ],
      "sameAs": {
        "network": "B",
        "node": "1"
        // Could also include `confidence`, `similarFields` and `manual` properties from existing `provenance` object
      }
    }
  ]
}

{
  "id": "B",
  "nodes": [
    {
      "id": "1",
      "name": "Greater Manchester",
      "location": {
        "type": "Point",
        "coordinates": [
          1,
          1
        ]
      },
      "physicalInfrastructureProvider": {
        "id": "1",
        "name": "Talk Talk"
      },
      "networkProviders": [
        {
          "id": "2",
          "name": "Talk Talk"
        }
      ],
      "sameAs": {
        "network": "A",
        "node": "1"
      }
    }
  ]
}

Pros:

All original identifiers and values are preserved
Users can do whatever they want with the output (merging, consolidating, deduplicating)

Cons:

Extra processing required to produce a single consolidated or deduplicated network
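
As a rough indication of what that extra processing involves, the following Python sketch groups nodes that are linked to each other, which would be the first step before merging or deduplicating them. It assumes each `sameAs` link is stored as a property of the node it annotates and has the `network` and `node` keys used in this example; the function name `group_linked_nodes` is hypothetical.

```python
def group_linked_nodes(networks):
    """Group (network ID, node ID) pairs that `sameAs` links declare equivalent."""
    parent = {}

    def find(key):
        # Union-find lookup with path halving.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    for network in networks:
        for node in network.get("nodes", []):
            key = (network["id"], node["id"])
            find(key)  # register nodes that have no links
            link = node.get("sameAs")
            if link:
                other = (link["network"], link["node"])
                parent[find(key)] = find(other)

    groups = {}
    for key in list(parent):
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


networks = [
    {"id": "A", "nodes": [{"id": "1", "sameAs": {"network": "B", "node": "1"}}]},
    {
        "id": "B",
        "nodes": [
            {"id": "1", "sameAs": {"network": "A", "node": "1"}},
            {"id": "2"},
        ],
    },
]
linked = group_linked_nodes(networks)
```

Each resulting group lists the alternative representations of one real feature, so a user could then choose which values to keep for a consolidated record.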

stevesong commented 5 months ago

Thanks Duncan, for taking the time to think through this. I take your point about the resulting non-standard implementation of OFDS if we take the multiple records approach. And thanks too for considering an alternative to both. I suggest we carry on with the current single record approach. I was worried about regulators being able to add value to the aggregated data source and feed it back to operators but I am sure there are effective, non-automated ways of doing this.

Regarding canonical records, I fully agree that operators would/should be the authority regarding IDs for spans and nodes. In a future where there is some kind of version control for network records, perhaps permanent IDs may be possible, or perhaps it isn't important.

Looking at your examples, one area where I can see it being important is at the level of operators and operator IDs. We would want to be able to aggregate operator networks across networks and countries. Doing that would require consistent use of a network operator ID.

duncandewhurst commented 5 months ago

Thanks for the feedback :slightly_smiling_face:

:+1: to using consistent organisation identifiers. I omitted the organisations array from the above examples, but that is what the organisation references in physicalInfrastructureProvider and networkProviders refer to. Each organisation in that array includes an identifier property to disclose the organisation's legal identifier, e.g. for a UK company like Talk Talk that is registered with Companies House:

{
  "id": "01599423",
  "scheme": "GB-COH" // Companies House in this example
}

Based on experience from other standards, this area needs to be flagged quite early with implementers as they might not be collecting legal identifiers for the organisations in their data as a matter of course.
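
To show why legal identifiers matter for aggregating operators across networks, here is a small Python sketch that indexes networks by `(scheme, id)` rather than by the local organisation `id`. It assumes each network has an `organisations` array whose entries carry the `identifier` object shown above; the function name `networks_by_organisation` and the sample data are hypothetical.

```python
from collections import defaultdict


def networks_by_organisation(networks):
    """Map (scheme, legal ID) to the sorted IDs of networks naming that organisation."""
    index = defaultdict(set)
    for network in networks:
        for org in network.get("organisations", []):
            identifier = org.get("identifier") or {}
            scheme, legal_id = identifier.get("scheme"), identifier.get("id")
            if scheme and legal_id:  # organisations without a legal identifier are skipped
                index[(scheme, legal_id)].add(network["id"])
    return {key: sorted(ids) for key, ids in index.items()}


networks = [
    {
        "id": "A",
        "organisations": [
            # The local IDs differ between networks; the legal identifier does not.
            {"id": "7", "name": "Talk Talk", "identifier": {"id": "01599423", "scheme": "GB-COH"}},
        ],
    },
    {
        "id": "B",
        "organisations": [
            {"id": "1", "name": "Talk Talk", "identifier": {"id": "01599423", "scheme": "GB-COH"}},
            {"id": "2", "name": "Unidentified operator"},
        ],
    },
]
index = networks_by_organisation(networks)
```

The organisation without a legal identifier drops out of the index, which is the gap that implementers would need to close by collecting those identifiers.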

Open-Telecoms-Data / ofds_consolidation_tool