Packaging multiple networks

duncandewhurst commented 2 years ago

From the data stewardship, publication formats and access methods consultation document:

The standard should provide a standardised bulk download format for packaging multiple networks.

When designing the format, we'll need to consider streaming. See https://github.com/open-contracting/standard/issues/1084 for a related discussion. We'll also need to consider packaging for the CSV and GeoJSON formats.

duncandewhurst commented 2 years ago

https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/13 has some examples of networks with large numbers of links, which might require streaming.

duncandewhurst commented 2 years ago

For GeoJSON, I don't think we need to worry about packaging. As long as the network identifier is included in each feature's properties, a single GeoJSON file could contain nodes or links from multiple networks.

duncandewhurst commented 2 years ago

On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.

Edit: See https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/75 for a proposal on streaming/paginating individual networks.

duncandewhurst commented 2 years ago

@kindly, @lgs85 it would be great to get your thoughts on the proposal below.

On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.

Whilst this is true, the standard still needs to specify how to package multiple networks. Otherwise, publishers may mint a variety of packaging formats, which would make it difficult to author tools that consume OFDS data.

Proposal

Based on the discussion in https://github.com/open-contracting/standard/issues/1084, offer two packaging formats each for the JSON and GeoJSON publication formats:

A small file and API response format for files that are small enough to fit into memory or are published via API.
A bulk download format for files that are too large to fit into memory.

The approach to packaging multiple networks in CSV format will depend on the tool chosen in https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/14.

JSON

Small files and API responses

A top-level JSON object with an array of Network objects in .networks and, for data published via API, a pages object based on the pagination approach from OCDS. Note that it is named pages to avoid a clash with links.

{
  "networks": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
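To illustrate how a consumer might walk a paginated response of this shape, here is a minimal Python sketch. It assumes that an empty or missing pages.next marks the last page, which the proposal does not state. The iter_networks function, the fetch callable and the example.com URLs are hypothetical stand-ins, not part of OFDS.

```python
from typing import Callable, Iterator


def iter_networks(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every network from a paginated API by following pages.next."""
    url = first_url
    seen = set()  # guard against a server that links pages in a loop
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield from page.get("networks", [])
        # Assumption: an absent or empty pages.next means this is the last page.
        url = page.get("pages", {}).get("next", "")


# In-memory stand-in for an HTTP client (hypothetical URLs and network ids).
FAKE_API = {
    "https://example.com/networks?page=1": {
        "networks": [{"id": "a"}, {"id": "b"}],
        "pages": {"next": "https://example.com/networks?page=2", "prev": ""},
    },
    "https://example.com/networks?page=2": {
        "networks": [{"id": "c"}],
        "pages": {"next": "", "prev": "https://example.com/networks?page=1"},
    },
}

network_ids = [
    network["id"]
    for network in iter_networks("https://example.com/networks?page=1", FAKE_API.__getitem__)
]
```

Passing the fetch function in keeps the sketch independent of any particular HTTP library; a real client would supply a function that requests the URL and parses the JSON body.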

The preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively:

{
  "nodes": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}

{
  "links": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}

Bulk downloads

A JSON Lines file with one network per line:

{...}
{...}
{...}

The preferred approach is to publish embedded nodes and links. If an individual network is too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively.
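The point of JSON Lines here is that a consumer can process one network at a time without loading the whole file. A minimal Python sketch of such a reader follows. The iter_json_lines name and the sample id values are assumptions made for illustration only.

```python
import io
import json
from typing import IO, Iterator


def iter_json_lines(stream: IO[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# Stand-in for an open bulk download file (hypothetical content).
sample = io.StringIO(
    '{"id": "net-1", "nodes": [], "links": []}\n'
    '{"id": "net-2", "nodes": [], "links": []}\n'
)
bulk_ids = [network["id"] for network in iter_json_lines(sample)]
```

The same reader would work for the separate node and link bulk downloads, since they use one object per line as well.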

GeoJSON

Small files and API responses

Publish separate files/endpoints for nodes and links, each structured as a top-level FeatureCollection object according to the GeoJSON transformation specification. Each file may contain features from multiple networks. The network each feature relates to is identified by its .properties.network.id.

For data published via API, add a top-level pages object based on the pagination approach from OCDS:

{
  "type": "FeatureCollection",
  "features": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
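Because one file may mix features from several networks, a consumer will often need to split them back out using .properties.network.id. A minimal Python sketch of that grouping, with a hypothetical function name and made-up sample features:

```python
from collections import defaultdict


def group_features_by_network(feature_collection: dict) -> dict:
    """Map each network id to the list of features that reference it."""
    groups = defaultdict(list)
    for feature in feature_collection.get("features", []):
        # The network identifier lives at .properties.network.id, as described above.
        network_id = feature["properties"]["network"]["id"]
        groups[network_id].append(feature)
    return dict(groups)


# Hypothetical nodes file with features from two networks.
nodes = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
         "properties": {"id": "n1", "network": {"id": "net-1"}}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [1.0, 1.0]},
         "properties": {"id": "n2", "network": {"id": "net-2"}}},
        {"type": "Feature", "geometry": {"type": "Point", "coordinates": [2.0, 2.0]},
         "properties": {"id": "n3", "network": {"id": "net-1"}}},
    ],
}
grouped = group_features_by_network(nodes)
```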

Bulk downloads

Separate newline-delimited GeoJSON files for nodes and links, with one feature per line structured according to the GeoJSON transformation specification:

{...}
{...}
{...}
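As a rough illustration of how the small-file and bulk formats relate, the following Python sketch turns a FeatureCollection into newline-delimited GeoJSON text. The function name and the sample features are hypothetical. It relies on json.dumps emitting no line breaks when no indent is given.

```python
import json


def to_newline_delimited(feature_collection: dict) -> str:
    """Serialise each feature on its own line, without the FeatureCollection wrapper."""
    return "".join(
        json.dumps(feature, separators=(",", ":")) + "\n"
        for feature in feature_collection["features"]
    )


# Hypothetical links file with two features.
links = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "LineString", "coordinates": [[0.0, 0.0], [1.0, 1.0]]},
         "properties": {"id": "l1", "network": {"id": "net-1"}}},
        {"type": "Feature",
         "geometry": {"type": "LineString", "coordinates": [[1.0, 1.0], [2.0, 2.0]]},
         "properties": {"id": "l2", "network": {"id": "net-1"}}},
    ],
}
ndgeojson = to_newline_delimited(links)
```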

Other approaches considered

JSON

Small files and API responses

Do not support packaging multiple networks. Instead, publish networks one at a time, i.e. publish a JSON file for each network containing a top-level Network object. For data published via API, use .relatedResources (see https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/75) to provide links to the next and previous networks in the series:

{
  "relatedResources": [
    {
      "href": "",
      "rel": "next"
    },
    {
      "href": "",
      "rel": "prev"
    }
  ]
}
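Under this alternative, a consumer would look up the next network by its rel value. A minimal Python sketch, assuming each entry carries href and rel as in the example above; the related_href helper and the example.com URLs are hypothetical:

```python
from typing import Optional


def related_href(network: dict, rel: str) -> Optional[str]:
    """Return the href of the first related resource with the given rel, if any."""
    for resource in network.get("relatedResources", []):
        if resource.get("rel") == rel:
            # Treat an empty href as missing (assumption).
            return resource.get("href") or None
    return None


# Hypothetical network in the middle of a series.
network = {
    "relatedResources": [
        {"href": "https://example.com/networks/3", "rel": "next"},
        {"href": "https://example.com/networks/1", "rel": "prev"},
    ]
}
next_url = related_href(network, "next")
```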

As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively.

Pros:

Simpler for publishers that only need to publish one network

Cons:

Greater number of API calls required to get all the data
Inconsistency between the format of the data returned by endpoints for networks and endpoints for nodes or links.

Bulk downloads

A ZIP or GZIP file containing a JSON file for each network.

As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively. The reason for choosing JSON Lines over ZIP/GZIP for bulk downloads of nodes and links is that networks can contain upwards of 100,000 links so the ZIP/GZIP file could expand into upwards of 100,000 files.

Cons:

Inconsistency between the format of the bulk download for networks and the format of the bulk downloads for links and nodes.

kindly commented 2 years ago

This looks fine to me. The pages approach in the GeoJSON format looks odd and may confuse geo users, who expect to get all the data in one go. If they do not, they may not be able to traverse the links. However, I see no real harm in it.

duncandewhurst commented 2 years ago

Thanks, @kindly. The pages key in the GeoJSON format is only for data that needs paginating. If the data is small enough, it can be served whole. We can make that clear in the documentation and guidance.

In case it's of interest, ArcGIS uses pagination to serve GeoJSON data (example). If the data is larger than one page, no link to the next page is provided. Instead, properties.exceededTransferLimit is set to True and the user needs to construct the next link using the resultOffset URL parameter and (presumably) some knowledge of what the transfer limit is / how many pages are returned per query.
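To show what "construct the next link" means in practice, here is a minimal Python sketch that advances resultOffset by the number of features received. The example.com URL and the next_offset_url name are hypothetical, and the sketch assumes the offset counts features.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_offset_url(url: str, features_received: int) -> str:
    """Return the same URL with resultOffset advanced by features_received."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    # Assumption: a missing resultOffset means the first page, i.e. offset 0.
    offset = int(query.get("resultOffset", "0"))
    query["resultOffset"] = str(offset + features_received)
    return urlunsplit(parts._replace(query=urlencode(query)))


first = "https://example.com/query?f=geojson"
second = next_offset_url(first, 1000)
third = next_offset_url(second, 1000)
```

A pages.next link, as proposed in this issue, spares the consumer this bookkeeping.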

lgs85 commented 2 years ago

I think that this approach looks fine. As discussed, it'll be important to make very clear in the guidance that this is unlikely to be used in the majority of cases.

duncandewhurst commented 2 years ago

The reference documentation has been updated to reflect the proposal in this issue: https://open-fibre-data-standard.readthedocs.io/en/latest/reference/publication_formats.html

This issue will remain open against the beta milestone to gather feedback from the alpha consultation.

duncandewhurst commented 2 years ago

We've not heard any further feedback on this issue so I'm going to close it for now.
