Closed duncandewhurst closed 2 years ago
https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/13 has some examples of networks with large numbers of links which might require streaming
For GeoJSON, I don't think we need to worry about packaging. As long as the network identifier is included in each feature's properties, a single GeoJSON file could contain nodes or links from multiple networks.
On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.
Edit: See https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/75 for a proposal on streaming/paginating individual networks.
@kindly, @lgs85 it would be great to get your thoughts on the proposal below.
On reflection, I think a bigger issue is API design and streaming support for links and nodes, since each dataset is likely to contain a small number of potentially large networks, rather than a large number of small networks.
Whilst this is true, the standard still needs to specify how to package multiple networks, to avoid a situation in which publishers mint a variety of packaging formats, which would make authoring tools that consume OFDS data difficult.
Based on the discussion in https://github.com/open-contracting/standard/issues/1084, offer two packaging formats each for the JSON and GeoJSON publication formats:
The approach to packaging multiple networks in CSV format will depend on the tool chosen in https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/14.
A top-level JSON object with an array of Network objects in .networks and, for data published via API, a pages object based on the pagination approach from OCDS. Note that it is named pages to avoid a clash with links.
{
  "networks": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
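As an illustration of how a consumer might read this package from an API, the sketch below follows pages.next until it is empty. It is a minimal sketch, not part of the proposal: the iter_networks helper, the fetch callable and the example.org URLs are all hypothetical, and an in-memory dictionary stands in for a real API so the example runs without network access.

```python
from typing import Callable, Iterator


def iter_networks(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every network, following pages.next until it is empty or missing."""
    url = first_url
    seen = set()
    while url and url not in seen:
        seen.add(url)  # guard against a server that links pages in a loop
        package = fetch(url)
        yield from package.get("networks", [])
        url = (package.get("pages") or {}).get("next")


# Hypothetical in-memory stand-in for an API, used only to make the sketch runnable.
FAKE_API = {
    "https://example.org/networks?page=1": {
        "networks": [{"id": "a"}, {"id": "b"}],
        "pages": {"next": "https://example.org/networks?page=2", "prev": ""},
    },
    "https://example.org/networks?page=2": {
        "networks": [{"id": "c"}],
        "pages": {"next": "", "prev": "https://example.org/networks?page=1"},
    },
}

network_ids = [
    network["id"]
    for network in iter_networks("https://example.org/networks?page=1", FAKE_API.__getitem__)
]
```

In a real client, fetch would wrap an HTTP request and JSON decoding; passing it in as a parameter keeps the paging logic independent of any particular HTTP library.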
The preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively:
{
  "nodes": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}

{
  "links": [
    {...},
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
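To make the separate-endpoint case concrete, here is a rough sketch of a client that looks up the nodes endpoint in .relatedResources and pages through it. The rel values "nodes" and "links" are an assumption made for this example (the proposal does not fix them here), and the iter_related helper, the fetch callable and the URLs are hypothetical.

```python
from typing import Callable, Iterator


def iter_related(network: dict, key: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield nodes or links for a network published via separate endpoints."""
    # Assumption: the relevant relatedResources entry has rel equal to
    # "nodes" or "links"; the proposal does not fix these values.
    for resource in network.get("relatedResources", []):
        if resource.get("rel") != key:
            continue
        url, seen = resource.get("href"), set()
        while url and url not in seen:
            seen.add(url)  # stop if the server links pages in a loop
            page = fetch(url)
            yield from page.get(key, [])
            url = (page.get("pages") or {}).get("next")


# Hypothetical in-memory stand-in for the nodes endpoint.
FAKE_API = {
    "https://example.org/networks/a/nodes?page=1": {
        "nodes": [{"id": "n1"}, {"id": "n2"}],
        "pages": {"next": "https://example.org/networks/a/nodes?page=2", "prev": ""},
    },
    "https://example.org/networks/a/nodes?page=2": {
        "nodes": [{"id": "n3"}],
        "pages": {"next": "", "prev": "https://example.org/networks/a/nodes?page=1"},
    },
}
network = {
    "id": "a",
    "relatedResources": [
        {"href": "https://example.org/networks/a/nodes?page=1", "rel": "nodes"}
    ],
}

node_ids = [node["id"] for node in iter_related(network, "nodes", FAKE_API.__getitem__)]
```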
A JSON Lines file with one network per line:
{...}
{...}
{...}
The preferred approach is to publish embedded nodes and links. If an individual network is too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively.
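The benefit of JSON Lines for streaming is that a consumer only needs to hold one line in memory at a time. A minimal sketch, assuming a hypothetical iter_json_lines helper and made-up sample records (the same function would work for a file of networks, nodes or links):

```python
import io
import json
from typing import IO, Iterator


def iter_json_lines(stream: IO[str]) -> Iterator[dict]:
    """Parse one JSON value per line, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory file stands in for a bulk download; the records are made up.
sample = io.StringIO(
    '{"id": "a", "nodes": [], "links": []}\n'
    '{"id": "b", "nodes": [], "links": []}\n'
)
ids = [network["id"] for network in iter_json_lines(sample)]
```

With a real download, the stream would be a file opened in text mode, for example open(path, encoding="utf-8").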
Publish separate files/endpoints for nodes and links, each structured as a top-level FeatureCollection object according to the GeoJSON transformation specification. Each file may contain features from multiple networks. The network each feature relates to is identified by its .properties.network.id.
For data published via API, add a top-level pages object based on the pagination approach from OCDS:
{
  "type": "FeatureCollection",
  "features": [
    {...},
    {...}
  ],
  "pages": {
    "next": "",
    "prev": ""
  }
}
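Because a single FeatureCollection may mix features from several networks, a consumer that wants one network at a time has to split on .properties.network.id. The following is a sketch only: split_by_network and the sample features are hypothetical, and the feature properties shown are a guess at the minimum needed for the example rather than the output of the transformation specification.

```python
from collections import defaultdict


def split_by_network(feature_collection: dict) -> dict:
    """Return one FeatureCollection per network id found in the input."""
    grouped = defaultdict(list)
    for feature in feature_collection.get("features", []):
        network_id = feature["properties"]["network"]["id"]
        grouped[network_id].append(feature)
    return {
        network_id: {"type": "FeatureCollection", "features": features}
        for network_id, features in grouped.items()
    }


def point(feature_id: str, network_id: str) -> dict:
    """Build a made-up node feature for the example."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "properties": {"id": feature_id, "network": {"id": network_id}},
    }


nodes = {
    "type": "FeatureCollection",
    "features": [point("n1", "a"), point("n2", "b"), point("n3", "a")],
}
per_network = split_by_network(nodes)
```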
Separate newline-delimited GeoJSON files for nodes and links, with one feature per line structured according to the GeoJSON transformation specification:
{...}
{...}
{...}
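For publishers who already have a FeatureCollection, producing the newline-delimited form amounts to serialising each feature on its own line. A small sketch, with hypothetical helper names and a made-up feature:

```python
import json
from typing import Iterable, Iterator, List


def to_ndgeojson(features: Iterable[dict]) -> Iterator[str]:
    """Serialise each feature as a single line of JSON."""
    for feature in features:
        # json.dumps without indent never emits raw newlines, so each
        # feature stays on exactly one line.
        yield json.dumps(feature, separators=(",", ":"))


def from_ndgeojson(lines: Iterable[str]) -> List[dict]:
    """Parse newline-delimited GeoJSON back into a list of features."""
    return [json.loads(line) for line in lines if line.strip()]


features = [
    {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [[0.0, 0.0], [1.0, 1.0]]},
        "properties": {"id": "l1", "name": "first\nsecond", "network": {"id": "a"}},
    }
]
lines = list(to_ndgeojson(features))
round_tripped = from_ndgeojson(lines)
```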
Do not support packaging multiple networks. Instead, publish networks one at a time, i.e. publish a JSON file for each network containing a top-level Network object. For data published via API, use .relatedResources (see https://github.com/Open-Telecoms-Data/open-fibre-data-standard/issues/75) to provide links to the next and previous networks in the series:
{
  "relatedResources": [
    {
      "href": "",
      "rel": "next"
    },
    {
      "href": "",
      "rel": "prev"
    }
  ]
}
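Under this alternative, a client would walk the series by looking for the entry with rel set to "next" in each network. A minimal sketch, in which related_href, iter_series, the fetch callable and the URLs are all hypothetical:

```python
from typing import Callable, Iterator, Optional


def related_href(network: dict, rel: str) -> Optional[str]:
    """Return the first non-empty href with the given rel, if any."""
    for resource in network.get("relatedResources", []):
        if resource.get("rel") == rel and resource.get("href"):
            return resource["href"]
    return None


def iter_series(first_url: str, fetch: Callable[[str], dict]) -> Iterator[dict]:
    """Yield networks one at a time by following rel="next" links."""
    url, seen = first_url, set()
    while url and url not in seen:
        seen.add(url)  # stop if the series links back on itself
        network = fetch(url)
        yield network
        url = related_href(network, "next")


# Hypothetical in-memory stand-in for an API serving one network per URL.
FAKE_API = {
    "https://example.org/networks/a": {
        "id": "a",
        "relatedResources": [{"href": "https://example.org/networks/b", "rel": "next"}],
    },
    "https://example.org/networks/b": {
        "id": "b",
        "relatedResources": [
            {"href": "", "rel": "next"},
            {"href": "https://example.org/networks/a", "rel": "prev"},
        ],
    },
}
series_ids = [n["id"] for n in iter_series("https://example.org/networks/a", FAKE_API.__getitem__)]
```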
As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to return in a single API response, .relatedResources should be used to provide links to separate endpoints for nodes and links, which must return a top-level JSON object with a nodes or a links array, respectively.
Pros:
Cons:
A ZIP or GZIP file containing a JSON file for each network.
As in the proposal, the preferred approach is to publish embedded nodes and links. For networks that are too large to load into memory, .relatedResources should be used to provide links to separate bulk downloads for nodes and links, which must be formatted as JSON Lines files with one node or link per line, respectively. The reason for choosing JSON Lines over ZIP/GZIP for bulk downloads of nodes and links is that networks can contain upwards of 100,000 links, so the ZIP/GZIP file could expand into upwards of 100,000 files.
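For comparison with the JSON Lines option, reading a ZIP file that holds one JSON file per network might look like the sketch below. The iter_zip_networks helper and the member file names are hypothetical, and the archive is built in memory only so the example is self-contained.

```python
import io
import json
import zipfile
from typing import BinaryIO, Iterator


def iter_zip_networks(archive_file: BinaryIO) -> Iterator[dict]:
    """Yield one network per .json member, reading members one at a time."""
    with zipfile.ZipFile(archive_file) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as member:
                    yield json.load(member)


# Build a small archive in memory; the member names are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("network-a.json", json.dumps({"id": "a"}))
    archive.writestr("network-b.json", json.dumps({"id": "b"}))
buffer.seek(0)

zip_ids = sorted(network["id"] for network in iter_zip_networks(buffer))
```

Each member is still parsed whole, so this only helps when individual networks fit in memory, which is the limitation the paragraph above describes.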
Cons:
This looks fine to me. The pages approach in the GeoJSON format looks odd and may confuse geo users who expect to get all the data in one go; if they do not, they may not be able to traverse the links. However, I see no real harm in it.
Thanks, @kindly. The pages key in the GeoJSON format is only for data that needs paginating. If the data is small enough, it can be served whole. We can make that clear in the documentation and guidance.
In case it's of interest, ArcGIS uses pagination to serve GeoJSON data (example). If the data is larger than one page, no link to the next page is provided. Instead, properties.exceededTransferLimit is set to True and the user needs to construct the next link using the resultOffset URL parameter and (presumably) some knowledge of what the transfer limit is / how many pages are returned per query.
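To show what constructing the next link involves in practice, here is a rough sketch based only on the behaviour described above. It assumes that the offset can be advanced by the number of features in the current page, which sidesteps needing to know the transfer limit; that is a guess for illustration, not documented ArcGIS behaviour, and next_page_url and the URL are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, page: dict) -> Optional[str]:
    """Build the next URL from resultOffset, or return None on the last page."""
    if not (page.get("properties") or {}).get("exceededTransferLimit"):
        return None
    parts = urlsplit(current_url)
    query = dict(parse_qsl(parts.query))
    # Assumption: advance by however many features this page returned.
    offset = int(query.get("resultOffset", 0)) + len(page.get("features", []))
    query["resultOffset"] = str(offset)
    return urlunsplit(parts._replace(query=urlencode(query)))


page = {
    "type": "FeatureCollection",
    "features": [{"type": "Feature"}, {"type": "Feature"}],
    "properties": {"exceededTransferLimit": True},
}
following = next_page_url("https://example.org/query?f=geojson&resultOffset=0", page)
last = next_page_url("https://example.org/query?f=geojson", {"type": "FeatureCollection", "features": []})
```

Compared with an explicit pages.next link, the client has to know the query parameter and the paging rule in advance.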
I think that this approach looks fine. As discussed, it'll be important to make very clear in the guidance that this is unlikely to be used in the majority of cases.
The reference documentation has been updated to reflect the proposal in this issue: https://open-fibre-data-standard.readthedocs.io/en/latest/reference/publication_formats.html
This issue will remain open against the beta milestone to gather feedback from the alpha consultation.
We've not heard any further feedback on this issue so I'm going to close it for now.
From the data stewardship, publication formats and access methods consultation document:
When designing the format, we'll need to consider streaming. See https://github.com/open-contracting/standard/issues/1084 for a related discussion. We'll also need to consider packaging for the CSV and GeoJSON formats.