cityjson / specs

Specifications for CityJSON, a JSON-based encoding for 3D city models
https://cityjson.org
Creative Commons Zero v1.0 Universal

Storing metadata separately #67

Closed. Athelena closed this issue 3 years ago.

Athelena commented 4 years ago

Can we have a discussion around whether we can store metadata in a separate file? There are many file formats that allow for this option, especially when you have a massive dataset and you wish to explore the metadata first without having to open the file itself. How should this then be stored?

A question I've been thinking of for a while: https://github.com/cityjson/cityjson-qgis-plugin/issues/17

liberostelios commented 4 years ago

I like the idea! It makes sense to split it into another file, especially if you want to have a big chunk of metadata.

I think we could just have the metadata in a separate json file, potentially with the same name but a "metadata" suffix. For instance, if your original CityJSON file is called den_haag.json (or den_haag.cityjson as per #64), then we can put the metadata in a file called den_haag.metadata.json.

I think the CityJSON file itself does not necessarily need to point to the metadata file, for simplicity. But we could also consider the option of having the metadata's filename as a value of the "metadata" property in the original file.
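
A minimal sketch of that lookup, purely illustrative (the .metadata.json suffix is only the suggestion above, and the fallback is the "metadata" property already in the spec):

import json
from pathlib import Path

def load_metadata(cityjson_path):
    #-- prefer a sidecar file: den_haag.json -> den_haag.metadata.json
    p = Path(cityjson_path)
    sidecar = p.with_suffix(".metadata.json")
    if sidecar.exists():
        return json.loads(sidecar.read_text())
    #-- otherwise fall back to the metadata embedded in the file itself
    cm = json.loads(p.read_text())
    return cm.get("metadata")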

kenohori commented 4 years ago

I think it's best to keep it in the same file, to be honest. Some reasons:

* make sure that data is kept together with metadata

* easier to send and receive

* avoid sandboxing restrictions

liberostelios commented 4 years ago

But if you have them together, it means that you have to parse the whole file just to get the metadata. In case of a huge city model that's a waste of resources. There are clear benefits of having the option to save metadata in a different file, in my opinion.

* make sure that data is kept together with metadata

* easier to send and receive

Those are two clear downsides. Although, on the other hand, having metadata in a separate file means there is more freedom in exchanging information regarding the discovery of the data through the web (one of the main use cases of CityJSON). So I would argue that it's not necessarily easier to send/receive data. I can definitely see the benefits of having separate metadata files to "cheaply" exchange information about tiled datasets.

* avoid sandboxing restrictions

I am not sure what you mean by that. Are you referring to mobile devices, for instance, where you are working in a restricted sandboxed environment?

Athelena commented 4 years ago

Personally I think it should be a choice. We should just have guidance and support for it from our end. There are advantages and disadvantages to both storage methods, so why not at least allow for both? People use metadata in many ways and as Stelios pointed out, discovery is one such crucial pillar. Supporting separate storage and showing people the tools is only a plus I think. As an option it’s not harmful, right?

kenohori commented 4 years ago

Agreed that there are pros and cons, but my general feeling is that there are way more cons to be honest. And I'd say that unless there are strong use cases for both, we should generally stick to one way of doing things. After all, behind every user choice, there is a developer that needs to implement both approaches (cue GML).

You have to parse the whole file just to get the metadata

You don't need to parse the whole file though... just read up to the metadata object and parse that part. Maybe we could have a best practice to store it before the city objects and vertices? We could add a quick metadata viewer as sample code.

Re: sandboxing, I think we're moving towards a future where a program doesn't have general access to the disk (regardless of device). So, a 3D city model should be either:

  1. the JSON file, which is directly provided by the user, or
  2. a compressed (?) folder with multiple files, where the main JSON file is specified in some way
hugoledoux commented 4 years ago

I have to agree with Ken here, the advantages are not that many. And the epsg is in the metadata; requiring another file for something so important is asking a lot. Basically we would force all CityJSON files to have 2 files, and you can't drag'n'drop into ninja as easily anymore.

You don't need to parse the whole file though... just read up to the metadata object and parse that part. Maybe we could have a best practice to store it before the city objects and vertices? We could add a quick metadata viewer as sample code.

This is easier said than done though. Many JSON libraries do not allow this; I'm not even sure if nlohmann does. Python 3.6+ does, but earlier versions don't. Or let's say: it's always possible, but the work to be done can be painful.
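
One workaround that needs no special library support, sketched here with an illustrative path: locate the key textually, then let the standard library's json.JSONDecoder.raw_decode parse just that one value (it returns the object plus the index where it stopped, and, unlike naive brace counting, it handles braces inside strings):

import json

def read_metadata(path):
    with open(path, "r") as f:
        s = f.read()  #-- the file is still read, but not fully parsed
    pos = s.find('"metadata"')  #-- assumes this match is the top-level key
    if pos == -1:
        return None
    start = s.find("{", pos)
    obj, end = json.JSONDecoder().raw_decode(s, start)
    return obj

print(read_metadata("den_haag.json"))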

But how much of an issue is it in practice though? With a ~300MB file (Zürich) it takes 16s and it's Python:

time cjio Building_LoD2_V10.json info
Parsing Building_LoD2_V10.json
{
  "cityjson_version": "1.0",
  "epsg": null,
  "bbox": [
    2677116.375,
    1241839.025,
    0.0,
    2689381.984,
    1254150.95,
    1044.25
  ],
  "transform/compressed": true,
  "cityobjects_total": 52834,
  "cityobjects_present": [
    "Building",
    "BuildingPart"
  ],
  "materials": false,
  "textures": false
}
cjio Building_LoD2_V10.json info  16.30s user 1.05s system 99% cpu 17.430 total
liberostelios commented 4 years ago

You have to parse the whole file just to get the metadata

You don't need to parse the whole file though... just read up to the metadata object and parse that part. Maybe we could have a best practice to store it before the city objects and vertices? We could add a quick metadata viewer as sample code.

That's not always possible. First of all, not all parsers allow you to partially parse a file. Second, many parsers won't allow you to dictate the position of an individual property when you write to JSON. I know for sure that the Python json library does this automatically. Maybe there is an option I am missing here, but that's the default behaviour.
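
For the reading side, streaming parsers can do this; a minimal sketch with the third-party ijson library (assuming it is acceptable as a dependency, and with an illustrative filename), which only materialises the value under a given key prefix:

import ijson  #-- third-party streaming JSON parser (pip install ijson)

with open("den_haag.json", "rb") as f:
    #-- only the object under the top-level "metadata" key is built;
    #-- the rest of the file is scanned but never turned into Python objects
    for metadata in ijson.items(f, "metadata"):
        print(metadata)
        break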

Re: sandboxing, I think we're moving towards a future where a program doesn't have general access to the disk (regardless of device). So, a 3D city model should be either:

1. the JSON file, which is directly provided by the user, or

2. a compressed (?) folder with multiple files, where the main JSON file is specified in some way

Not having direct access to the disk doesn't necessarily mean much by itself. Indeed, we need to take into account a more abstract way of accessing the data, but that could be through multiple queries to a (web) API, for instance. I see your point mostly as criticism of the implied filename of the metadata, which is a fair point. But maybe we should think about it the other way around: maybe the first point of access to a 3D city model is the metadata, and the location of the actual data is somewhere else, so you have to follow the link (see more about it below). This is, of course, not the default scenario, but an additional possibility.

I have to agree with Ken here, the advantages are not that many. And the epsg is in the metadata; requiring another file for something so important is asking a lot. Basically we would force all CityJSON files to have 2 files, and you can't drag'n'drop into ninja as easily anymore.

I should also clarify that I mean this as an option. I still think that metadata should be able to be stored in the file itself (enough to make it autonomous), but we should allow it to be stored also in a separate file. That's in case of very big files or when having multiple tiles.

The more I think about it, the more I see the advantages. I was thinking about 3D Tiles, for instance: essentially the data are stored in b3dm files and the access point of the data is the tileset.json, which is practically only metadata. For me, that's very much in line with the most typical use case of 3D city models: serving data as tiles. If we allow the metadata to be the "central" point of accessing a 3D city model, this opens a whole range of possibilities for using CityJSON for tiled data. I am sure, for instance, we could take great advantage of that in the 3D BAG project.

Athelena commented 4 years ago

I see the point regarding epsg, but again like Stelios said it's not an 'either/or' scenario and my previous phrasing makes it sound like that's what I meant, sorry. We should support separate storage in cases where people have a need for it, storing metadata for tiles is a wonderful example. Creating data discovery databases is another.

kenohori commented 4 years ago

I fully agree with the data discovery bit, but don't think that's a main purpose of cityjson? Why not create a format specifically suited for that? We've had previous discussions about tiling and master files and so on, and I love the idea, but I think that should be separate from the main cityjson spec. I'd say keep the standard lean and simple, with as few options as possible.

Re: parsing, I don't think you need to send whole files to be parsed as json at all. If you just want an optimised way to get the metadata, just treat it as a plain text file and extract the metadata object (everything between its braces): count each "{" as +1 and each "}" as -1. Then pass only that part of the file to the json parser.

hugoledoux commented 4 years ago

Actually, this discussion is linked to how a RESTful/WFS3 service works, where metadata cannot be attached to each feature, so there has to be another way. I fiddled a bit with this, but Xiaoai is starting in September and she needs to investigate this, so I propose we put this on hold for now.

Re: parsing, I don't think you need to send whole files to be parsed as json at all. If you just want an optimised way to get the metadata, just treat it as a plain text file and extract the metadata object (everything between its braces): count each "{" as +1 and each "}" as -1. Then pass only that part of the file to the json parser.

This sounds like the best solution at this moment, I'll try this this week (in cjio). Let's see how much it improves speed.

hugoledoux commented 4 years ago

it's fast, with the same file: https://asciinema.org/a/bnC0612i8gyCvTfpMAk1iPucG

liberostelios commented 4 years ago

Actually, this discussion is linked to how a RESTful/WFS3 service works, where metadata cannot be attached to each feature, so there has to be another way. I fiddled a bit with this, but Xiaoai is starting in September and she needs to investigate this, so I propose we put this on hold for now.

I think RESTful/WFS3 is a different use case. The main use case here is tiled CityJSON with a central metadata file as a static dataset, as opposed to a dynamic RESTful service.

This sounds like the best solution at this moment, I'll try this this week (in cjio). Let's see how much it improves speed.

That's cool. But is that the kind of simplicity we want to offer to developers? Writing their own "pre-parser"?

But, personally, I think parsing performance is the least important of the points raised above. There are others, like the amount of data that needs to be exchanged over the web and the possibility of extending this for tiled datasets, that make this topic more interesting.

I am willing to investigate the possibilities of a new specification that would describe that. Although, to be honest, it feels a bit like OGCing the problem. For me it would be simpler to experiment in an iterative way: first allow the existing metadata to live in a different file, and then investigate the possibilities of what to store there. If that expands into something bloated, then we can consider isolating it into its own standard.

hugoledoux commented 4 years ago

I meant: with the WFS3 project, we can figure out what we want to do with metadata, eg do we want to store data/values that can be calculated on-the-fly?

Make a proposal then? But I see issues if both options are supported: some will do half/half... some metadata in the file, some in the separate one. I just don't think it's the most urgent issue to be solved, to be honest.

Athelena commented 4 years ago

This "on-the-fly" conversation, we've had it before, just because you can calculate it on-the-fly doesn't mean it's not useful to store in metadata. I just think you don't really understand what metadata is for, and how non-developers might use it.

Your over-commitment to simplicity makes it seem like you're allergic to options.

hugoledoux commented 4 years ago

This "on-the-fly" conversation, we've had it before, just because you can calculate it on-the-fly doesn't mean it's not useful to store in metadata. I just think you don't really understand what metadata is for, and how non-developers might use it.

Your over-commitment to simplicity makes it seem like you're allergic to options.

I'll take this as a compliment.

hugoledoux commented 4 years ago

so it's fast, but it can't be used in cjio easily, because the idea of passing the CM (a CityJSON object) from operator to operator breaks here. Not sure how to tackle this, will think.

Code here to remember:

import json

f = open("/Users/hugo/data/cityjson/examples/zurich/Building_LoD2_V10.json", "r")
s = f.read()
#-- find "metadata" and the opening brace of its value
posm = s.find("metadata")
pos_start = s.find("{", posm)
pos_end = 0
cur = pos_start
count = 1
#-- scan forward, balancing braces (note: this breaks if a string value
#-- inside the metadata itself contains "{" or "}")
while True:
    a = s.find("{", cur+1)
    b = s.find("}", cur+1)
    if a != -1 and a < b:  #-- the next brace is an opening one
        count += 1
        cur = a
    else:  #-- the next brace is a closing one (a == -1 means none are left)
        count -= 1
        cur = b
    if count == 0:
        pos_end = b
        break
m = s[pos_start:pos_end+1]
jm = json.loads(m)
print(jm)
kenohori commented 4 years ago

Honestly, I feel like this discussion is going off-track. In short, I'd say that keeping metadata in an external file too (ie a copy-ish of the metadata) is fine, but asking developers to check an external file to get basic information about the file they're already reading is not.

That's cool. But is that the kind of simplicity we want to offer to developers? Writing their own "pre-parser"?

I've made this argument before, but simple things (ie reading the metadata out of a file) should be simple (ie not looking around and parsing multiple files). If you want complex things (ie the fastest possible implementation), then doing extra work is okay. Making simple things complex discourages both users and developers.

I am willing to investigate the possibilities of a new specification that would describe that. Although, to be honest, it feels a bit like OGCing the problem. For me it would be simpler to experiment in an iterative way: first allow the existing metadata to live in a different file, and then investigate the possibilities of what to store there. If that expands into something bloated, then we can consider isolating it into its own standard.

I agree with the iterative approach, but IMO the OGC way is very much adding every single option that someone wants, then not caring whether they're implemented or not...

I just think you don't really understand what metadata is for, and how non-developers might use it.

Not sure what to say, other than this feels... down-putting and uncalled for?

liberostelios commented 4 years ago

Honestly, I feel like this discussion is going off-track. In short, I'd say that keeping metadata in an external file too (ie a copy-ish of the metadata) is fine, but asking developers to check an external file to get basic information about the file they're already reading is not.

I think for basic metadata the external file is a bit of an overkill, indeed. But I am sure there are cases where it would be useful. And I think we need to be open to what people besides developers might need, in this case. I say let's just play around with the concept and see what the consequences are. I know I will do that anyway.

I've made this argument before, but simple things (ie reading the metadata out of a file) should be simple (ie not looking around and parsing multiple files). If you want complex things (ie the fastest possible implementation), then doing extra work is okay. Making simple things complex discourages both users and developers.

I think there is a balance to be found between users and developers. Tbh, I don't think users and developers mean the same thing by simplicity. For instance, users normally don't care that much about files and specs, but more about how easy it is to do things through their tools. That makes developers' lives more complex, because they have to hide all that through abstraction. That's a long discussion, though, which I am just putting here as an open thought.

I agree with the iterative approach, but IMO the OGC way is very much adding every single option that someone wants, then not caring whether they're implemented or not...

That's a fair point. I might have exaggerated there.

Not sure what to say, other than this feels... down-putting and uncalled for?

I still think we should look at the point here. As valid as the argument is that we should make things simple for developers, there is still the question of whether this is useful for users and if it helps people access the data.

lordAnubis commented 4 years ago

A fully self-contained file is worth more to the user than a split file where a part can get lost. The processing time is then no longer a problem, compared to searching for and retrieving the missing part and communicating with the creator or sender about it.

Perhaps it would be useful to ask the JSON dev community for a JSON container that can contain 1 or more files.

liberostelios commented 4 years ago

Having a container sounds like a nice idea, but I think it would require a solution with a header of some sort and not a typical JSON structure.
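
Just to make the container idea concrete, a minimal sketch of one possible (entirely hypothetical) layout: a plain zip archive with an agreed member name for the metadata, so the small metadata member can be read without decompressing the large city model next to it:

import json
import zipfile

def read_container_metadata(path):
    with zipfile.ZipFile(path) as z:
        #-- "metadata.json" is a hypothetical member name; the geometry
        #-- stored alongside it in the archive is never touched
        with z.open("metadata.json") as f:
            return json.load(f)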

But I still think that the metadata and the file itself could be treated as independent, for several reasons which I mentioned above. It just took me 10' of waiting for the Python parser to parse a 2.5GB file, only to realise that there is no practical metadata in there!

lordAnubis commented 4 years ago

I understand that, and your other comments about the pros, but have you ever had to wait for the sender of the file to make clear that you also need the other part, and then wait until you received it? Compared to that, 10' is nothing.

Also, how will users recognise whether or not they need the metadata file; when the program asks for it? I'm a developer and a user; in both roles I hate it when I have to open a secondary file.

Also, asking a developer to check for a second file to get the meta information about the file they are already working on has a huge development cost. They must open a file browser so that the user can select the file from anywhere; but now what if a metadata file from an older data version, or even from a completely different JSON dataset, is chosen? That means error checking and longer waits, with still-unreliable data.

I would say keep it simple, for the developer and the user. In two years' time, the file now processed in 1 minute will be processed in 30 seconds. Perhaps create smaller files for the time being and make it possible to link two or more adjacent areas automatically.

hugoledoux commented 4 years ago

And a "metadata extractor" can be built very easily, and it's fast: I used that 3.1GB 3D NL dataset to just print out the property.

with Python JSON lib

f = open("/Users/hugo/data/cityjson/37en2.volledig/37en2.json")
import json
j1 = json.loads(f.read())
bbox = j1['metadata']['geographicalExtent']
print(bbox[0], bbox[1])

=> 144s

with simdjson (https://pypi.org/project/pysimdjson/)

f = open("/Users/hugo/data/cityjson/37en2.volledig/37en2.json")
import simdjson
parser = simdjson.Parser()
doc = parser.parse(f.read())
bbox = doc['metadata']['geographicalExtent']
print(bbox[0], bbox[1])

=> 8s

I'd argue that 8s on a very abnormal (hopefully!) 3.1GB file ain't bad, is it?

liberostelios commented 4 years ago

I understand that, and your other comments about the pros, but have you ever had to wait for the sender of the file to make clear that you also need the other part, and then wait until you received it? Compared to that, 10' is nothing.

Also, how will users recognise whether or not they need the metadata file; when the program asks for it? I'm a developer and a user; in both roles I hate it when I have to open a secondary file.

Also, asking a developer to check for a second file to get the meta information about the file they are already working on has a huge development cost. They must open a file browser so that the user can select the file from anywhere; but now what if a metadata file from an older data version, or even from a completely different JSON dataset, is chosen? That means error checking and longer waits, with still-unreliable data.

That's valid criticism, but it's based on the false assumption that the user will first open the CityJSON file itself and then the metadata. There's no real benefit to that; it's a lost cause, the cons far outweigh the pros in that case, and that's not how metadata (should) work. I am convinced that the plain performance argument isn't worth the pain.

But the idea is that the metadata file is the first (and only) file that the user opens. The developer is responsible for opening and analysing it, so they know what to expect. The metadata then contains the link(s) to the file(s) of the dataset it relates to.

I need to clarify again: if one wants to open an independent CityJSON file by itself, it should always be fine. You shouldn't need any external files to work with it (the file should be autonomous and still have a crs and everything). But if you want to work with the whole dataset (i.e., knowing the extents of all the tiles, picking the tile that you want etc.), then your entry point should be the metadata file.

That solves a huge issue for users. Right now, those files are distributed in very unfriendly ways (e.g. just listed through FTP or with some minimal platforms that give you an extract of 2D tilesets). In addition, the user can download just a small file with the metadata, that: a) informs the user about what to expect from the dataset (e.g. extents, attributes, LoDs present etc.); and b) has the links to the tiles of the dataset itself. It's the developer's responsibility to offer them a nice way to interact with that.
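
To make that workflow concrete, a sketch of what an entry-point metadata file could enable. The "tiles" and "url" properties are entirely hypothetical (nothing in the spec defines them); only the [minx, miny, minz, maxx, maxy, maxz] extent order is borrowed from CityJSON's geographicalExtent:

import json
import urllib.request

def pick_tile(metadata_url, x, y):
    #-- fetch the small dataset-level metadata file first...
    with urllib.request.urlopen(metadata_url) as r:
        dataset = json.load(r)
    #-- ...then follow the link to the one tile that contains (x, y)
    for tile in dataset["tiles"]:  #-- hypothetical "tiles" property
        minx, miny, _, maxx, maxy, _ = tile["geographicalExtent"]
        if minx <= x <= maxx and miny <= y <= maxy:
            return tile["url"]  #-- hypothetical link to the CityJSON tile
    return None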

Yes, as developers we can just say "I don't want to deal with that, the user is responsible for how they find a file", but that's not to say this isn't a real problem. Unifying the way tiled datasets are distributed (which is the vast majority for CityJSON) is a huge plus there and I believe it's worth the pain.

balazsdukai commented 3 years ago

That's valid criticism, but it's based on the false assumption that the user will first open the CityJSON file itself and then the metadata.

I would say that your statement on "false assumption" is false itself. I certainly do open the data files first, and let's say I'm a tech-savvy user. My experience dictates that less tech-savvy users are even less concerned about metadata (or readme-s, or documentation etc.).

The issue that you mention as "huge", I see differently. In my opinion, metadata needs to be displayed at the source first, through some webpage, table, API etc. This is checked and queried by the user before downloading any files. But presenting the metadata at the data source is the task of the data provider, and the way of presentation depends on what fits the target audience best. There is no need for a dedicated, standardized metadata file. I think a fine example of this is the BGT download API at PDOK, where one can query the metadata first, before downloading a "Full Predefined" data set (which could be the equivalent of a cityjson file): https://api.pdok.nl/lv/bgt/download/v1_0/ui/#/Full%20Predefined

Once a file is downloaded, it is cleaner, simpler and easier, I think, for both the user and the developer to expect the metadata within the same file as the data. Reading the metadata from a full cityjson file is a non-issue, both in programming complexity and in performance, as the test from @hugoledoux shows.

I believe that simplicity and performance are the two main arguments for cityjson, and these two properties make it as user-friendly as possible. Keeping attachment files, and having to account for them, would reduce this simplicity.

Finally, it seems to me that a fundamental difference in opinions in this thread lies in what we think cityjson "should" or "shouldn't" be.

Unifying the way tiled datasets are distributed (which is the vast majority for CityJSON) is a huge plus there and I believe it's worth the pain.

Your comment @liberostelios indicates to me that you might see the future of cityjson as more akin to Cesium's 3DTiles and other tiled standards. In that context, having separate metadata files is completely sensible, as they would serve a similar role to 3DTiles' tilesets. However, I think that any sort of tiling scheme should not be part of the core specification, and that cityjson's main goal is to facilitate the exchange of 3d city models in flat files that are small in size and relatively easy to use/process. Hence my arguments for embedded metadata. But this goes beyond metadata and is better discussed in a separate thread.

liberostelios commented 3 years ago

I would say that your statement on "false assumption" is false itself. I certainly do open the data files first, and let's say I'm a tech-savvy user. My experience dictates that less tech-savvy users are even less concerned about metadata (or readme-s, or documentation etc.).

I probably didn't explain myself very clearly there. When I describe a user that opens a file, I am not really talking about a user that literally opens the metadata file in a text editor, but about someone that uses a tool that supports parsing the metadata. You might think of it as some form of torrent file: the user doesn't open the file to see what's in it, they just open it with a torrent client, and the latter is responsible for resolving everything and taking care of the actual data.

The issue that you mention as "huge", I see differently. In my opinion, metadata needs to be displayed at the source first, through some webpage, table, API etc. This is checked and queried by the user before downloading any files. But presenting the metadata at the data source is the task of the data provider, and the way of presentation depends on what fits the target audience best. There is no need for a dedicated, standardized metadata file. I think a fine example of this is the BGT download API at PDOK, where one can query the metadata first, before downloading a "Full Predefined" data set (which could be the equivalent of a cityjson file): https://api.pdok.nl/lv/bgt/download/v1_0/ui/#/Full%20Predefined

Once a file is downloaded, it is cleaner, simpler and easier, I think, for both the user and the developer to expect the metadata within the same file as the data. Reading the metadata from a full cityjson file is a non-issue, both in programming complexity and in performance, as the test from @hugoledoux shows.

I think this is our fundamental disagreement. I agree that the provider should provide the metadata, but I strongly disagree that it should stick to the source and serve only an introductory purpose. For me, the metadata is part of the dataset itself, and whatever information you get from it should be available independently of the source that you downloaded it from.

For instance, imagine someone wanting to download the whole 3D BAG. Right now, the website contains all metadata about tiling etc. But the website and whatever functionalities we offer are the only way to describe the whole dataset. After you download an individual tile, you lose track of the whole context and you can't retrieve it unless you visit the source again. This means there is no self-contained way of describing the whole 3D BAG without visiting 3dbag.nl. To me, that's a big downside. I should be able to get information about the whole dataset without relying on a specific data point on the web.

I believe that simplicity and performance are the two main arguments for cityjson, and these two properties make it as user-friendly as possible. Keeping attachment files, and having to account for them, would reduce this simplicity.

That's a fair point that I have acknowledged from the beginning. But, you can't make an omelet if you don't break some eggs. Having indexed vertices also reduces simplicity, but CityJSON does it because it optimises the file size. It's just a matter of prioritising features after all, not just plainly reducing simplicity.

Finally, it seems to me that a fundamental difference in opinions in this thread lies in what we think cityjson "should" or "shouldn't" be.

Indeed.

Your comment @liberostelios indicates to me that you might see the future of cityjson as more akin to Cesium's 3DTiles and other tiled standards. In that context, having separate metadata files is completely sensible, as they would serve a similar role to 3DTiles' tilesets. However, I think that any sort of tiling scheme should not be part of the core specification, and that cityjson's main goal is to facilitate the exchange of 3d city models in flat files that are small in size and relatively easy to use/process. Hence my arguments for embedded metadata. But this goes beyond metadata and is better discussed in a separate thread.

I find this statement a bit contradictory. On the one hand, you acknowledge that CityJSON files should be flat and serve more the purpose of storing the geometric/semantic data, but on the other hand you suggest that embedding all metadata in the file (as it is now) is the best solution.

I think the proposed solution of handling metadata outside the CityJSON file actually serves the purpose of simplifying the data file itself. This is because the CityJSON file itself can be responsible only for geometry/semantics (plus some basic metadata at the tile level), while all the heavy work of describing the whole dataset (its tiling scheme, lineage, data sources etc.) can be delegated to an independent file.

I guess it's a fair point to say that such a specification can't be at the core. But I disagree with the notion that the tiling scheme is something different than the metadata. To me, any dataset description contains the tiling scheme and the purpose of metadata is to describe the whole dataset.

TkTech commented 3 years ago

with simdjson (https://pypi.org/project/pysimdjson/)

f = open("/Users/hugo/data/cityjson/37en2.volledig/37en2.json")
import simdjson
parser = simdjson.Parser()
doc = parser.parse(f.read())
bbox = doc['metadata']['geographicalExtent']
print(bbox[0], bbox[1])

=> 8s

I'd argue that 8s on a very abnormal (hopefully!) 3.1GB file ain't bad, is it?

@hugoledoux If there are common use patterns for accessing the metadata, I don't mind looking at adding support for them directly in pysimdjson. For example, if you have a looong list of points in geographicalExtent or boundaries, you could look at using as_buffer(), which will return a Buffer. These can be passed directly into numpy and other high-performance mapping libraries, or other C extensions for visualization, without ever creating Python objects. It is much faster than any Python approach.
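
A sketch of what that could look like (taking the as_buffer behaviour described above as given; the path is illustrative):

import numpy as np
import simdjson

parser = simdjson.Parser()
doc = parser.load("37en2.json")
# Copy the array into a flat C buffer of doubles ('d') and wrap it with
# numpy, without ever creating a Python object per element.
extent = np.frombuffer(doc["metadata"]["geographicalExtent"].as_buffer(of_type="d"))
print(extent)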

If you just want to pull some out and drop it next to the main file, calling parse() and then using .mini with a JSON pointer will be the fastest approach with the lowest memory overhead:

import simdjson

parser = simdjson.Parser()
# Lower memory overhead, let it read directly from disk in native code.
doc = parser.load("/Users/hugo/data/cityjson/37en2.volledig/37en2.json")
with open('extents.json', 'wb') as fout:
    # Using a pointer avoids the creation of a Dict for the metadata object.
    # Using mini returns a fast byte string.
    fout.write(doc.at_pointer('/metadata/geographicalExtent').mini)
Athelena commented 3 years ago

Since metadata is moving out of the core, I think this discussion isn't relevant anymore. We can think of anything else as part of an extension.