Support separate but local data files via data.url property for Vega-Lite

Thank you very much for supporting Vega and Vega-Lite. It works nicely.

I have a specific request regarding the location of the visualized data:

I prefer keeping the visualized data in a separate local file (instead of inlining it in the diagram specification) via the "data.url" property (https://vega.github.io/vega-lite/docs/data.html#url), because the file can be smaller (e.g. CSV instead of JSON), is easier to edit, and gives a better separation of concerns.

Example: I prefer

{
    "data": {
        "url": "data.csv"
    }
}

over

{
    "data": {
        "values": [
            {
                "a": "2020-01-05",
                "b": 0.3,
                "c": "C1"
            },
            {
                "a": "2020-01-15",
                "b": 0.7,
                "c": "C1"
            }
        ]
    }
}

A separate data file is currently not possible with "asciidoctor-kroki" (it works when using Vega-Lite directly), because the following error is reported:

Skipping vegalite block macro. No such file: https://kroki.io/vegalite/svg/....

I suppose the reason for this error is that the referenced data file (i.e. "data.csv") is not uploaded to Kroki. One "workaround" could therefore be to make "data.csv" publicly available:

{
    "data": {
        "url": "http://.../data.csv"
    }
}

But this does not work either (it works when using Vega-Lite directly, i.e. without "asciidoctor-kroki"). Even if the remote URL worked, it would have the disadvantage that the file must be available before the Asciidoctor generation with :kroki-fetch-diagram: true.

I created a public repo where you can find all three variants: https://gitlab.com/winni/asciidoctor-kroki-vegalite

The generated result can be found here: https://winni.gitlab.io/asciidoctor-kroki-vegalite/test.html

I don't know if there is a feasible solution to support local data files and how much effort it would be to implement.

Perhaps the "workaround" with remote URLs can be implemented easily (or could it cause a security problem)?

I don't know if this is a very special use case in combination with Vega-Lite, or if this or similar requirements would make sense for other types of diagrams.

What do you think?

@stenzengel I also like referenced data and the include features of such languages, because they support separation of concerns.

I think it's similar to my request for PlantUML, and @Mogztter also mentioned Vega's data: https://github.com/Mogztter/asciidoctor-kroki/issues/49#issuecomment-616730561

So it's disabled for security reasons. I like the idea of a preprocessing task that fetches the data or include from a remote location (HTTP) into a local "cache", or just uses the local file, and then embeds the content in the original file in place of the reference.

> I suppose the reason for this error is that the referenced data file (i.e. "data.csv") is not uploaded to Kroki. One "workaround" could therefore be to make "data.csv" publicly available

Kroki does not fetch data from local or remote locations for security reasons. I'm not a security expert, but I believe that loading arbitrary data from an untrusted origin is a major security concern.

> I think it's similar to my request for PlantUML, and @Mogztter also mentioned Vega's data: https://github.com/Mogztter/asciidoctor-kroki/issues/49#issuecomment-616730561
>
> So it's disabled for security reasons.

Yes you are absolutely right.

> I like the idea of a preprocessing task that fetches the data or include from a remote location (HTTP) into a local "cache", or just uses the local file, and then embeds the content in the original file in place of the reference.

Indeed, I think that's the right way to solve this issue :+1:

In the case of Vega-Lite, if you want your data in CSV format, a client-side preprocessing task unfortunately cannot simply inline (embed or replace) the data file, since the format must be changed from CSV to JSON. In fact, the referenced file can have multiple formats ("json", "csv", "tsv", "dsv") and can even be interpreted by a "parse" object (https://vega.github.io/vega-lite/docs/data.html#format).
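To make the format question concrete, here is a minimal Python sketch of how a preprocessing step might choose the `format.type` for a referenced file. The function name `infer_format` is hypothetical. The Vega-Lite documentation states that the default type is determined by the file extension, with "json" as the fallback; treating unknown extensions such as ".txt" as "json" is an assumption of this sketch.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Format types listed in the Vega-Lite data documentation.
KNOWN_TYPES = {"json", "csv", "tsv", "dsv"}


def infer_format(url, explicit_format=None):
    """Return a Vega-Lite style ``format`` object for the file behind ``url``.

    Hypothetical helper: an explicit ``type`` always wins; otherwise the
    type is guessed from the file extension, falling back to "json".
    """
    fmt = dict(explicit_format or {})
    if "type" not in fmt:
        # Look only at the path part so query strings do not disturb the extension.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
        fmt["type"] = suffix if suffix in KNOWN_TYPES else "json"
    return fmt
```

Any other keys of an existing `format` object (for example "parse" or "delimiter") are passed through unchanged, so they would still apply after the data is inlined.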

In fact, I had originally hoped that a generic "upload of referenced files" or "upload of multiple files" solution could be implemented. What exactly would be the problem with uploading the referenced data file (in my example "data.csv") together with the referencing source file (the Vega-Lite JSON view specification, https://vega.github.io/vega-lite/docs/spec.html) to the server? I don't know how Kroki works, but I could imagine that the uploaded JSON file is stored in a kind of sandbox anyway, so that the respective visualization library (Vega-Lite in my case) can easily work with it. I do not see an increased security risk if a CSV file is additionally saved.

When both files are transferred to the server, a client-side conversion from CSV (or whatever format) to inlined JSON would not be needed, but the existing functionality of Vega-Lite could be used.

If this solution could be implemented, it would be a generic solution for all visualization libraries where files can be referenced (e.g. PlantUML).

So if that were feasible, there would still be the question of how to decide which files to upload in addition to the referencing file. But there is certainly a solution for this (e.g. by explicitly specifying the files in the AsciiDoc macro, or by diagram-type-specific parsing of the referencing file).

This sounds like a lot of work, but could also open up some new possibilities. What do you think?

@Mogztter We once talked about an asciidoctor.js extension especially for Vega-Lite. Meanwhile, I find the Kroki approach much easier for the end user, and it has a lot of potential if problems such as referenced files can be solved for Kroki. If this idea is feasible and I can help, please let me know.

> In the case of Vega-Lite, if you want your data in CSV format, a client-side preprocessing task unfortunately cannot simply inline (embed or replace) the data file, since the format must be changed from CSV to JSON. In fact, the referenced file can have multiple formats ("json", "csv", "tsv", "dsv") and can even be interpreted by a "parse" object (https://vega.github.io/vega-lite/docs/data.html#format).

I think that's not an issue because you can inline the data, see: https://vega.github.io/vega-lite/docs/data.html#inline

{
  "data": {
    "values": "a\n1\n2\n3\n4",
    "format": {
      "type": "csv"
    }
  },
  "mark": "point",
  "encoding": {
    "y": {"field": "a", "type": "quantitative"}
  }
}

> In fact, I had originally hoped that a generic "upload of referenced files" or "upload of multiple files" solution could be implemented. What exactly would be the problem with uploading the referenced data file (in my example "data.csv") together with the referencing source file (the Vega-Lite JSON view specification, https://vega.github.io/vega-lite/docs/spec.html) to the server? If this solution could be implemented, it would be a generic solution for all visualization libraries where files can be referenced (e.g. PlantUML).

I don't want to implement a multipart form request because:

* It won't work with GET requests
* The API will be more complicated
* I don't want to handle files server-side (the less we rely on the disk the better)
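To illustrate the GET constraint, here is a small Python sketch of how a diagram source is packed into a Kroki GET URL, following the deflate plus URL-safe Base64 encoding described in the Kroki documentation. The function name `kroki_get_url` and its default arguments are only illustrative.

```python
import base64
import zlib


def kroki_get_url(source, diagram_type="vegalite", output_format="svg",
                  server="https://kroki.io"):
    """Build a Kroki GET URL for a diagram source (illustrative helper)."""
    # Compress the diagram text with zlib/deflate at the highest level ...
    compressed = zlib.compress(source.encode("utf-8"), 9)
    # ... then encode it with the URL-safe Base64 alphabet.
    payload = base64.urlsafe_b64encode(compressed).decode("ascii")
    return f"{server}/{diagram_type}/{output_format}/{payload}"
```

The whole diagram travels as a single path segment, so a GET request has no obvious place for a second file such as "data.csv". Anything the diagram needs has to be part of the source text itself.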

> I don't know how Kroki works, but I could imagine that the uploaded JSON file is stored in a kind of sandbox anyway, so that the respective visualization library (Vega-Lite in my case) can easily work with it. I do not see an increased security risk if a CSV file is additionally saved.

It's not really a security concern, even though I want to disallow the containers from writing to disk (because they don't have to). I just don't want to do additional work on the server side.

> When both files are transferred to the server, a client-side conversion from CSV (or whatever format) to inlined JSON would not be needed, but the existing functionality of Vega-Lite could be used.

I strongly believe that a diagram library should not care about reading/including files. It's a nice feature but they should take an input and produce an image, that should be their main focus. Fortunately, Vega-Lite is well-designed and it's possible to inline data so all is good :+1:

> We once talked about an asciidoctor.js extension especially for Vega-Lite. Meanwhile, I find the Kroki approach much easier for the end user, and it has a lot of potential if problems such as referenced files can be solved for Kroki. If this idea is feasible and I can help, please let me know.

Indeed! @stenzengel If you want to help, you can try to implement what I described above:

* parse the Vega/Vega-Lite input (I think they are using json5)
* extract `data.url`
* read the file content
* replace `data.url` with the inline data `data.values`
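The steps above could be sketched roughly as follows. This is only a language-neutral illustration in Python, not the extension's actual code (asciidoctor-kroki itself is JavaScript). The function name `inline_data_url` is hypothetical, and `read_file` is a hypothetical callable standing in for whatever file access the extension provides. The sketch also assumes strict JSON input rather than JSON5 and a top-level `data` object only.

```python
import json


def inline_data_url(spec_text, read_file):
    """Replace ``data.url`` with ``data.values`` in a Vega-Lite spec.

    ``read_file`` is a hypothetical callable (path -> str).
    Returns the spec text unchanged when there is nothing to inline.
    """
    spec = json.loads(spec_text)  # assumption: strict JSON, not JSON5
    data = spec.get("data")
    if not isinstance(data, dict) or "url" not in data:
        return spec_text

    url = data.pop("url")          # step 2: extract data.url
    content = read_file(url)       # step 3: read the file content

    # Decide how the content must be interpreted once it is inlined.
    fmt = dict(data.get("format", {}))
    if "type" not in fmt:
        ext = url.rsplit(".", 1)[-1].lower() if "." in url else ""
        fmt["type"] = ext if ext in ("csv", "tsv", "dsv") else "json"

    # step 4: replace data.url with data.values
    if fmt["type"] == "json":
        data["values"] = json.loads(content)
    else:
        # Delimited text is inlined as a string; the format tells
        # Vega-Lite how to parse it.
        data["values"] = content
        data["format"] = fmt
    return json.dumps(spec)
```

For example, with a spec containing `"data": {"url": "data.csv"}` and a `read_file` that returns `"a\n1\n2\n3\n4"`, the result carries the CSV text in `data.values` together with `"format": {"type": "csv"}`, which matches the inline form shown earlier in this thread.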

For PlantUML the approach is similar, except we are looking for include directives. We are using an interface (called Virtual File System) to read files from different contexts (Antora, Browser, Node), so you will probably need to use the vfs.read function to get the file content.

Let me know if you have any questions.

> I think that's not an issue because you can inline the data, see: https://vega.github.io/vega-lite/docs/data.html#inline

I didn't know that. That's cool!

> I don't want to implement a multipart form request because:
>
> * It won't work with GET requests
> * The API will be more complicated
> * I don't want to handle files server-side (the less we rely on the disk the better)

I'm convinced. Your proposition is the right way to go.

> It's not really a security concern, even though I want to disallow the containers from writing to disk (because they don't have to). I just don't want to do additional work on the server side.

Yes, KISS.

> I strongly believe that a diagram library should not care about reading/including files. It's a nice feature but they should take an input and produce an image, that should be their main focus. Fortunately, Vega-Lite is well-designed and it's possible to inline data so all is good 👍

Yes and I agree that Vega-Lite is very nice.

> Indeed! @stenzengel If you want to help, you can try to implement what I described above:
>
> * parse the Vega/Vega-Lite input (I think they are using json5)
> * extract `data.url`
> * read the file content
> * replace `data.url` with the inline data `data.values`
>
> For PlantUML the approach is similar, except we are looking for include directives. We are using an interface (called Virtual File System) to read files from different contexts (Antora, Browser, Node), so you will probably need to use the vfs.read function to get the file content.
>
> Let me know if you have any questions.

I'll try and I will certainly have questions. Thanks for your offer to help.

asciidoctor / asciidoctor-kroki

Support separate but local data files via data.url property for Vega-Lite #53
