curl / trurl

trurl is a command line tool for URL parsing and manipulation.
https://curl.se/trurl/
Other
3.1k stars 99 forks source link

Add --json-file/-j option to read from file w/ json contents #311

Closed jacobmealey closed 1 week ago

jacobmealey commented 3 months ago

This adds --json-file as an option to read from a file that contains a json representation of the urls. It works very similarly to --url-file. The --json-file input structure is designed to match roughly what the output of --json produces. the simplest json object possible is:

[{ "parts": { "host": "example.com" }}]

You can pass an arbitrary number of these url JSON objects and perform operations on them like you would with any other url in trurl.

Something that I did that may be up for debate is how the query is handled. I chose to make the query only be able to be set by an array of "keys" and "values" (the "params" section of --json") and it will throw a warning if the "query" component of "parts" is set.

A few examples:

$ cat testfile/test004.txt
[
  {
    "parts": {
      "scheme": "ftp",
      "user": "scream",
      "host": "url.com",
      "path": "/",
      "query": "ignoredquery",
      "fragment": "a-fragment"
    },
    "params": [
      {
        "key": "query",
        "value": "pair"
      },
      {
        "key": "singlequery",
      }
    ]
  },
  {
    "parts": {
      "scheme": "http",
      "host": "example.org",
      "path": "/"
    }
  }
]
$ trurl -j testfiles/test0004.txt --set "fragment=a-new-fragment" 
trurl note: ignoring 'query', provide a separate 'params' array.
ftp://scream@url.com/?query=pair&singlequery#a-new-fragment
http://example.org/#a-new-fragment
$ trurl -j testfiles/test0004.txt --append "path=file"
trurl note: ignoring 'query', provide a separate 'params' array.
ftp://scream@url.com/file?query=pair&singlequery#a-fragment
http://example.org/file

Currently the feature is disabled by default, do turn it on you need to add TRURL_JSON_IN to your environment to enable it.

fixes: #291

jacobmealey commented 3 months ago

okay I need to figure out whats going on with the windows builds, the cygwin was was working on my local branch until very recently, might be something i changed on the way we are doing the file i/o. I will take a closer look this evening.

Edit: Okay, i got the cygwin build working. I just need to add libjson-c to the curl-for-win build. @vszakats would this be done by manually compiling and linking it in here: https://github.com/curl/curl-for-win/blob/main/trurl.sh ?

vszakats commented 3 months ago

It's a new dependency, kind of a big deal to add support for one in general. One approach is to make it a build option, disabled by default, so existing trurl builds (including curl-for-win) continue to work without modification. Another option is to delete/disable the curl-for-win CI job till I (or someone else) get around adding support for it. A 3rd one is to add an option to disable this feature, and I update curl-for-win to set it.

vszakats commented 3 months ago

...that said, have you thought of using a JSON parser that's a self-contained C89 source (with a fitting license)? It's still a dependency, but perhaps easier to integrate and carry.

jacobmealey commented 3 months ago

have you thought of using a JSON parser that's a self-contained C89 source

I was looking a lot at https://github.com/DaveGamble/cJSON, but I didn't like how it looks and json-c has some nice features. Both this and json-c are in vcpkg, freebsd repos, and most linux distro repositories so i was hoping it wouldn't be too much work either way for maintainers.

If folks want a smaller ANSI c json library instead of the one I chose, I think that's a fair request and I can start working this again.

for now, I will put it behind a compile time flag as you suggested

Edit: There are a ton of other libraries listed on json.org that could be used too.

jacobmealey commented 3 months ago

@bagder, i've made a few big modifications. Mainly I made it so it reads one json object at a time, so it now only allocates as much memory as required to parse the largest single json object, instead of attempting to read until the of the file. I also made it so it reads in a fixed buffer on the stack, and then allocates more data on the heap if need be as you suggested.

sevan commented 2 months ago

With libjson-c installed (I tested against v0.17), I was able to build your branch using make TRURL_JSON_IN=1 JSON_C_PREFIX=/bla

Worth setting JSON_C_PREFIX ?= /usr/local in Makefile so it only requires make TRURL_JSON_IN=1? (PREFIX also defaults to /usr/local)

bagder commented 1 week ago
  1. I posted an informal poll on mastodon where right now 74% says they think we should support JSON input.
  2. A concern I have is the JSON object: as there does not seem to be a consensus for a "URL object" in JSON, it seems reasonable to assume that other producers of JSON URLs will do it differently and therefore people will have to transpose the JSON object (using jq presumably) in order to get trurl to handle it. And if they must transpose, they could also just output a plain URL... and then, what value does JSON input add?
  3. This PR has merge conflicts atm
jacobmealey commented 1 week ago

Hello, sorry for letting this PR go stale. I have been thinking about this for a while. I agree with bagder and others that we don't get much from having JSON input. It just an easy feature to suggest. At first glance it makes sense that if we produce JSON we should be able to consume it, but as has been mentioned in this PR as well as the related issues, we wouldn't get much mileage out of it. The only real use case I can think of is using some external program to act as a complex filter / manipulator in between two trurl calls:

$ trurl -f some_urls.txt --json | a_very_complicated_filter.py | trurl -j - > filtered_urls.txt

but like has been said so many times this can be done by adding in a jq call to the pipeline.

Moving away from the user experience side into the technical implementation, I don't like these changes. I think the library I chose is really nice to use but it's overkill for this use case and a much simpler one should have been used. @bagder I think we should close this and open a discussion around it, i do not think this code should be merged in.

bagder commented 1 week ago

Thanks for being so open and flexible @jacobmealey, even after having spent the time doing this PR. Closing this now.

I remain unconvinced that JSON input is very useful because the reasons mentioned above.