axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0

simpleJson endpoint is not presented by API (or at least is difficult to find!) #497

Closed ben-lavelle closed 2 years ago

ben-lavelle commented 4 years ago

Summary

The simpleJson output described in the docs looks useful for simple use cases. However, it is not implemented in the Python client, and doesn't appear to be implemented in the API (or at least it beats me how to find it!).

I can't see any other issues or PRs that reference simpleJson; apologies if this is a duplicate (or straight-up wrong!)

Steps To Reproduce

This seems to happen regardless of which config file and input file are used, so the Jupyter notebook example with defaultConfig.json is sufficient to demonstrate it.

Steps to reproduce the behaviour:

  1. Run the Jupyter notebook example as here.
  2. Send a document, e.g. the sample document provided, with defaultConfig.json or any config with simpleJson output requested and check other outputs are as expected.
  3. Call e.g. parsr.get_status() and note that even if requested, the endpoint for simpleJson output is not listed in the response (example request hash):
{'request_id': '01cecaa5bd8624aa9d8a1d899dde73',
 'server_response': {'id': '01cecaa5bd8624aa9d8a1d899dde73',
  'json': '/api/v1/json/01cecaa5bd8624aa9d8a1d899dde73',
  'csv': '/api/v1/csv/01cecaa5bd8624aa9d8a1d899dde73',
  'text': '/api/v1/text/01cecaa5bd8624aa9d8a1d899dde73',
  'markdown': '/api/v1/markdown/01cecaa5bd8624aa9d8a1d899dde73'}}
  4. Visit other endpoints with e.g. (substituting request hash)
http://localhost:3001/api/v1/json/01cecaa5bd8624aa9d8a1d899dde73
  5. Find no such endpoint for /api/v1/simplejson or /api/v1/simple-json, /api/v1/simple/json etc.

  6. See in the API logs that the simpleJson output was in fact produced, e.g.

[2020-10-08T15:43:10] INFO  (parsr-api/7 on 9da727d13778): Exporting markdown...
[2020-10-08T15:43:10] INFO  (parsr-api/7 on 9da727d13778): Writing file: /opt/app-root/src/api/server/dist/output/sampleFile-01cecaa5bd8624aa9d8a1d899dde73/sampleFile.simple.json
[2020-10-08T15:43:11] INFO  (parsr-api/7 on 9da727d13778): Writing file: /opt/app-root/src/api/server/dist/output/sampleFile-01cecaa5bd8624aa9d8a1d899dde73/sampleFile.txt
[2020-10-08T15:43:11] INFO  (parsr-api/7 on 9da727d13778): Exporting markdown...
[2020-10-08T15:43:11] INFO  (parsr-api/7 on 9da727d13778): Writing file: /opt/app-root/src/api/server/dist/output/sampleFile-01cecaa5bd8624aa9d8a1d899dde73/sampleFile.md

but fails to be presented as a discoverable endpoint.
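
The check in step 3 can be scripted. This is a minimal sketch, assuming the status payload has the shape shown in the report above; the helper name `missing_outputs` and the use of `simpleJson` as the expected key are hypothetical and not part of the Parsr client.

```python
def missing_outputs(status, requested):
    """Return the requested output formats that have no endpoint in the status payload."""
    # Every key of server_response except "id" is treated as an advertised output.
    listed = set(status.get("server_response", {})) - {"id"}
    return [fmt for fmt in requested if fmt not in listed]


# Status payload copied from the report above.
status = {
    "request_id": "01cecaa5bd8624aa9d8a1d899dde73",
    "server_response": {
        "id": "01cecaa5bd8624aa9d8a1d899dde73",
        "json": "/api/v1/json/01cecaa5bd8624aa9d8a1d899dde73",
        "csv": "/api/v1/csv/01cecaa5bd8624aa9d8a1d899dde73",
        "text": "/api/v1/text/01cecaa5bd8624aa9d8a1d899dde73",
        "markdown": "/api/v1/markdown/01cecaa5bd8624aa9d8a1d899dde73",
    },
}

# Formats requested in the config ("simpleJson" is the assumed key name).
requested = ["json", "csv", "text", "markdown", "simpleJson"]
print(missing_outputs(status, requested))  # prints ['simpleJson']
```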

Expected behavior

An endpoint should be available for the useful simpleJson output. It'd also be nice to add a get_simpleJson() method to ParsrClient or add parsr.get_json([simple: bool=False]) option to the existing full JSON method.
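
For illustration, here is a rough sketch of what such a helper could look like outside the client, using only the standard library. It assumes the endpoint follows the same pattern as the other outputs and is named `simple-json` (the route name that comes up later in this thread); the function names `simple_json_url` and `get_simple_json` are hypothetical and do not exist in ParsrClient.

```python
import json
from urllib.request import urlopen


def simple_json_url(server, request_id):
    """Build the URL of the (assumed) simple-json endpoint for one request."""
    return "http://{}/api/v1/simple-json/{}".format(server, request_id)


def get_simple_json(server, request_id, timeout=30):
    """Fetch and decode the simple JSON output.

    urlopen raises urllib.error.HTTPError if the server answers 404,
    which is what happens when the route is not available.
    """
    with urlopen(simple_json_url(server, request_id), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a local server this would be called as `get_simple_json("localhost:3001", "01cecaa5bd8624aa9d8a1d899dde73")`, substituting the request hash.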

Actual behavior

No endpoint was found at any of the sensible addresses tried, and there was no apparent way of accessing the simpleJson output. Correspondingly, there is no way to access simpleJson from the Python client.

Environment

Additional context

I would be happy to put together the py client implementation as a PR if you think it is a useful feature, but my knowledge of the API in js is absolutely minimal so I'm not sure I can fix this part. For the same reason sorry if this is a daft issue and I'm missing something obvious!

agarwal-nitesh commented 2 years ago

@ben-lavelle Did you end up extracting the file from the location (container/host) or did you find a way to hook an endpoint that serves the simple JSON?

kleag commented 2 years ago

I found commit 01407377ff7cfef52132fa979b212e4e5010614d which seems to implement the requested API with the url "simple-json". It was from Dec. 21, so I supposed that the docker image axarev/parsr:master would contain it. Unfortunately, when I execute curl -X GET http://localhost:3001/api/v1/simple-json/51966c63fe06c7472add19bbbdb6c5, I still get:

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Error</title>
</head>
<body>
<pre>Cannot GET /api/v1/simple-json/51966c63fe06c7472add19bbbdb6c5</pre>
</body>
</html>

I suppose that there are still missing bits to implement the feature? @agarwal-nitesh, you are the one who committed the change. Do you have any insights about that? (I really have no knowledge of JavaScript, so I have difficulty pursuing the search.)

agarwal-nitesh commented 2 years ago

Hi @kleag. I built using the docker-compose-build file and the API worked, but some staged changes that are required were not part of the above commit. Here they are: https://github.com/axa-group/Parsr/pull/593. My apologies.

kleag commented 2 years ago

Don't apologize, @agarwal-nitesh. On the contrary, thanks a lot for your contribution and help! With your clone, I was able to build and use the Docker Compose images.

I had to copy ./docker/parsr-base/policy.xml to the root of the project and to create a user with UID 1001 on my system (my user is 1000) to make it work.

I also had to add "simpleJson":true in the config.json file. It would be more consistent if the key were simple-json, as in the endpoint. The status message should also contain an entry for the simple-json result.

Maybe the two points above should in fact be new issues.
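
For anyone scripting the config change mentioned above, a small sketch follows. The comment only states that "simpleJson":true was added to config.json; placing it under `output.formats` is an assumption about the config layout, and the helper name `enable_simple_json` is hypothetical.

```python
import json


def enable_simple_json(config):
    """Return a copy of a Parsr config dict with the simpleJson output switched on."""
    # Deep copy via a JSON round trip so the caller's dict is left untouched.
    updated = json.loads(json.dumps(config))
    # Assumed location of the flag: output -> formats -> simpleJson.
    updated.setdefault("output", {}).setdefault("formats", {})["simpleJson"] = True
    return updated


# Toy config for illustration only, not the real defaultConfig.json.
config = {"output": {"formats": {"json": True, "text": True}}}
print(json.dumps(enable_simple_json(config), indent=2))
```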

BinaryBrain commented 2 years ago

The endpoint has been added by @agarwal-nitesh in PR #593.