Closed ca16 closed 3 years ago
@andrewhead thanks for the comments! They were helpful.
I've made a few updates based on your suggestions, and attempted to answer your questions.
I think the one still open question is around using Literal
instead of Enum
, from this comment.
hi @ca16 ! Awesome PR, I am already using these flags!!
Would it make sense to rename the extension of the output from .jsonl
to .json
, given that the file contains just one line and can parsed as single JSON object? Sorry, I should have thought to ask this before, though I only realized it when I started working with the file.
Yeah I think that's a good idea! I can make the change.
Related to https://github.com/allenai/scholar/issues/28994.
As part of getting citations info from scholarphi piped through the s2airs system, we want to get citations info into the annotation store. We can adapt the wrapper system (scholarphi-pipeline) to do this, if it has access to the output that the inner scholarphi code produces.
The simplest way that I found to do this while I was working on https://github.com/allenai/scholar/issues/28297 was to just dump the output as json into a file in a location provided by the caller.
This PR adapts the DatabaseUploadCommand a little, making it more of a 'save output' kind of thing (I haven't renamed it to try to keep this PR small, and avoid changing more than is necessary). The idea is the user can specify whether they want output saved to a db (what we've been doing up until now), saved to a file (what we want for s2airs), or both. To begin with though, only the citations extension of DatabaseUploadCommand will know how to actually save to a file. Other commands extending DatabaseUploadCommand should error out if the user tries to specify that they want output saved to a file for those. To get an idea of how we could extend this to all entities, check out this branch: https://github.com/allenai/scholarphi/compare/chi-2021-demo...chloea-try-producing-json-take-1 (note: it does have some extra stuff around pulling arxiv pdfs that isn't relevant to this, and the saving to file stuff is less neat, and the 'both' option doesn't exist, but hopefully it gets the idea around extending to other entities across).
The plan is that the existing scholarphi-pipeline system would use the 'both' option when processing in citations only mode, so that we can continue to support the current reader, and so that we can put stuff into the annotation store as we build out s2airs.
Testing done: I added some tests and ran pytest:
I ran one paper through using the file only format and just citations:
I ran it with the file only option and a non-citations entity (equations). It failed (as we would expect):
Note: this PR is for merging into the chi-2021-demo branch, as that is what the scholarphi-pipeline system is still running off of.