Whoever works on this issue, please talk with @philipskokoh to compare notes. Then update the instructions for launch and use in Downloading-Resource-Files.
Hi @ceteri, I think spv2 is not a good option since it's no longer being maintained (deprecated). The current spv2 only parses title, authors, and bibliographies (see this issue). It also skews toward medical documents (see https://github.com/allenai/spv2/issues/42#issuecomment-444665509).
Another option that might be better is grobid.
Many thanks @philipskokoh !
@JasonZhangzy1757:
- I have been evaluating use of Parsr and now have the code running; it can generate raw text and JSON from PDF. Parsr provides a UI and a Python client, so I think it will not be hard for us to use.
- What kind of evaluation metrics do you think we should apply before deciding which parser is simpler to use?
Great! I'll describe the evaluation metrics in a follow-up comment here.
For next steps:
- run Parsr on the PDFs in the resources/pub/pdf subdirectory of this repo as a sample
- use Parsr to produce JSON files in resources/pub/json -- replacing what is described as our previous approach

As long as Parsr is not difficult to use and it produces the results as claimed, then it'll provide a better approach than the other candidates -- until something better emerges. It keeps the text extraction separate from the ML modeling of paper structure, unlike SPv1 or SPv2, which conflated those (unless I've misunderstood any points).

That will at least get us to the point of having reasonably good semi-structured text from PDFs, so that tools such as PyTextRank and other lightweight entity linking approaches can be used to extract key phrases from the papers in the corpus. That's needed for extending our workflow in RCGraph, see: https://github.com/Coleridge-Initiative/RCGraph/issues/40
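As a hedged illustration of that next step, here is a minimal PyTextRank sketch for pulling key phrases out of the extracted text; it assumes spaCy 3.x with the en_core_web_sm model installed, and sample.txt is a hypothetical file holding text produced by Parsr:

# minimal sketch: key phrase extraction with PyTextRank over text produced by Parsr
# assumes spaCy 3.x plus the en_core_web_sm model; "sample.txt" is a hypothetical file
import spacy
import pytextrank  # noqa: F401 -- registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

with open("sample.txt", "r", encoding="utf-8") as f:
    doc = nlp(f.read())

# print the top-ranked phrases along with their TextRank scores
for phrase in doc._.phrases[:10]:
    print(round(phrase.rank, 4), phrase.text)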
A subsequent project would be to take the JSON produced by Parsr and train ML models to classify sections of a research paper. In other words, reproducing the intent of SPv2 but updated for our needs.
I have a strong hunch that some of the more lightweight modeling approaches would be a good fit, perhaps a BiLSTM, and also that we may engage Recognai with a contract for that :)
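To make that hunch slightly more concrete, here is a hypothetical PyTorch sketch of a BiLSTM that labels a block of text from the Parsr JSON with a section type; the vocabulary size, dimensions, and label count are placeholders, and nothing here is part of the current pipeline:

# hypothetical sketch only: a BiLSTM section classifier, not part of the current pipeline
# vocab size, embedding/hidden dims, and the label count are placeholder values
import torch
import torch.nn as nn

class SectionClassifier(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=128, hidden_dim=64, n_labels=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded tokens for one text block
        embedded = self.embed(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # concatenate the final forward and backward hidden states
        pooled = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.out(pooled)  # (batch, n_labels) logits

# e.g. logits = SectionClassifier()(torch.randint(1, 20000, (8, 256)))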
# Module Import
from parsr_client import ParserClient
import json

# Initialize the Client Object
parsr = ParserClient('localhost:3001')

# Send Document for Processing
job = parsr.send_document(
    file='./sampleFile.pdf',
    config='./sampleConfig.json',
    document_name='Sample File2',
    wait_till_finished=True,
    save_request_id=True,
)

# Get the Full JSON Output
with open("./sample.json", 'w', encoding="utf-8") as outfile:
    json.dump(parsr.get_json(), outfile, indent=2, ensure_ascii=False)
URL: https://www.kidney-international.org/article/S0085-2538(15)50504-8/pdf
To pull raw text back out of that output, Parsr's ParsrOutputInterpreter can be used:

from parsr_output_interpreter import ParsrOutputInterpreter

parsr_interpreter = ParsrOutputInterpreter(parsr.get_json())
t = parsr_interpreter.get_text(page_number=1)
Output: https://github.com/JasonZhangzy1757/Parsr/blob/master/demo/jupyter-notebook/Output.txt
I would say this looks quite promising :)
If you think it's OK then I'll start working on py scripts.
Nice work!! That looks great.
Yes, let's move ahead to Py scripts.
Here is a simple version of the script. The script should be placed in the bin/ directory together with its dependent package parsr_client.py. It seems to work well.
URL: https://github.com/JasonZhangzy1757/rclc/blob/master/bin/parsr.py
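For readers who don't follow the link, a rough sketch of what such a batch script might look like, based only on the description in this thread, is below; the directory names and config file are assumptions, and the linked bin/parsr.py is the authoritative version:

# rough sketch of a batch script; the linked bin/parsr.py is authoritative
# paths and the config name here are assumptions based on the discussion above
import json
from pathlib import Path
from parsr_client import ParserClient
from parsr_output_interpreter import ParsrOutputInterpreter

PDF_DIR = Path("resources/pub/pdf")
JSON_DIR = Path("json")
TEXT_DIR = Path("text")

def main():
    parsr = ParserClient('localhost:3001')
    JSON_DIR.mkdir(parents=True, exist_ok=True)
    TEXT_DIR.mkdir(parents=True, exist_ok=True)

    for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
        # send each PDF to the Parsr API and wait until it has been processed
        parsr.send_document(
            file=str(pdf_path),
            config="./sampleConfig.json",
            document_name=pdf_path.stem,
            wait_till_finished=True,
            save_request_id=True,
        )

        output = parsr.get_json()

        with open(JSON_DIR / (pdf_path.stem + ".json"), "w", encoding="utf-8") as f:
            json.dump(output, f, indent=2, ensure_ascii=False)

        # assumption: calling get_text() without a page_number returns text for all pages
        text = ParsrOutputInterpreter(output).get_text()
        with open(TEXT_DIR / (pdf_path.stem + ".txt"), "w", encoding="utf-8") as f:
            f.write(text)

if __name__ == "__main__":
    main()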
So, to summarize the entire procedure:
We use Parsr to extract text and JSON from research publications. The quickest way to install and run the Parsr API is through the Docker image:
docker pull axarev/parsr
To run the API, issue:
docker run -p 3001:3001 axarev/parsr
-- The advanced guide is available here. --
Then run the parsr.py script to extract text and JSON from the PDF files:

python bin/parsr.py

The outputs will be saved in the json and text folders. It might be quite time-consuming, though. Perhaps we should also upload the outputs to an AWS S3 bucket.
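For the S3 upload, a rough boto3 sketch might look like the following; the bucket name and key prefix are placeholders rather than the project's actual settings:

# rough sketch: upload the generated JSON files to S3 with boto3
# the bucket name and key prefix are placeholders, not the project's actual settings
from pathlib import Path
import boto3

s3 = boto3.client("s3")
bucket = "rclc-example-bucket"  # placeholder

for json_path in sorted(Path("json").glob("*.json")):
    s3.upload_file(str(json_path), bucket, "pub/json/" + json_path.name)
    print("uploaded", json_path.name)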
This looks good @JasonZhangzy1757 !
Go ahead and create a PR for this. And yes, the next step in this pipeline does upload the JSON to an S3 bucket https://github.com/Coleridge-Initiative/rclc/wiki/Downloading-Resource-Files#upload-pdf-and-json-files
I'll create an issue for the next stage to add to this pipeline, which is to run phrase extraction from this JSON output here.
Here's a next stage to build: https://github.com/Coleridge-Initiative/rclc/issues/20
It seems GitHub doesn't support pull requests for the wiki repository. https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github
So I have edited the wiki page directly.
great!
Rework the pipeline, after PDF download, so that text gets extracted in a semi-structured way.
Some options to evaluate:
- Parsr https://github.com/axa-group/Parsr
- grobid https://github.com/kermitt2/grobid

If a package has any dependencies on the JVM, then it's best to containerize that part of the workflow with Docker; we don't want to be managing JVM apps. As much as possible, use instances from DockerHub instead of creating new ones.
In the case of grobid, there are already several DockerHub instances: https://hub.docker.com/search?q=grobid&type=image

Which of these would fit better into the RCLC workflow?
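For what it's worth, one of those community grobid images can be pulled and run in much the same way as Parsr; the image name and tag below are only an example of what's on DockerHub, not a recommendation:

docker pull lfoppiano/grobid:0.6.1
docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.6.1

The grobid service then listens on port 8070 by default.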