Coleridge-Initiative / rclc

Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
https://coleridgeinitiative.org/richcontext
Creative Commons Zero v1.0 Universal

Replace SPv1 with a better PDF parser #10

Closed ceteri closed 4 years ago

ceteri commented 4 years ago

Rework the pipeline, after PDF download, so that text gets extracted in a semi-structured way.

Some options to evaluate:

  • SPv2 (Science Parse v2)
  • GROBID
  • Parsr

If a package has any dependencies on the JVM, then it's best to containerize that part of the workflow with Docker; we don't want to be managing JVM apps directly. As much as possible, use existing images from DockerHub instead of creating new ones.

In the case of grobid there are already several images on DockerHub: https://hub.docker.com/search?q=grobid&type=image
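For example, one of those community images can be pulled and run locally (the image name and tag here are an assumption based on that search; GROBID serves on port 8070 by default):

docker pull lfoppiano/grobid:0.6.1

docker run --rm -p 8070:8070 lfoppiano/grobid:0.6.1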

ceteri commented 4 years ago
  1. Whoever works on this issue, please talk with @philipskokoh to compare notes

  2. Then update instructions for launch and use in Downloading-Resource-Files

philipskokoh commented 4 years ago

Hi @ceteri, I think spv2 is not a good option since it's no longer being maintained (deprecated). The current spv2 only parses title, authors, and bibliographies (see this issue). It also skews toward medical documents (see https://github.com/allenai/spv2/issues/42#issuecomment-444665509).

Another option that might be better is grobid.

ceteri commented 4 years ago

Many thanks @philipskokoh !

ceteri commented 4 years ago

@JasonZhangzy1757:

  • I've been evaluating the use of Parsr, and I now have the code running and can generate raw text and JSON from a PDF.
  • Parsr provides a UI and a Python client, so I think it will not be hard for us to use.
  • What kind of evaluation metrics do you think we should apply before deciding which parser is simpler to use?

Great! I'll describe evaluation metrics in a follow-up comment here.

For next steps:

  1. please add to the comments on this issue some sample code for calling Parsr as a Python client to produce JSON
  2. then, also in the comments, show a URL for an example PDF and attach the JSON file produced by Parsr as a sample
  3. write a Py script to scan the list of PDFs in the resources/pub/pdf subdirectory of this repo
  4. then have that Py script use Parsr to produce JSON files in resources/pub/json -- replacing what is described as our previous approach

ceteri commented 4 years ago

Evaluation Metrics

As long as Parsr is not difficult to use and produces the results as claimed, it'll provide a better approach than the other candidates -- until something better emerges:

  1. it's a Python library, which is much simpler to integrate than a JVM-based approach or separate services (based on Docker, etc.)
  2. the resulting JSON is better than what most PDF text extraction produces
  3. this approach decouples the PDF text extraction from the ML classification of sections, unlike SPv1 or SPv2, which conflated the two (unless I've misunderstood any points)

That will at least get us to the point of having reasonably good semi-structured text from PDFs, so that tools such as PyTextRank and other lightweight entity linking approaches can be used to extract key phrases from the papers in the corpus. That's needed for extending our workflow in RCGraph, see: https://github.com/Coleridge-Initiative/RCGraph/issues/40
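As a rough sketch of that downstream step (not part of this repo yet; this assumes spaCy 3.x with pytextrank 3.x installed, and the sample sentence is made up):

import spacy
import pytextrank  # registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

# in practice the text would come from the Parsr output for a paper
doc = nlp("We analyze longitudinal employment data from the LEHD program.")

# top-ranked key phrases, with their TextRank scores
for phrase in doc._.phrases[:5]:
    print(phrase.text, phrase.rank)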

A subsequent project would be to take the JSON produced by Parsr and train ML models to classify sections of a research paper. In other words, reproducing the intent of SPv2 but updated for our needs.

I have a strong hunch that some of the more lightweight modeling would be a good fit, perhaps a BiLSTM, and also that we may engage Recognai on a contract for that :)

JasonZhangzy1757 commented 4 years ago

Hi @ceteri

Here is some sample code using the Python client to produce JSON:

# Module Import
from parsr_client import ParserClient
import json

# Initialize the Client Object
parsr = ParserClient('localhost:3001')

# Send Document for Processing
job = parsr.send_document(
    file='./sampleFile.pdf',
    config='./sampleConfig.json',
    document_name='Sample File2',
    wait_till_finished=True,
    save_request_id=True,
)

# Get the Full JSON Output
with open("./sample.json", 'w', encoding="utf-8") as outfile:
    json.dump(parsr.get_json(), outfile, indent=2, ensure_ascii=False)

And here is a random example PDF:

URL: https://www.kidney-international.org/article/S0085-2538(15)50504-8/pdf

# Interpret the Parsr JSON output
from parsr_output_interpreter import ParsrOutputInterpreter

# extract the plain text of page 1 from the parsed document
parsr_interpreter = ParsrOutputInterpreter(parsr.get_json())
t = parsr_interpreter.get_text(page_number=1)

Output: https://github.com/JasonZhangzy1757/Parsr/blob/master/demo/jupyter-notebook/Output.txt

I would say this looks quite promising :)

If you think it's OK, then I'll start working on the Py scripts.

ceteri commented 4 years ago

Nice work!! That looks great.

Yes, let's move ahead with the Py scripts.

JasonZhangzy1757 commented 4 years ago

Hi @ceteri

Here is a simple version of the script. The script should be placed in the bin/ directory together with its dependency parsr_client.py. It seems to work well.

URL: https://github.com/JasonZhangzy1757/rclc/blob/master/bin/parsr.py
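For reference, the core loop is roughly the following (a simplified sketch; the script at the URL above is authoritative, and the config path mirrors the earlier sample):

import json
from pathlib import Path

from parsr_client import ParserClient

parsr = ParserClient('localhost:3001')

# scan the PDFs in resources/pub/pdf and emit JSON into resources/pub/json
for pdf_path in sorted(Path('resources/pub/pdf').glob('*.pdf')):
    parsr.send_document(
        file=str(pdf_path),
        config='./sampleConfig.json',
        document_name=pdf_path.stem,
        wait_till_finished=True,
        save_request_id=True,
    )

    json_path = Path('resources/pub/json') / f'{pdf_path.stem}.json'
    with open(json_path, 'w', encoding='utf-8') as outfile:
        json.dump(parsr.get_json(), outfile, indent=2, ensure_ascii=False)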

So, to summarize the entire procedure:

Extracting text from PDFs

We use Parsr to extract text and JSON from research publications. The quickest way to install and run the Parsr API is through its Docker image:

docker pull axarev/parsr

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

The advanced guide is available in the Parsr documentation.

Then run the parsr.py script to extract text and JSON from the PDF files:

python bin/parsr.py

The outputs will be saved in the json and text folders. It might be quite time-consuming to run, though. Perhaps we should also upload the results to an AWS S3 bucket.
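If we do, a minimal upload sketch using boto3 could look like this (the bucket name here is a placeholder, not the project's actual bucket):

import boto3
from pathlib import Path

s3 = boto3.client('s3')

# push each Parsr JSON file under a json/ prefix in the bucket
for json_path in Path('resources/pub/json').glob('*.json'):
    s3.upload_file(str(json_path), 'example-rclc-bucket', f'json/{json_path.name}')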

ceteri commented 4 years ago

This looks good @JasonZhangzy1757 !

Go ahead and create a PR for this. And yes, the next step in this pipeline does upload the JSON to an S3 bucket: https://github.com/Coleridge-Initiative/rclc/wiki/Downloading-Resource-Files#upload-pdf-and-json-files

I'll create an issue for the next stage to add to this pipeline, which is to run phrase extraction from this JSON output here.

ceteri commented 4 years ago

Here's a next stage to build: https://github.com/Coleridge-Initiative/rclc/issues/20

JasonZhangzy1757 commented 4 years ago

It seems GitHub doesn't support pull requests for the wiki repository. https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github

So I have edited the wiki page directly.

ceteri commented 4 years ago

great!