Coleridge-Initiative / rclc

Rich Context leaderboard competition, including the corpus and current SOTA for required tasks.
https://coleridgeinitiative.org/richcontext
Creative Commons Zero v1.0 Universal

Replace SPv1 with a better PDF parser #10

Closed ceteri closed 4 years ago

ceteri commented 4 years ago

Rework the pipeline, after PDF download, so that text gets extracted in a semi-structured way.

Some options to evaluate:

  • SPv2 (Science Parse v2)
  • GROBID
  • Parsr

If a package has any dependencies on the JVM, then it's best to containerize that part of the workflow with Docker; we don't want to be managing JVM apps directly. As much as possible, use existing images from DockerHub instead of creating new ones.

In the case of grobid there are already several images on DockerHub: https://hub.docker.com/search?q=grobid&type=image
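For example, one of those community images can be pulled and run locally (the image name and tag here are an assumption based on that search; GROBID serves on port 8070 by default):

docker pull lfoppiano/grobid:0.6.1

docker run --rm -p 8070:8070 lfoppiano/grobid:0.6.1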

ceteri commented 4 years ago
  1. Whoever works on this issue, please talk with @philipskokoh to compare notes

  2. Then update instructions for launch and use in Downloading-Resource-Files

philipskokoh commented 4 years ago

Hi @ceteri, I think spv2 is not a good option since it's no longer being maintained (deprecated). The current spv2 only parses title, authors, and bibliographies (see this issue). It also skews toward medical documents (see https://github.com/allenai/spv2/issues/42#issuecomment-444665509).

Another option that might be better is grobid.

ceteri commented 4 years ago

Many thanks @philipskokoh !

ceteri commented 4 years ago

@JasonZhangzy1757:

  • I've been evaluating the use of Parsr, and I now have the code running and can generate raw text and JSON from a PDF.
  • Parsr provides a UI and a Python client, so I think it will not be hard for us to use.
  • What kind of evaluation metrics do you think we should apply before deciding which parser is simpler to use?

Great! I'll describe evaluation metrics in a follow-up comment here.

For next steps:

  1. please add to the comments on this issue some sample code for calling Parsr as a Python client to produce JSON
  2. then, also in the comments, show a URL for an example PDF and attach the JSON file produced by Parsr as a sample
  3. write a Py script to scan the list of PDFs in the resources/pub/pdf subdirectory of this repo
  4. then have that Py script use Parsr to produce JSON files in resources/pub/json -- replacing what is described as our previous approach

ceteri commented 4 years ago

Evaluation Metrics

As long as Parsr is not difficult to use and produces the results as claimed, it'll provide a better approach than the other candidates -- until something better emerges:

  1. it's a Python library, which is much simpler to integrate than a JVM-based approach or separate services (based on Docker, etc.)
  2. the resulting JSON is better than what most PDF text extraction produces
  3. this approach decouples the PDF text extraction from the ML classification of sections, unlike SPv1 or SPv2, which conflated the two (unless I've misunderstood any points)

That will at least get us to the point of having reasonably good semi-structured text from PDFs, so that tools such as PyTextRank and other lightweight entity linking approaches can be used to extract key phrases from the papers in the corpus. That's needed for extending our workflow in RCGraph, see: https://github.com/Coleridge-Initiative/RCGraph/issues/40
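As a rough sketch of that downstream step (not part of this repo yet; this assumes spaCy 3.x with pytextrank 3.x installed, and the sample sentence is made up):

import spacy
import pytextrank  # registers the "textrank" pipeline component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

# in practice the text would come from the Parsr output for a paper
doc = nlp("We analyze longitudinal employment data from the LEHD program.")

# top-ranked key phrases, with their TextRank scores
for phrase in doc._.phrases[:5]:
    print(phrase.text, phrase.rank)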

A subsequent project would be to take the JSON produced by Parsr and train ML models to classify sections of a research paper. In other words, reproducing the intent of SPv2 but updated for our needs.

I have a strong hunch that some of the more lightweight modeling would be a good fit, perhaps a BiLSTM, and also that we may engage Recognai on a contract for that :)

JasonZhangzy1757 commented 4 years ago

Hi @ceteri

Here is some sample code using the Python client to produce JSON:

# Module Import
from parsr_client import ParserClient
import json

# Initialize the Client Object
parsr = ParserClient('localhost:3001')

# Send Document for Processing
job = parsr.send_document(
    file='./sampleFile.pdf',
    config='./sampleConfig.json',
    document_name='Sample File2',
    wait_till_finished=True,
    save_request_id=True,
)

# Get the Full JSON Output
with open("./sample.json", 'w', encoding="utf-8") as outfile:
    json.dump(parsr.get_json(), outfile, indent=2, ensure_ascii=False)

And here is a random example PDF:

URL: https://www.kidney-international.org/article/S0085-2538(15)50504-8/pdf

# Interpret the Parsr JSON output
from parsr_output_interpreter import ParsrOutputInterpreter

# extract the plain text of page 1 from the parsed document
parsr_interpreter = ParsrOutputInterpreter(parsr.get_json())
t = parsr_interpreter.get_text(page_number=1)

Output: https://github.com/JasonZhangzy1757/Parsr/blob/master/demo/jupyter-notebook/Output.txt

I would say this looks quite promising :)

If you think it's OK, then I'll start working on the Py scripts.

ceteri commented 4 years ago

Nice work!! That looks great.

Yes, let's move ahead with the Py scripts.

JasonZhangzy1757 commented 4 years ago

Hi @ceteri

Here is a simple version of the script. The script should be placed in the bin/ directory together with its dependency parsr_client.py. It seems to work well.

URL: https://github.com/JasonZhangzy1757/rclc/blob/master/bin/parsr.py
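For reference, the core loop is roughly the following (a simplified sketch; the script at the URL above is authoritative, and the config path mirrors the earlier sample):

import json
from pathlib import Path

from parsr_client import ParserClient

parsr = ParserClient('localhost:3001')

# scan the PDFs in resources/pub/pdf and emit JSON into resources/pub/json
for pdf_path in sorted(Path('resources/pub/pdf').glob('*.pdf')):
    parsr.send_document(
        file=str(pdf_path),
        config='./sampleConfig.json',
        document_name=pdf_path.stem,
        wait_till_finished=True,
        save_request_id=True,
    )

    json_path = Path('resources/pub/json') / f'{pdf_path.stem}.json'
    with open(json_path, 'w', encoding='utf-8') as outfile:
        json.dump(parsr.get_json(), outfile, indent=2, ensure_ascii=False)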

So, to summarize the entire procedure:

Extracting text from PDFs

We use Parsr to extract text and JSON from research publications. The quickest way to install and run the Parsr API is through its Docker image:

docker pull axarev/parsr

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

The advanced guide is available in the Parsr documentation.

Then run the parsr.py script to extract text and JSON from the PDF files:

python bin/parsr.py

The outputs will be saved in the json and text folders. It might be quite time-consuming to run, though. Perhaps we should also upload the results to an AWS S3 bucket.
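If we do, a minimal upload sketch using boto3 could look like this (the bucket name here is a placeholder, not the project's actual bucket):

import boto3
from pathlib import Path

s3 = boto3.client('s3')

# push each Parsr JSON file under a json/ prefix in the bucket
for json_path in Path('resources/pub/json').glob('*.json'):
    s3.upload_file(str(json_path), 'example-rclc-bucket', f'json/{json_path.name}')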

ceteri commented 4 years ago

This looks good @JasonZhangzy1757 !

Go ahead and create a PR for this. And yes, the next step in this pipeline does upload the JSON to an S3 bucket: https://github.com/Coleridge-Initiative/rclc/wiki/Downloading-Resource-Files#upload-pdf-and-json-files

I'll create an issue for the next stage to add to this pipeline, which is to run phrase extraction from this JSON output here.

ceteri commented 4 years ago

Here's a next stage to build: https://github.com/Coleridge-Initiative/rclc/issues/20

JasonZhangzy1757 commented 4 years ago

It seems GitHub doesn't support pull requests for the wiki repository. https://stackoverflow.com/questions/10642928/how-to-pull-request-a-wiki-page-on-github

So I have edited the wiki page directly.

ceteri commented 4 years ago

great!