acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org

Extract abstracts from PDF #395

Open akoehn opened 5 years ago

akoehn commented 5 years ago

The Anthology currently only shows abstracts if there is an authoritative version in the XML. It would be nice if we could scrape the PDFs using some off-the-shelf software to extract the abstracts and dump them into a separate file (so as not to tamper with handcrafted information). Having abstracts on the web pages makes quickly searching through the literature much faster.

davidweichiang commented 5 years ago

I'm currently using Tika to extract author names from PDFs. It works very well on modern PDFs, but not so well on the older PDFs (roughly, 2000 and earlier). Unfortunately, it's also the older PDFs that lack abstracts.
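For reference, the basic extraction call through the tika-python wrapper is only a few lines (a sketch, not necessarily what I run; `paper.pdf` is a placeholder, and the metadata keys vary by PDF producer):

```python
# Sketch: extract text and metadata with the tika-python wrapper.
# Requires a Java runtime; `pip install tika`. The metadata keys
# ("Author", "title", ...) depend on how the PDF was produced.
from tika import parser

parsed = parser.from_file("paper.pdf")   # placeholder filename
text = parsed.get("content") or ""
meta = parsed.get("metadata") or {}

print(meta.get("Author"))   # often present for modern, born-digital PDFs
print(text[:300])           # the first lines usually contain title/authors
```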

mbollmann commented 4 years ago

Here's a file with automatically extracted abstracts.

I thought I'd try extracting abstracts from the ACL Anthology Reference Corpus. Concretely, I used the March 2016 version of the ParsCit XML and:

Some stats:

This process works well for many files, but also produces silly results in some cases. The most common problem appears to be the parser not correctly identifying the "Abstract" section. Still, maybe we could consider using this as a starting point?

EDIT: Here's the script I used.
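(In outline, the extraction amounts to something like the sketch below, not the actual script above; it assumes the ParsHed output wraps the abstract in an `<abstract>` element.)

```python
# Sketch of pulling abstracts out of ParsCit XML output. Assumes the
# ParsHed section of each file contains an <abstract> element; real
# ParsCit output may need more careful handling.
from pathlib import Path
from xml.etree import ElementTree as ET

def parscit_abstract(xml_path):
    root = ET.parse(xml_path).getroot()
    node = root.find(".//abstract")          # first <abstract> in the tree
    if node is not None and node.text:
        return " ".join(node.text.split())   # collapse line breaks
    return None

for f in sorted(Path("parscit-xml").glob("*.xml")):  # placeholder directory
    abstract = parscit_abstract(f)
    if abstract:
        print(f.stem, abstract[:80])
```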

davidweichiang commented 4 years ago

Do we know how ParsCit compares with GROBID?

mbollmann commented 4 years ago

I don't; maybe @knmnyn knows?

I briefly tried Tika on a couple of cases that my extraction process got wrong, and it handled them better. Maybe we could combine pipelines – run Tika on all our PDFs, and if it matches what we get from the ACL ARC, treat it as trustworthy enough to add it; manually check only the remaining cases.
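The agreement check itself could be as simple as a fuzzy string comparison, e.g. this sketch (the 0.9 threshold is a made-up placeholder, not a tuned value):

```python
# Sketch: trust an abstract only if two independently extracted versions
# are near-identical after whitespace/case normalization.
from difflib import SequenceMatcher

def normalize(s):
    return " ".join(s.lower().split())

def agree(tika_version, arc_version, threshold=0.9):
    ratio = SequenceMatcher(None, normalize(tika_version),
                            normalize(arc_version)).ratio()
    return ratio >= threshold   # False -> flag for manual checking
```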

knmnyn commented 4 years ago

@mbollmann @davidweichiang: GROBID is much more functional than my group's legacy tool, being able to ingest PDFs natively. My group is trying to catch up and build a Tika pipeline for feeding data into a NN (word+char embedder + BiLSTM + CRF) pipeline for extraction. Any hints would be welcome, and my group could definitely put some effort towards this task! Definitely of mutual interest!

abhinavkashyap commented 4 years ago

Hello everyone. I am Abhinav Ramesh Kashyap, a PhD student at NUS with Prof Min. As he mentioned in the previous post, we have been working on reliable pipelines for scientific document processing, and we have a framework called SciWING. You can check out SciWING at sciwing.io.

I have developed a solution to extract abstracts from PDFs. It reads a PDF using PDFBox and classifies the lines of the document (with a GloVe + ELMo + BiLSTM network for now). I have attached a screenshot here; the other screenshots are available at https://github.com/abhinavkashyap/sciwing/tree/master/screenshots/acl_anthology_abstracts.

Please let us know how we can further our efforts to help the ACL Anthology. I am trying to understand the specifics of this issue.

SAMPLE OUTPUT

ACL ANTHOLOGY PAPER: https://www.aclweb.org/anthology/W19-4505/

ABSTRACT

In this work we propose to leverage resources available with discourse-level annotations to facilitate the identification of argumentative components and relations in scientific texts, which has been recognized as a particularly challenging task. In particular, we implement and evaluate a transfer learning approach in which contextualized representations learned from discourse parsing tasks are used as input of argument mining models. As a pilot application, we explore the feasibility of using automatically identified argumentative components and relations to predict the acceptance of papers in computer science venues. In order to conduct our experiments, we propose an annotation scheme for argumentative units and relations and use it to enrich an existing corpus with an argumentation layer.1

N19-1182

akoehn commented 4 years ago

Great! We need plaintext abstracts for the papers that have no abstract in the XML file.

For example, I could provide you with a list of PDF URLs that need an abstract, and you could provide a plaintext abstract for each of them.

abhinavkashyap commented 4 years ago

Okay, sure. Please send me the files; I will run them through the system and provide the plain-text abstracts 👍

akoehn commented 4 years ago

To re-create this file, use this command in the data/xml directory:

```bash
xmlstarlet sel -t -m '//paper[not(abstract)]' -v $'concat(url, "\n")' *.xml | sed '/http/! s|\(.*\)|http://www.aclweb.org/anthology/\1.pdf|' > no-abstract.txt
```

These are about 40k files, so you may not want to run all of them at once ... no-abstract.txt

abhinavkashyap commented 4 years ago

Thank you for this! Will try this and get back to you soon.

abhinavkashyap commented 4 years ago

Update

Hi @akoehn, I have just run a hundred PDFs through our system and am attaching the abstracts here. Overall, the abstracts look okay. However, I encountered some problems with these PDFs.

I am attaching the extracted abstracts and a log file here. Please let me know if this is satisfactory and I will run through the other pdfs.

In the meantime, I can discuss with you and @knmnyn how to annotate more data and make the system more robust.

abstracts.zip

mbollmann commented 4 years ago

First of all, thanks for offering your help @abhinavkashyap! That SciWING pipeline looks really cool.

I've clicked through a few of the abstracts and observed that several of those looked better in the file I generated from ACL-ARC, e.g. A00-1002, A00-1008, A00-1013, ...

That said, the files that you picked are also among the more challenging ones, I'd think. It would be interesting to look at some results for P16-* papers, for example, which should be easier to extract text from (since they're almost all LaTeX-generated) and are also missing abstracts in the Anthology.

abhinavkashyap commented 4 years ago

Thanks for this, @mbollmann, and thanks for the suggestion to run it on the P16-* series of papers. I will give that a try and let you guys know.

mbollmann commented 4 years ago

@abhinavkashyap, do you just run the PDFs through SciWING to get the abstracts, or is there more pre-/post-processing involved? I'm asking because I now have a simple pipeline of manually written heuristics to detect the abstract in Tika output (which I started working on before you offered your help), and am wondering if there's potential to pool our resources to get the best result possible.

abhinavkashyap commented 4 years ago

Hi @mbollmann. I just read the PDF and run SciWING on it. I check for a section header named "abstract" and continue to collect all the lines until I find another section header. There is not much post-processing done either; I remove any hyphenation at the end of a line, though in no intelligent way. It would be good if we could pool our methods to get the abstracts for all 40k PDFs.
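In code, that collection step is roughly the following (a sketch; the label names are made up, and SciWING's actual line labels may differ):

```python
# Sketch of the collection step: given (label, text) pairs from the line
# classifier, keep everything between the "abstract" header and the next
# section header. "section_header" is a hypothetical label name.
def collect_abstract(labeled_lines):
    collected, in_abstract = [], False
    for label, text in labeled_lines:
        if label == "section_header":
            if text.strip().lower() == "abstract":
                in_abstract = True
            elif in_abstract:
                break                      # next header ends the abstract
        elif in_abstract:
            collected.append(text.strip())
    return " ".join(collected)
```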

I also saw your approach of using the ACL ARC, which is labelled by Neural ParsCit. I think I will use the ACL ARC data to further train SciWING. Right now SciWING is trained on a very small amount of hand-annotated data; the ACL ARC dataset can serve as pseudo-labels to improve the performance.

abhinavkashyap commented 4 years ago

@mbollmann @akoehn: I ran the system on all P16-* papers as suggested by @mbollmann. I am attaching the extracted abstracts here. It took around 10 seconds on the GPU to extract one abstract and around 1 hour to extract everything. All the abstracts look pretty okay. Do let me know your thoughts. Thanks!

https://drive.google.com/file/d/17lL_sh0ylj0yLr39DXUWtzHZgYe3_My3/view?usp=sharing

knmnyn commented 4 years ago

Thanks, Abhinav. Did you see any problems with the extraction?

abhinavkashyap commented 4 years ago

Hello Prof Min. I didn't encounter any computational problems with the P16-* papers. The extracted abstracts look okay for now. The problem is with the older PDFs. The machine learning model is not robust to noise in the data.

knmnyn commented 4 years ago

@mbollmann, @akoehn: Any input on the P16 abstracts? After talking with @abhinavkashyap, we think you'd be able to tell us whether there's a good way to combine the SciWING output and the Tika output to get the most appropriate result.

Where @abhinavkashyap needs the most help is in creating clean(er) plain text from the PDFs. Quirks like "A B S T R A C T" and other hyphenation or mis-recognition artifacts may account for most of the errors. We believe the abstract extraction itself is solvable.

mbollmann commented 4 years ago

I can look at the abstracts sometime later this week, and will also compare them to what my simple pipeline produces.

For dealing with hyphenation, I currently have a simple heuristic based on the wordfreq package: hyphenation is removed iff the un-hyphenated word exists (has frequency > 0) and has a higher frequency than the components on their own. It seems to work well, but I'll have to take a closer look.
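In code, the rule is roughly this (a sketch; wordfreq's `word_frequency` returns 0.0 for unknown words):

```python
# Sketch of the dehyphenation rule: join "left-" + "right" across a line
# break iff the joined form is a known word and is more frequent than
# either component on its own.
from wordfreq import word_frequency

def join_hyphenated(left, right, lang="en"):
    joined = left + right
    f_joined = word_frequency(joined, lang)
    if f_joined > 0 and f_joined > max(word_frequency(left, lang),
                                       word_frequency(right, lang)):
        return joined              # e.g. "exam-" + "ple" -> "example"
    return left + "-" + right     # e.g. "phrase-" + "based" stays hyphenated
```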

For getting cleaner text from the PDFs, I don't have a ready-made solution. Have you looked at why the A00 abstracts appear to be worse than what I got from ParsCit? Would it make sense to somehow utilize the ParsCit versions (from ACL ARC) as a second signal?

Other than that, OCR post-correction is a thing, right? Surely NLP must have produced a tool somewhere that can help with this... :-) If no-one has any concrete pointers, I can also do some research here later this week or the next.

knmnyn commented 4 years ago

Just some background: for older PDFs in the Anthology (especially ones that were scanned in as rasters), I ran Adobe Acrobat on the original PDF sources to insert a machine-readable layer. I'm pretty sure I replaced the original documents with these enhanced ones, so the current Anthology and the ACL ARC should have exactly the same PDF files.

But the text extraction in the ACL ARC is better; that is because we used a commercial OCR system (Nuance's OmniPage, then version 15) to extract the text. It was brutal because it needed to be run on a Windows Server pipeline that crashed unpredictably.

@abhinavkashyap if you want to use the text from the ACL ARC, you can just take it directly from the directory structure there. We did try to organize it well so it should be pretty transparent. The canonical v2 of the ACL ARC still sits on the VM at acl-arc.comp.nus.edu.sg .

mbollmann commented 4 years ago

FWIW, I don't think attaching the P16 abstracts to your issue worked, @abhinavkashyap.

knmnyn commented 4 years ago

Yes, I see that it didn't work. Funny, I tried the link a few days back and seem to recall that it was working. @abhinavkashyap, perhaps you can try again or put up a link to an open GDrive file?

mbollmann commented 4 years ago

Re the hyphenation, I browsed through P16 and my approach currently fails for some words that wordfreq apparently doesn't know about:

annota-tors
corefer-ence
geospa-tional
la-belers
reg-ularizer
rerank-ing
sum-marizer
system-aticity

It also fails with proper nouns:

MaltOpti-mizer
Morfes-sor

But in general, the cases where the hyphenation is correctly not removed (phrase-based, context-aware, low-resource, word-level, ...) are the vast majority.

abhinavkashyap commented 4 years ago

@mbollmann @knmnyn Here is the link for the abstracts https://drive.google.com/file/d/17lL_sh0ylj0yLr39DXUWtzHZgYe3_My3/view?usp=sharing

mbollmann commented 4 years ago

Thanks @abhinavkashyap! They generally look very good to me. I compared them with my own Tika pipeline, and they're mostly identical, and also appear to have the same problems; e.g., the footnotes in P16-1036 are interpreted as being part of the abstract.

My thoughts on how to proceed: I've now extracted 38k+ abstracts from a combination of ACL ARC and my own Tika pipeline. I think it would make sense to compile a list of volumes where this approach produces many bad results, and then focus our efforts on those and see if we can improve them with SciWING. However, this means I have 330 XML files to skim through, so it might take me a bit :)

abhinavkashyap commented 4 years ago

Thanks @mbollmann . Please let me know when there is an update on the situation 👍 :)

abhinavkashyap commented 3 years ago

Hi @mbollmann and @davidweichiang. Do we have any updates on this?