allenai / papermage

library supporting NLP and CV research on scientific papers
https://papermage.org
Apache License 2.0

Problems when parsing older papers in PDF format #61

Open SherryPan0 opened 8 months ago

SherryPan0 commented 8 months ago

Hi, thanks for this great toolkit! I tried papermage on several PDF files. It works really well on recent papers, but when I tried to parse some papers published in 1980 and 1989, it failed to segment the sentences: each "sentence" comes out as a single word or short fragment.


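# Assumes the default CoreRecipe pipeline from the papermage README.
from papermage.recipes import CoreRecipe

recipe = CoreRecipe()
doc = recipe.run("1980.pdf")

# Print each predicted sentence on its own line.
for sen in doc.sentences:
    print(sen.text)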
'''
output:
Received
January
1978;
revised
October
1979;
accepted
December 1979
References
1.
Avery,
K.
R.
,
and
Avery,
C.
A.
Design
and
development
of an interactive
statistical
system
(SIPS).
Proc.
Comptr.
Sci.
and
Statistics: 8th
Ann.
Symp.
on
'''
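
To put a number on how fragmented output like the above is, here is a minimal sketch in plain Python. The helper name `fragmentation_report` and the `max_words` threshold are hypothetical and not part of papermage. It only counts words per predicted sentence and makes no attempt to diagnose the cause.

```python
from statistics import mean


def fragmentation_report(sentences, max_words=2):
    """Summarize how fragmented a list of sentence strings is.

    `max_words` is an arbitrary cutoff (an assumption of this sketch):
    sentences with at most that many words count as "short".
    """
    # Word count of every non-empty sentence string.
    lengths = [len(s.split()) for s in sentences if s.strip()]
    if not lengths:
        return {"count": 0, "mean_words": 0.0, "short_fraction": 0.0}
    short = sum(1 for n in lengths if n <= max_words)
    return {
        "count": len(lengths),
        "mean_words": mean(lengths),
        "short_fraction": short / len(lengths),
    }


# The first eight "sentences" from the output above.
sample = [
    "Received", "January", "1978;", "revised",
    "October", "1979;", "accepted", "December 1979",
]
report = fragmentation_report(sample)
print(report)
```

With a parsed document, the same helper could be called as `fragmentation_report([sen.text for sen in doc.sentences])`. A short-sentence fraction close to 1.0 would suggest that sentence segmentation has broken down for that file.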

kyleclo commented 8 months ago

Interesting! Could you send me the PDF so I can have a look at it? Older PDFs are not something we have really investigated much.

SherryPan0 commented 8 months ago

1980.pdf, 1989.pdf

These are the two PDF files that I tested. Thanks!