BMKEG / lapdftext

LA-PDFText is a system for extracting accurate text from PDF-based research articles, with an interface for improving performance where needed. The system is open-source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. The system works quite well for most applications, although it will occasionally make mistakes and extract the wrong text, and it is always possible to 'hack' your own rules to improve performance.
GNU General Public License v3.0

Problem classifying PDF blocks #13

Closed juanbits closed 10 years ago

juanbits commented 11 years ago

Hello, I'm having problems identifying the blocks in a PDF using blockifyClassify.

I use

with: ./blockify folderinput/ folderoutput/

I get something like this:

<?xml version="1.0" encoding="UTF-8"?>

2012 IEEE WIC ACM International Conferences on Web Intelligence and Intelligent Agent Technology Indoor Localization and Guidance using Portable Smartphones ##

And with: ./blockifyClassify folderinput/ folderoutput/ general.dlr

I get:

<?xml version="1.0" encoding="UTF-8"?>

My general.drl for identifying the title is:

package edu.isi.bmkeg.pdf.classification.rules

import edu.isi.bmkeg.lapdf.features.ChunkFeatures;
import edu.isi.bmkeg.lapdf.model.ChunkBlock;

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
    when
        ChunkFeatures(pageNumber==1)
        ChunkFeatures(mostPopularFontSize==13)
    then
        chunk.setType(chunk.TYPE_TITLE);
end

The title in the PDF has style="font-size:13pt".

Can you help me write the correct rule?

Thanks

juanbits commented 11 years ago

Test PDF: https://docs.google.com/file/d/0B3uZkLrOSoxbVmxQZlh2RFFUaE0/edit?usp=sharing

GullyAPCBurns commented 11 years ago

Ok. We've added a couple of things.

1) Full JAXB support for the PubMed Central open access format, so we should now be able to both read and write the open access formats. The system should generate fuller XML output in this format.

2) A new project called lapdftext-rules, designed to help develop unit tests and test cases for rules. I'd recommend using the Excel spreadsheets to generate rule files, since that's easier for me. Keep an eye on this project, since I'd like to develop a more complete library of rule files for use by the community.

Best

Gully

GullyAPCBurns commented 10 years ago

Anything else to add to this?
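
For reference, here is a minimal sketch of the "Title" rule from the original post with both conditions written as constraints on a single ChunkFeatures pattern, so they are guaranteed to be tested against the same fact. It assumes that pageNumber, mostPopularFontSize and TYPE_TITLE are available exactly as used in the post; none of this is checked against the lapdftext sources, and it is not a confirmed fix for the empty output.

```drools
package edu.isi.bmkeg.pdf.classification.rules

import edu.isi.bmkeg.lapdf.features.ChunkFeatures;
import edu.isi.bmkeg.lapdf.model.ChunkBlock;

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
    when
        // Comma-separated constraints inside one pattern are ANDed
        // and evaluated against the same ChunkFeatures fact.
        ChunkFeatures( pageNumber == 1, mostPopularFontSize == 13 )
    then
        chunk.setType( chunk.TYPE_TITLE );
end
```

Two things may be worth checking alongside the rule itself: the command in the original post passes general.dlr while the rule file is referred to as general.drl, and the font size that the extractor reports for the title block may differ from the 13pt shown in the style attribute.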