Problem classifying PDF blocks

juanbits commented 11 years ago

Hello, I'm having problems identifying the blocks in a PDF using blockifyClassify.

When I run:

./blockify folderinput/ folderoutput/

I get something like this:

<?xml version="1.0" encoding="UTF-8"?>

2012 IEEE WIC ACM International Conferences on Web Intelligence and Intelligent Agent Technology Indoor Localization and Guidance using Portable Smartphones ##

And when I run:

./blockifyClassify folderinput/ folderoutput/ general.dlr

I get:

<?xml version="1.0" encoding="UTF-8"?>

My general.drl rule to identify the title is:

package edu.isi.bmkeg.pdf.classification.rules

import edu.isi.bmkeg.lapdf.features.ChunkFeatures;
import edu.isi.bmkeg.lapdf.model.ChunkBlock;

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
    when
        ChunkFeatures(pageNumber==1)
        ChunkFeatures(mostPopularFontSize==13)
    then
        chunk.setType(chunk.TYPE_TITLE);
end

The title in the PDF has style="font-size:13pt".

Can you help me write the correct rule?

Thanks.

juanbits commented 11 years ago

Test PDF: https://docs.google.com/file/d/0B3uZkLrOSoxbVmxQZlh2RFFUaE0/edit?usp=sharing

GullyAPCBurns commented 11 years ago

Ok. We've added a couple of things.

1) Full JAXB support for the PubMed Central open access format, so we should now be able to both read and write the open access formats. The system should generate fuller XML output in this format.

2) A new project called lapdftext-rules, designed to help develop unit tests and test cases for rule files. I'd recommend using the Excel spreadsheets to generate rule files, since that's easier for me. Keep an eye on this project, since I'd like to develop a more complete library of rule files for use by the community.

Best

Gully

GullyAPCBurns commented 10 years ago

Anything else to add to this?
