Closed juanbits closed 10 years ago
Ok. We've added a couple of things.
1) Full JAXB support for the PubMed Central open access format, so we should now be able to both read and write that format. The system should generate fuller XML output in this format.
2) A new project called lapdftext-rules that is designed to help develop unit tests and test cases. I'd recommend using the Excel spreadsheets to generate rule files, since that's easier for me. Keep an eye on this project, since I'd like to develop a more complete library of rule files for use by the community.
Best
Gully
Any other things to add to this?
Hello, I'm having problems identifying the blocks from a PDF using blockifyClassify.
I use
With: ./blockify folderinput/ folderoutput/
I get something like this:
<?xml version="1.0" encoding="UTF-8"?>
And with: ./blockifyClassify folderinput/ folderoutput/ general.dlr
I get:
<?xml version="1.0" encoding="UTF-8"?>
My general.drl to identify the title is:
package edu.isi.bmkeg.pdf.classification.rules

import edu.isi.bmkeg.lapdf.features.ChunkFeatures;
import edu.isi.bmkeg.lapdf.model.ChunkBlock;

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
when
    ChunkFeatures(pageNumber==1)
    ChunkFeatures(mostPopularFontSize==13)
then
    chunk.setType(chunk.TYPE_TITLE);
end
The title of the PDF has style="font-size:13pt".
Can you help me write the right rule?
Thanks
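
One thing that may be worth trying, offered as a sketch rather than a confirmed fix: in Drools, two separate ChunkFeatures patterns in the "when" part are matched independently, whereas a single pattern with comma-separated constraints requires one and the same fact to satisfy both conditions. The version below assumes that pageNumber and mostPopularFontSize are readable properties of ChunkFeatures, exactly as used in the rule above, and that the extracted font size is really reported as 13. Neither assumption is verified here, and the package, imports, and global are copied unchanged from the original file.

```
package edu.isi.bmkeg.pdf.classification.rules

import edu.isi.bmkeg.lapdf.features.ChunkFeatures;
import edu.isi.bmkeg.lapdf.model.ChunkBlock;

global ChunkBlock chunk;

rule "Title"
    activation-group "blockClassification"
    salience 4
when
    // Single pattern: the same ChunkFeatures fact must be on page 1
    // AND have 13 as its most popular font size (comma means AND).
    ChunkFeatures(pageNumber == 1, mostPopularFontSize == 13)
then
    // Same action as the original rule.
    chunk.setType(chunk.TYPE_TITLE);
end
```

If the title is still not classified, it may help to check what font size the blockify output actually reports for that block, since it is not guaranteed to equal the 13pt seen in the style attribute.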