MedKhem / grobid-dictionaries


Check why pdf2xml is run 3 times for each pdf #9

Closed lfoppiano closed 7 years ago

lfoppiano commented 7 years ago

Here is the log from generating training data for one PDF:

18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of Initialization of dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating names
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of names
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating country codes
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of country codes
18 May 2017 08:12.02 [INFO ] WapitiModel               - Loading model: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti (size: 155377)
[Wapiti] Loading model: "/Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti"
Model path: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti
18 May 2017 08:12.02 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:95ms
18 May 2017 08:12.03 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:47ms
1 files to be processed.
1 files processed in 454 milliseconds
Johan:grobid-dictionaries lfoppiano$ ls

lfoppiano commented 7 years ago

I've fixed it by slimming down the body segmentation and lexical entry parsers.
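
One way to check a fix like this is to count how many times pdftoxml is launched per input PDF in a debug log. The following is only a minimal sketch: it assumes the log layout shown in this issue (a `DocumentSource - Executing command:` line with the PDF path in single quotes), and the function name `count_pdf2xml_runs` is hypothetical, not part of GROBID.

```python
import re
from collections import Counter

# Matches the "Executing command:" DEBUG line and captures the single-quoted
# PDF path. Assumption: the log layout is the one pasted in this issue.
# The near-duplicate "Executing:" line is deliberately not matched, so each
# launch is counted once.
LAUNCH = re.compile(r"DocumentSource\s+- Executing command: .*?'([^']+\.pdf)'")


def count_pdf2xml_runs(log_text):
    """Return a Counter mapping each PDF path to its number of pdftoxml launches."""
    runs = Counter()
    for line in log_text.splitlines():
        match = LAUNCH.search(line)
        if match:
            runs[match.group(1)] += 1
    return runs
```

For example, `count_pdf2xml_runs(open("grobid.log").read())` (the file name is a placeholder) returns a count per PDF; any value above 1 means that PDF was converted more than once. Applied to the excerpt above, it counts two launches for `BasicEnglish30.pdf`.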