fancier processing for science parse

maxaalexeeva commented 2 years ago

Adding methods to find paragraph breaks within science-parse sections (if a period is followed by a line break and then by a capital letter within a science-parse "section", it is likely a new paragraph), plus some minor cleanup within a section. This is basically a translation of my Python science-parse cleaning code into Scala, so reasonably high confidence.
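The paragraph-break heuristic described above can be sketched in a few lines of Python. This is only an illustration of the rule (period, line break, capital letter), not the code from this PR; the function name `split_paragraphs` and the ASCII-only `[A-Z]` check are assumptions.

```python
import re

# Heuristic from the description: a period, then a line break, then a capital
# letter most likely marks a new paragraph inside a science-parse section.
# Lookarounds keep the period and the capital letter in the output.
PARAGRAPH_BREAK = re.compile(r"(?<=\.)\n(?=[A-Z])")


def split_paragraphs(section_text: str) -> list[str]:
    """Split one section into likely paragraphs (hypothetical helper)."""
    return [part for part in PARAGRAPH_BREAK.split(section_text) if part]


section = "First sentence.\nSecond para starts\nhere and ends.\nThird."
print(split_paragraphs(section))
```

Line breaks that do not follow a period stay inside the paragraph, so they are still available to later line-break handling.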

kwalcock commented 2 years ago

Cool. I had not yet even considered how to do converter-specific corrections. I'll look at it fairly carefully.

kwalcock commented 2 years ago

I'm not yet sure where to put this. There is already a general paragraph preprocessor that is used for all PDFs. If ScienceParse needs a special version, and that does seem reasonable, perhaps I should specialize a ParagraphPreprocessor and put the functionality there. Some factory class might take an argument that specifies which PDF converter was used and then return a ParagraphPreprocessor that matches it and includes your code.

The page number remover is neat. Any idea how well it extends to other converters? I can see leaving that where it is for now.

Hyphenation is supposed to be handled by some LineBreakPreprocessor and WordBreak...Preprocessors that not only check for the hyphen but also have a language model to check whether it is reasonable to rejoin the parts of the word. That wouldn't work if the hyphens have already been removed. Replacing \n with a space is also something done by other stages. So, some of the changes are at odds with the rest of the design. Do you see a good way to reconcile them?
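The factory idea could look roughly like the following Python sketch. Everything here is hypothetical: the converter keys (`"scienceparse"`, `"tika"`), the function names, and what each preprocessor does are placeholders for illustration, not the project's actual classes.

```python
import re
from typing import Callable, Dict

Preprocessor = Callable[[str], str]


def general_paragraph_preprocessor(text: str) -> str:
    """Placeholder for the preprocessor shared by all converters."""
    return text.strip()


def science_parse_paragraph_preprocessor(text: str) -> str:
    """Placeholder variant: mark likely paragraph breaks before trimming."""
    marked = re.sub(r"(?<=\.)\n(?=[A-Z])", "\n\n", text)
    return marked.strip()


# Hypothetical registry keyed by the converter name given on the command line.
PREPROCESSORS: Dict[str, Preprocessor] = {
    "scienceparse": science_parse_paragraph_preprocessor,
}


def paragraph_preprocessor_for(converter: str) -> Preprocessor:
    """Return the converter-specific preprocessor, else the general one."""
    return PREPROCESSORS.get(converter, general_paragraph_preprocessor)


print(repr(paragraph_preprocessor_for("scienceparse")("A.\nB")))
print(repr(paragraph_preprocessor_for("tika")(" x ")))
```

With a lookup like this, converter-specific behavior stays in one place and every other converter falls back to the general preprocessor.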

kwalcock commented 2 years ago

There are also other possibilities, like calling it a HabitusScienceParseConverter or MashaScienceParseConverter that would normally be used without the other preprocessors that conflict. So, when -converter is this, make sure -paragraph is false because that will already have been done.

Your Python code might also be included in the project, even just for reference, especially if it is not managed elsewhere. I'm working on getting ScalaPy (https://scalapy.dev/) integrated to call Egoitz' code and yours might also be a good candidate.

maxaalexeeva commented 2 years ago

@kwalcock thanks for looking! I don't have strong attachments to what I did here other than the trimming part (regular trimming makes those within-section paragraphs indistinguishable) and the page numbers. The rest can (probably?) be skipped since it's done elsewhere although I think the patterns were reasonably consistent and I do not know if they do a better or a worse job than the language model.

> The page number remover is neat. Any idea how well it extends to other converters?

Not sure, but looks like the \n\d+\n pattern (I now realize I made a mistake there and it should be +, not *) holds for this example from tika:

 mem-
218
bership.

But it does not work for the other examples, e.g.:

2198 WORLD DEVELOPMENT

This is a page header (or whatever it's called), and it sits between the page-broken parts of one paragraph :(
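A small Python check of the two cases above, and of why `+` matters instead of `*`. The replacement string `"\n"` is an assumption; the PR's actual replacement may differ.

```python
import re

# A line that consists only of digits, using + as corrected above.
PAGE_NUMBER = re.compile(r"\n\d+\n")


def remove_page_numbers(text: str) -> str:
    """Drop standalone numeric lines, keeping a single line break."""
    return PAGE_NUMBER.sub("\n", text)


tika_example = " mem-\n218\nbership."
header_example = "text\n2198 WORLD DEVELOPMENT\nmore"

print(repr(remove_page_numbers(tika_example)))    # ' mem-\nbership.'
print(repr(remove_page_numbers(header_example)))  # unchanged: the line is not digits only

# With * the pattern also matches an empty line, so blank lines get collapsed.
print(repr(re.sub(r"\n\d*\n", "\n", "a\n\nb")))   # 'a\nb'
```

The header case would need a separate pattern, since the number shares its line with other text.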

kwalcock commented 2 years ago

In some PDF converters it is possible to convert by page. There should only be one page number on each page and they are likely in ascending order. We're not making use of this information.
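As a rough sketch of how that information might be used, assuming the converter can hand back one string per page: collect the digits-only lines on each page and keep the candidates that fit a single "page index plus constant offset" sequence. The function name and the voting scheme are illustrative assumptions, not existing project code.

```python
import re
from collections import Counter


def find_page_numbers(pages: list[str]) -> dict[int, int]:
    """Map page index -> printed page number, assuming numbers rise by one per page."""
    offsets: Counter = Counter()
    candidates = []
    for index, page in enumerate(pages):
        # Candidate page numbers are lines made up of ASCII digits only.
        numbers = {
            int(line.strip())
            for line in page.splitlines()
            if re.fullmatch(r"[0-9]+", line.strip())
        }
        candidates.append(numbers)
        for number in numbers:
            offsets[number - index] += 1
    if not offsets:
        return {}
    # The offset supported by the most pages is taken as the real numbering.
    best_offset, _ = offsets.most_common(1)[0]
    return {
        index: index + best_offset
        for index, numbers in enumerate(candidates)
        if index + best_offset in numbers
    }


pages = ["intro\n217\ntext 5", "mem-\n218\nbership.\n12", "more\n219"]
print(find_page_numbers(pages))
```

In this example the stray `12` on the second page is ignored because it does not fit the ascending sequence.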

kwalcock commented 2 years ago

@maxaalexeeva, can you explain this comment more?

// This handles page numbers that end up inside a paragraph when the paragraph break is between pages

At that point in the code there are sections and each section is divided into paragraphs. Does the comment mean that one paragraph is on one page and the next is on the next page, and then the page number might be towards the end of the first paragraph and/or near the beginning of the next paragraph? Or does it mean more that there is one paragraph that spans two pages and there are page numbers inside it, maybe more towards the middle? It might not especially make a difference, but it would help me understand the situation. In the first case there would probably be only one page number to remove. In the second there might be two, for example. In either case the replaceAll could remove too much and take out values from tables or other things.

The "assimilated" version is being worked on in #42.

maxaalexeeva commented 2 years ago

I think it's the second (a paragraph spanning pages such that the number is mid-paragraph).
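For that mid-paragraph case, one way to limit the risk of removing table values is to delete only a line that equals an already known page number, and only once. This Python sketch assumes the page number is known from elsewhere; `remove_known_page_number` is a hypothetical helper, not part of the PR.

```python
import re


def remove_known_page_number(paragraph: str, page_number: int) -> str:
    """Remove at most one standalone line equal to the given page number."""
    pattern = rf"\n{page_number}\n"
    return re.sub(pattern, "\n", paragraph, count=1)


print(repr(remove_known_page_number("mem-\n218\nbership.", 218)))
print(repr(remove_known_page_number("a\n12\nb\n218\nc", 218)))
```

Other standalone numbers, such as the `12` in the second call, are left alone because they do not match the expected page number.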

kwalcock commented 2 years ago

This is being replaced by #42. The changes of this PR are recorded in a commit there in case they need to be recovered.

clulab / pdf2txt

fancier processing for science parse #37