clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0
31 stars 5 forks

Add scaffolding for NumbersPreprocessor #15

Closed kwalcock closed 2 years ago

MihaiSurdeanu commented 2 years ago

Cool cool.

kwalcock commented 2 years ago

@hubert10, can you make sense of this PR? It was suggested that a regular expression could solve the problem. There is an incorrect one in NumbersPreprocessor and some tests in TestNumbersPreprocessor that would fail when uncommented. For an initial run on a corpus one would probably want to print some record of the actions taken, so there is a simple logger. I guess I would worry about false positives. One could otherwise do a diff between input and output to find the information.

hubert10 commented 2 years ago

Hi @kwalcock,

I added some code for merging numbers. The last two tests fail because they contain non-digit characters, and I am trying to figure out how to handle that. I am not able to push to this repo even though it is public. Do I have write access? Or maybe something is wrong with my tokens.

Thanks

kwalcock commented 2 years ago

No, you didn't have write access. Sorry. Can you try again? Thanks.


hubert10 commented 2 years ago

It works now, Thanks

kwalcock commented 2 years ago

My thought on those other tests was that sometimes the digits aren't really parts of mathematical numbers, but things like models. "Of model S42, 153 are in stock." So, to make sure it's just digits, the left \d+ should be at the beginning or be preceded by \s+ and the right \d+ should be at the end or followed by \s+. It was just an idea, though, and should be adapted to what happens in the text.
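A minimal sketch of that boundary idea, written in Python rather than the project's own code. The pattern and the names `SPLIT_NUMBER`, `NAIVE`, and `join_split_numbers` are hypothetical and are not taken from NumbersPreprocessor; the sketch also assumes the numbers to merge are digit groups separated by a comma with stray spaces around it.

```python
import re

# Hypothetical pattern, not the one in NumbersPreprocessor.
# (?<!\S) : the left digits start the text or follow whitespace.
# (?!\S)  : the right digits end the text or precede whitespace.
SPLIT_NUMBER = re.compile(r"(?<!\S)(\d+)[ \t]*,[ \t]*(\d+)(?!\S)")

# The same pattern without the boundary checks, for comparison.
NAIVE = re.compile(r"(\d+)[ \t]*,[ \t]*(\d+)")


def join_split_numbers(text: str) -> str:
    """Remove spaces around the comma inside a standalone number."""
    return SPLIT_NUMBER.sub(r"\1,\2", text)


sentence = "Of model S42, 153 are in stock."
print(NAIVE.sub(r"\1,\2", sentence))        # joins "42" and "153": a false positive
print(join_split_numbers(sentence))         # leaves the sentence unchanged
print(join_split_numbers("yield was 1 ,250 kg"))
```

With these boundaries a number directly followed by punctuation, such as "1 ,250.", is not merged either, so the checks would likely need loosening depending on what the corpus contains.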

kwalcock commented 2 years ago

@hubert10, could you run Number2logDir on a problematic corpus? (I notice that you might have to change a path in the code or just not write the modified file content.)

When I did it, the only numbers being joined were actually ones separated across paragraphs (or lines), and that should probably not happen. The regex might be changed to avoid this, but there is also the option of iterating through the document by paragraph and only checking for number problems within a paragraph. That kind of thing has been necessary for other Preprocessors.

file                 before   after
gueye2015effect.txt  0\n\n,0  0,0
gueye2015effect.txt  0\n\n,0  0,0
gueye2015effect.txt  0\n,0    0,0
gueye2015effect.txt  0\n\n,0  0,0
gueye2015effect.txt  0\n,0    0,0
gueye2015effect.txt  0\n,0    0,0
gueye2015effect.txt  0\n\n,0  0,0

hubert10 commented 2 years ago

Hi Keith,

The corpus I have does not seem to have any candidates for merging so I asked Masha to share what she used for planted areas extraction. Once I have it I will rerun the Number2logDir.

For numbers separated across paragraphs (or lines), I still have one for which the StringUtils.escape(string) method does not parse:

file                               before   after
RS2018_estimating-sowing-date.txt  1\n\n,2  1,2

kwalcock commented 2 years ago

@hubert10, I'm not sure what you mean. StringUtils.escape(string) makes sure that the before value is there on one line between tabs instead of being

1

,2
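To illustrate what such escaping does for the log, here is a small Python stand-in. It is not the project's StringUtils.escape, whose exact behavior is not shown in this thread; `escape` and `log_row` are hypothetical helpers that only assume newlines, tabs, carriage returns, and backslashes need to be made visible so that each record stays on one tab-separated line.

```python
def escape(value: str) -> str:
    """Hypothetical stand-in: make control characters visible on one line."""
    return (
        value.replace("\\", "\\\\")  # escape backslashes first
        .replace("\n", "\\n")
        .replace("\r", "\\r")
        .replace("\t", "\\t")
    )


def log_row(file_name: str, before: str, after: str) -> str:
    """Build one tab-separated log record: file, before, after."""
    return "\t".join([file_name, escape(before), escape(after)])


# The "before" text spans three lines in the document but one line in the log.
print(log_row("RS2018_estimating-sowing-date.txt", "1\n\n,2", "1,2"))
```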

It seems like the after value is what we hope for, except that the number was split across paragraphs. Would you like to fix that?
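One possible shape for that fix, following the earlier suggestion to iterate through the document by paragraph, is sketched below in Python. It is not the project's code. `SPLIT_NUMBER`, `PARAGRAPH_BREAK`, and `join_within_paragraphs` are hypothetical names, and the sketch assumes paragraphs are separated by blank lines.

```python
import re

# Hypothetical pattern: \s* also matches newlines, so applied to a whole
# document it would turn "1\n\n,2" into "1,2", as in the log above.
SPLIT_NUMBER = re.compile(r"(\d+)\s*,\s*(\d+)")

# A paragraph break is assumed to be a whitespace run with two or more newlines.
# The capturing group makes re.split keep the separators in its result.
PARAGRAPH_BREAK = re.compile(r"(\n\s*\n)")


def join_within_paragraphs(text: str) -> str:
    """Apply the number fix to each paragraph separately."""
    parts = PARAGRAPH_BREAK.split(text)
    # Even indices hold paragraphs, odd indices hold the separators.
    for i in range(0, len(parts), 2):
        parts[i] = SPLIT_NUMBER.sub(r"\1,\2", parts[i])
    return "".join(parts)


print(repr(SPLIT_NUMBER.sub(r"\1,\2", "1\n\n,2")))   # whole document: joined
print(repr(join_within_paragraphs("1\n\n,2")))       # by paragraph: left alone
print(repr(join_within_paragraphs("area of 1 ,250 ha\n\nnext 3")))
```

This variant still joins numbers split by a single line break inside a paragraph. If that should not happen either, replacing `\s*` with `[ \t]*` in the hypothetical pattern would restrict matches to a single line.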