clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

space-separated large numbers #48

Closed: maxaalexeeva closed this issue 2 years ago

maxaalexeeva commented 2 years ago

Some large numbers use a space as the thousands separator, so each group of digits gets tokenized as a separate number. Do we want to normalize them somewhere in the reading projects, or during the PDF-to-text conversion? Here's a screenshot from a PDF with a couple of examples:

[Image: Screenshot from 2022-06-19 21-28-04]

MihaiSurdeanu commented 2 years ago

I think this would be hard to detect... What do you think, @kwalcock?

kwalcock commented 2 years ago

The pdf2txt converter already has this capability, but it is not turned on by default. I don't remember exactly why; perhaps it was thought that something could consider those spaces significant. If you are working with the GitHub project, you can change

case class Hyperparameters(joinWithSpaces: Boolean = false)

to use true.
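For readers unfamiliar with what such a setting does, here is a minimal sketch of joining space-separated thousands groups. It is an illustration only and not the pdf2txt implementation: the object name `JoinSpacedNumbers`, the regular expression, and the assumption that a plain ASCII space is the separator are all hypothetical.

```scala
// Hypothetical sketch, NOT the pdf2txt code: joins digit groups such as
// "1 250 000" into "1250000".
object JoinSpacedNumbers {
  // 1 to 3 leading digits, then one or more groups of exactly 3 digits,
  // each preceded by a single space. The lookarounds prevent a match from
  // starting or ending in the middle of a longer run of digits.
  private val spacedNumber = """(?<!\d)\d{1,3}(?: \d{3})+(?!\d)""".r

  // Remove the spaces inside every match; all other text is left untouched.
  def joinWithSpaces(text: String): String =
    spacedNumber.replaceAllIn(text, m => m.matched.replace(" ", ""))

  def main(args: Array[String]): Unit = {
    // prints: Production rose to 1250000 tons in 2021.
    println(joinWithSpaces("Production rose to 1 250 000 tons in 2021."))
  }
}
```

A pattern like this will also join two genuinely separate numbers that happen to fit the shape, for example "12 345" where 12 and 345 are adjacent table cells, which may be one reason to keep such a step off by default.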

The setting isn't available from the command line or a configuration file. I'll at least add the latter for now. The command line is already awfully crowded.

kwalcock commented 2 years ago

It probably won't stay this way forever, but temporarily one can make a configuration file like habitus.conf containing

Pdf2txt {
  numberParameters {
    joinWithSpaces = true
  }
}

and then call PdfToTextApp with -conf habitus. Otherwise one can edit Pdf2txt.conf and change the value from false to true.

maxaalexeeva commented 2 years ago

@kwalcock thank you!