maxaalexeeva closed this 2 years ago
I think this would be hard to detect... What do you think @kwalcock ?
The pdf2txt converter already has this capability. However, it is not turned on by default. I don't remember exactly why, but perhaps it was thought that some downstream component could treat those spaces as significant. If you are working with the GitHub project, you can change
case class Hyperparameters(joinWithSpaces: Boolean = false)
to use true.
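For anyone who wants to see what this kind of joining could look like, here is a minimal sketch. It is not the pdf2txt implementation: the object name ThousandsJoiner, the method join, and the regular expression are all assumptions made for illustration. It only handles plain ASCII spaces and would wrongly merge two adjacent numbers that happen to look like thousands groups (for example "5 000 100" meaning two separate values).

```scala
// Hypothetical helper, not part of pdf2txt: joins digit groups such as "12 500" into "12500".
object ThousandsJoiner {
  // One group of 1-3 digits followed by one or more groups of exactly 3 digits,
  // each preceded by a single space. The lookbehind and lookahead keep the match
  // from starting or ending in the middle of a longer run of digits.
  private val spacedNumber = """(?<![\d.,])\d{1,3}(?: \d{3})+(?!\d)""".r

  // Removes the spaces inside every matched number and leaves all other text alone.
  def join(text: String): String =
    spacedNumber.replaceAllIn(text, m => m.matched.replace(" ", ""))
}
```

With this sketch, ThousandsJoiner.join("12 500 ha") returns "12500 ha", while a year followed by a count, as in "in 2019 350 farmers", is left unchanged because the first group has four digits.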
The setting isn't available from the command line or a configuration file. I'll at least add the latter for now. The command line is already awfully crowded.
It probably won't stay this way forever, but temporarily one can make a configuration file like habitus.conf containing
Pdf2txt {
  numberParameters {
    joinWithSpaces = true
  }
}
and then call PdfToTextApp with -conf habitus. Otherwise one can edit Pdf2txt.conf and change the value from false to true.
@kwalcock thank you!
Some large numbers have their thousands groups separated by a space, so the groups get tokenized separately. Do we want to normalize them somewhere in the reading projects or during PDF-to-text conversion? Here's a screenshot from a PDF with a couple of examples: