clulab / pdf2txt

Convert PDF files to TXT
Apache License 2.0

Preprocess #4

Closed (kwalcock closed 2 years ago)

kwalcock commented 2 years ago

This still needs to be tested at a larger scale and evaluated for effectiveness. One TODO remains for the language model. Otherwise it is arranged the way I think it should be, and it works.

MihaiSurdeanu commented 2 years ago

This looks great, thanks @kwalcock!

One wrinkle: the language model will probably be in Python using HuggingFace transformers. I think we will need a py4j interface to this LM that follows yours.

kwalcock commented 2 years ago

OK. When the time comes...