Use stdin for pdftohtml

bitextor / pdf-extract

PDF parser and converter to HTML

GNU General Public License v3.0


Use stdin for pdftohtml #42

Closed kpu closed 4 years ago

kpu commented 4 years ago

I just tested

pdftohtml a.pdf foo.html

and

pdftohtml /dev/stdin foo.html <a.pdf

and got identical results.

So all the temporary file stuff is inefficient and unnecessary.

https://github.com/bitextor/pdf-extract/blob/4ad28a23817851355ba65b6b4699a8f01b2cb760/src/pdfextract/PDFToHtml.java#L43

As an aside, I'm not sure why you made a random string for createTempFile in the file suffix. It does the random string part for you.
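To illustrate the aside about createTempFile: a minimal sketch showing that `File.createTempFile` already inserts a unique random component between the prefix and the suffix, so the caller only needs a fixed suffix. The class name `TempFileDemo`, the method `newTempPdf`, and the `pdfextract-` prefix are hypothetical names chosen for this example, not code from the repository.

```java
import java.io.File;
import java.io.IOException;

final class TempFileDemo {
    /**
     * Creates an empty temp file in the default temp directory.
     * createTempFile picks a unique name of the form
     * prefix + generated characters + suffix, so no extra random
     * string is needed in the suffix.
     */
    static File newTempPdf() throws IOException {
        // "pdfextract-" is a hypothetical prefix (must be at least 3 characters).
        File file = File.createTempFile("pdfextract-", ".pdf");
        file.deleteOnExit(); // best-effort cleanup when the JVM exits
        return file;
    }
}
```

Two consecutive calls to `newTempPdf()` return files with different names that both end in `.pdf`.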

kpu commented 4 years ago

Note that pdftohtml - foo.html also works and might be more cross-platform.

kpu commented 4 years ago

Actually, you can use both stdin and stdout like so, though it still requires an output filename that it never uses, for... reasons.
pdftohtml -s -i -noframes -xml -stdout -fontfullname - nonsense <a.pdf

ramoelee commented 4 years ago

Hi @kpu

STDIN: the temporary file is still needed. When the input parameter is a ByteArrayInputStream, the system has to create a temporary file to store its contents. That file is passed to pdftohtml and deleted once the process is done.
STDOUT: the system will be modified to read the XML data directly from stdout instead of from a temporary XML file.

Thanks & regards, Romuelee

kpu commented 4 years ago

You need to copy bytes from the user's ByteArrayInputStream to the subprocess's getOutputStream(). getOutputStream() returns an OutputStream, which has a write method that accepts bytes.

You'll probably want a thread for it.
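A rough sketch of what that could look like, assuming the command is passed in as a list of arguments. The class name `StdinPipe` and the method `pipeThrough` are hypothetical, not part of pdf-extract. The writer thread matters because the child can block writing to a full stdout pipe while the parent is still blocked writing to its stdin; copying the input on a separate thread lets the main thread drain stdout at the same time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

final class StdinPipe {
    /** Runs the command, feeds input to its stdin on a helper thread, and returns its stdout. */
    static byte[] pipeThrough(List<String> command, InputStream input)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the child's stderr to our own stderr so an unread pipe cannot fill up.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        // Writer thread: copy the caller's bytes into the subprocess's stdin.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = input.read(buffer)) != -1) {
                    stdin.write(buffer, 0, n);
                }
            } catch (IOException e) {
                // The child may exit before reading everything; the exit
                // status check below reports real failures.
            }
            // Leaving the try block closes stdin, which signals end of input.
        });
        writer.start();

        // Main thread: drain the subprocess's stdout while the writer runs.
        ByteArrayOutputStream stdout = new ByteArrayOutputStream();
        try (InputStream fromChild = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = fromChild.read(buffer)) != -1) {
                stdout.write(buffer, 0, n);
            }
        }

        writer.join();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("command exited with status " + exit);
        }
        return stdout.toByteArray();
    }
}
```

With the invocation from earlier in this thread, the call would presumably be `StdinPipe.pipeThrough(Arrays.asList("pdftohtml", "-s", "-i", "-noframes", "-xml", "-stdout", "-fontfullname", "-", "nonsense"), pdfStream)`, where `pdfStream` is the caller's ByteArrayInputStream, and the returned bytes are the XML that previously went through a temporary file.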

lpla commented 3 years ago

The commit that fixed this issue (https://github.com/bitextor/pdf-extract/commit/6182d33afc2f56f3a2d7a5639712e1e9f54a96f0) introduced this other issue: https://github.com/bitextor/pdf-extract/issues/56