gen2brain / go-fitz

Golang wrapper for the MuPDF Fitz library
GNU Affero General Public License v3.0

ImagePNG() is a LOT slower than ImageDPI() ? #69

Closed paulolops closed 1 year ago

paulolops commented 1 year ago

Hello,

I'm seeing a strange performance difference between two approaches:

  1. using ImageDPI to create a file that is later read into a byte array
  2. using ImagePNG directly (to avoid creating, reading, and deleting these PNG files)

The difference is enormous: 23s for 47 pages with ImageDPI() versus 30s for 6 pages with ImagePNG(). Is there any reason that could cause this much difference?

The pages are passed to these functions concurrently and the results are stored in channels. The process runs in an AWS Lambda function with more than 10GB of memory (only 300MB of which is used at the end, according to AWS).

Please tell me if you need additional information.

Here are the two different ways I run these functions:

With ImageDPI (this runs in parallel):

img, errImg := doc.ImageDPI(n, 500)
if errImg != nil {
    errors <- errImg
    return
}

f, errWrite := os.Create(filepath.Join(outputDir, fmt.Sprintf("%03d.png", n)))
if errWrite != nil {
    errors <- errWrite
    return
}

defer f.Close()

errEncode := png.Encode(f, img)
if errEncode != nil {
    errors <- errEncode
    return
}

Then the images are read later on to get the bytes back:

imageBytes, err := ioutil.ReadFile(imageFile)
if err != nil {
    logr.Errorln(err)
    resultsChan <- Result{Error: err}
}

With ImagePNG (this runs in parallel):

imgBytes, errImg := doc.ImagePNG(n, 500)
if errImg != nil {
    result <- ImageFromPdf{Error: errImg}
    return
}

logr.Infoln("done creating png for page ", n)
result <- ImageFromPdf{
    ImageBytes: imgBytes,
    Error:      nil,
}

Then the results from the channel are collected in allPagesBytes:

var allPagesBytes [][]byte
for n := 0; n < numPage; n++ {
    result := <-resultsChan
    if result.Error != nil {
        logr.Errorln("failed creating imgs from pdf :", result.Error)
        return nil, result.Error
    } else {
        allPagesBytes = append(allPagesBytes, result.ImageBytes)
    }
}

Please tell me if you need any extra information. Thanks for the help, Baptiste FIORiNA

paulolops commented 1 year ago

I went ahead and ran the whole function locally, without AWS Lambda. Here are the results when run with the Unix "time" command in front of the command line:

With ImageDPI: 70.14s user 4.51s system 326% cpu 22.828 total

With ImagePNG: 40.45s user 5.97s system 88% cpu 52.686 total

The user time seems to be the total CPU time across all cores.

Maybe I'm misunderstanding, but ImagePNG seems to handle parallelization much worse than ImageDPI?

gen2brain commented 1 year ago

@paulolops ImagePNG was added because it was a faster method, to directly get PNG bytes from MuPDF. None of the Image* functions support concurrency, see https://github.com/gen2brain/go-fitz/issues/4, so you probably should not use them this way.