ConvertAPI / convertapi-library-python

A Python library for the ConvertAPI
https://www.convertapi.com

Conversions take significantly more time compared to converting in the browser #46

Closed paulius-petkus closed 1 month ago

paulius-petkus commented 5 months ago

Tested PDF to Squeeze with a 90MB PDF file: https://www.convertapi.com/a/api/pdf-to-squeeze#snippet=python It took 4.5-5s to convert, plus a couple of seconds for file upload and download, so <10s overall. [image]

With the Python lib it took >60s: [image]

laurynas-convertapi commented 5 months ago

Hi @paulius-petkus,

could you please measure the individual steps (upload, API, download) to help us better understand where the major slowdown is?

You can measure like this:


import tempfile
import time

import convertapi

# `file` is the path to the PDF being tested
start_time = time.time()
upload_io = convertapi.UploadIO(open(file, 'rb'))
print(upload_io.file_id)
print("Upload time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))

print("The PDF saved to %s" % saved_files)

paulius-petkus commented 5 months ago

@laurynas-convertapi Hello, sure.

It seems upload is the biggest problem: [image]
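
The per-step measurement suggested earlier in this thread repeats the same start/stop pattern three times. A minimal sketch of a reusable timer is shown here; the name `timed` and the optional `results` dictionary are illustrative assumptions, not part of the convertapi library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s time taken: %.2f seconds" % (label, elapsed))


# Hypothetical usage with the calls from this thread:
# with timed("Upload"):
#     upload_io = convertapi.UploadIO(open(file_path, 'rb'))
# with timed("API"):
#     result = convertapi.convert('squeeze', {'File': upload_io}, from_format='pdf')
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring intervals.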

laurynas-convertapi commented 5 months ago

@paulius-petkus thanks! Very helpful.

Can you try adding this to the top of the script and see if it helps to reduce upload time:

from http.client import HTTPConnection
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 64*1024
    for x in HTTPConnection.__init__.__defaults__
)

paulius-petkus commented 5 months ago

@laurynas-convertapi I have added it to the very beginning of the script, but it didn't help. Or should I add it somewhere else?

My current code snippet and results: [image]

laurynas-convertapi commented 5 months ago

Yes, it should be at the top. Strange that it didn't have any effect, because I think the issue might be related to Python's httplib, as described here: https://github.com/psf/requests/issues/2181#issuecomment-713823366

Here is another suggestion, maybe this would work? https://stackoverflow.com/a/39518613
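
One way to check that the patch above actually changed the default is to read it back from the constructor signature. The sketch below assumes Python 3.7 or later, where `http.client.HTTPConnection` accepts a `blocksize` parameter; the helper names are made up for illustration. It only shows that the `http.client` default changed, not whether requests or urllib3 rely on that default when sending a file.

```python
import inspect
from http.client import HTTPConnection


def current_blocksize():
    """Return the default `blocksize` of http.client.HTTPConnection."""
    params = inspect.signature(HTTPConnection.__init__).parameters
    return params["blocksize"].default


def patch_blocksize(new_size):
    """Replace the 8192-byte default, as in the snippet from this thread."""
    HTTPConnection.__init__.__defaults__ = tuple(
        new_size if x == 8192 else x
        for x in HTTPConnection.__init__.__defaults__
    )


print("before:", current_blocksize())
patch_blocksize(64 * 1024)
print("after: ", current_blocksize())
```

If the value printed after patching is unchanged, the patch did not match any default and something else is overriding it.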

paulius-petkus commented 4 months ago

Hi @laurynas-convertapi, as far as I understand you do not have these problems and conversion takes way less time for you? If the problem does not reproduce on your machine, I think I will investigate what is wrong.

laurynas-convertapi commented 4 months ago

@paulius-petkus yes, I can't see a significant difference between uploading a file via command-line curl (tried on both Linux and Mac) and running the Python script (Python version 3.9.14).

For me, a 100MB file upload takes ~15s on a 5G connection.

If you have an option to run curl, I suggest testing like this:

time curl -F 'file=@files/100mb.pdf' https://v2.convertapi.com/upload

paulius-petkus commented 4 months ago

It seems everything is fine when I run it with curl. Here is my code that ran in windows command prompt:

@echo off
set start_time=%time%
curl -X POST https://v2.convertapi.com/upload?Secret=my_secret -F "File=@C:/Users/petku/Desktop/TEMP/large1.pdf"
set end_time=%time%
echo Start Time: %start_time%
echo End Time: %end_time%

It took a few seconds: [image]

When I upload the document with requests directly instead of using our library, it is also fast (3-7s instead of 45-55s). My Python version: 3.12.1

Here is full code with only secret changed:

import requests
import time
import convertapi
import tempfile

api_secret = 'XXXXXXXXX'

start_time = time.time()
file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
api_url = 'https://v2.convertapi.com/upload'
with open(file_path, 'rb') as file:
    response = requests.post(api_url, files={'file': file}, headers={'Authorization': f'Bearer {api_secret}'})
    upload_io = response.json()
print("Upload time taken: %s seconds" % (time.time() - start_time))

convertapi.api_secret = api_secret
start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io['Url']}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))

print("The PDF saved to %s" % saved_files)

laurynas-convertapi commented 4 months ago

Hi @paulius-petkus, it is very interesting that the same file upload is significantly faster when using the requests library directly. Thanks for the example, I will investigate it.

laurynas-convertapi commented 4 months ago

@paulius-petkus could you please try comparing the following examples on your system? I don't see a difference on mine, but maybe you'll get different results:

Your adjusted version with files:

import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()

with open(file_path, 'rb') as file:
    s = requests.Session()
    response = requests.post(api_url, files={'file': file})
    upload_io = response.json()
    print(upload_io['FileId'])

print("Upload time taken: %s seconds" % (time.time() - start_time))

Version with data (as library uses):

import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()

with open(file_path, 'rb') as file:
    s = requests.Session()
    headers = { 'Content-Disposition': "attachment; filename*=UTF-8''test.pdf" }
    response = s.post(api_url, data = file, headers = headers)
    upload_io = response.json()
    print(upload_io['FileId'])

print("Upload time taken: %s seconds" % (time.time() - start_time))

paulius-petkus commented 4 months ago

Hi, both of these solutions work fast for me. It takes ~2s to upload that 100MB file. It seems the problem occurs on Windows, because Tomas and I both use Windows and we both have this problem.

We are not familiar with Python code, but maybe the problem is with the chunks / headers created in the lib? Maybe we can use the simplest possible upload logic? The example in my previous comment (below the curl code) works fast.
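
For context on the headers mentioned here: the data-upload examples in this thread send a `Content-Disposition` header whose file name uses the `filename*=UTF-8''...` form. A minimal sketch of building such a header for an arbitrary file name follows. The helper name is hypothetical, and percent-encoding with `urllib.parse.quote` is an assumption for illustration, not a confirmed description of what the library does.

```python
from urllib.parse import quote


def content_disposition(filename):
    """Build an attachment header with a percent-encoded UTF-8 file name."""
    encoded = quote(filename, safe="")  # encode everything outside the unreserved set
    return "attachment; filename*=UTF-8''%s" % encoded


print(content_disposition("test.pdf"))
print(content_disposition("my report š.pdf"))
```

A header like this is small and built once per upload, so on its own it is unlikely to account for a slowdown of tens of seconds.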

laurynas-convertapi commented 4 months ago

@paulius-petkus by chunks, do you mean download chunks? https://github.com/ConvertAPI/convertapi-python/blob/master/convertapi/client.py#L35

I thought the issue was with file upload. Can you confirm this code is slow on Windows?

start_time = time.time()
upload_io = convertapi.UploadIO(open(file, 'rb'))
print(upload_io.file_id)
print("Upload time taken: %s seconds" % (time.time() - start_time))

If so, let's try this; it mimics the logic of convertapi.UploadIO:

import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()

with open(file_path, 'rb') as file:
    s = requests.Session()
    s.headers.update({ 'User-Agent': 'ConvertAPI-Python/test' })
    s.verify = True
    headers = { 'Content-Disposition': "attachment; filename*=UTF-8''test.pdf" }
    response = s.post(api_url, data = file, headers = headers, timeout = 1800)
    upload_io = response.json()
    print(upload_io['FileId'])

print("Upload time taken: %s seconds" % (time.time() - start_time))

paulius-petkus commented 4 months ago

@laurynas-convertapi Yes, you are absolutely right, the problem is with upload. I overlooked the client lib, sorry for the confusion.

Yes, I confirm that the first code block is slow. It takes ~50s to upload my 100MB test file. The second code block is equally slow. I have debugged it, and it seems that "timeout" is causing the issue.

Without timeout upload is fast:

import requests
import time

file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
#file_path = 'C:/Users/petku/Desktop/TEMP/sdf.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()

with open(file_path, 'rb') as file:
    s = requests.Session()
    response = requests.post(api_url, files={'file': file})
    upload_io = response.json()
    print(upload_io['FileId'])

print("Upload time taken: %s seconds" % (time.time() - start_time))

[image]

And whole conversion (secret changed):

import requests
import time
import convertapi
import tempfile

api_secret = 'XXXXXXX'

start_time = time.time()
file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
api_url = 'https://v2.convertapi.com/upload'

with open(file_path, 'rb') as file:
    response = requests.post(api_url, files={'file': file}, headers={'Authorization': f'Bearer {api_secret}'})
    upload_io = response.json()

print("Upload time taken: %s seconds" % (time.time() - start_time))

convertapi.api_secret = api_secret
start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io['Url']}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))

print("The PDF saved to %s" % saved_files)

[image]

laurynas-convertapi commented 4 months ago

@paulius-petkus very interesting finding that timeout param is causing the slowdown!

Which requests library version are you using? If it is not the latest, could you check whether the problem reproduces on the latest version, 2.32.3?

Also, could you please check if specifying a timeout tuple (connect timeout, read timeout) makes any difference:

response = s.post(api_url, data = file, headers = headers, timeout = (3,1800))

And, just out of curiosity, does a smaller timeout make any difference:

response = s.post(api_url, data = file, headers = headers, timeout = 60)

paulius-petkus commented 4 months ago

@laurynas-convertapi requests version is 2.32.3.

Checked with the tuple and with the small timeout: both scenarios are equally slow. Here are the output results: [image]
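
For reference, the timeout forms tried in this thread differ only in how they split into a connect timeout and a read timeout. The requests documentation states that a single number applies to both phases, a tuple is `(connect, read)`, and `None` waits indefinitely. The sketch below restates that rule; `split_timeout` is a hypothetical helper, not a requests function.

```python
def split_timeout(timeout):
    """Return (connect, read) following the documented meaning of `timeout`."""
    if timeout is None:
        return (None, None)      # no limit on either phase
    if isinstance(timeout, tuple):
        connect, read = timeout  # e.g. (3, 1800)
        return (connect, read)
    return (timeout, timeout)    # a single number applies to both phases


for value in (1800, (3, 1800), 60, None):
    print(value, "->", split_timeout(value))
```

Since 1800, (3, 1800) and 60 were all slow while no timeout was fast, the results reported here suggest the slowdown depends on whether any timeout is set rather than on its value.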

laurynas-convertapi commented 4 months ago

@paulius-petkus sad to see it doesn't help. Apparently there is some bug in the underlying libraries on Windows that causes the slowdown when a timeout is specified.

Could you please try one more thing: just use the standard convertapi Python library and set the upload timeout to None:

convertapi.upload_timeout = None

I guess it should help. If it works, we could consider changing upload_timeout default to None.

paulius-petkus commented 4 months ago

@laurynas-convertapi yes, setting the timeout to None improved the results as expected.

Results for the same 100MB file conversion: [image]

tomasr78 commented 3 months ago

@laurynas-convertapi, what is the status of this issue?

laurynas-convertapi commented 1 month ago

@tomasr78 we will change the default upload and download timeouts to None: https://github.com/ConvertAPI/convertapi-library-python/pull/51

laurynas-convertapi commented 1 month ago

Released in version 2.0.0: https://github.com/ConvertAPI/convertapi-library-python/releases/tag/v2.0.0