Closed · paulius-petkus closed this issue 1 month ago
Hi @paulius-petkus,
could you please measure the individual steps (upload, API, download) to help us better understand where the major slowdown is?
You can measure like this:
import time
import tempfile
import convertapi

file = 'path/to/test.pdf'  # placeholder: path to the test file

start_time = time.time()
upload_io = convertapi.UploadIO(open(file, 'rb'))
print(upload_io.file_id)
print("Upload time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))
print("The PDF saved to %s" % saved_files)
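As a side note, time.perf_counter() is a bit better suited than time.time() for measuring intervals, since it is monotonic and has higher resolution; a minimal sketch (the sleep stands in for the upload/convert/download call):

```python
import time

start = time.perf_counter()
time.sleep(0.05)  # stand-in for the upload/convert/download call
elapsed = time.perf_counter() - start
print("Step took %.3f seconds" % elapsed)
```

Either timer will show the slowdown discussed here, so this is only a measurement nicety.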
@laurynas-convertapi Hello, sure.
It seems upload is the biggest problem:
@paulius-petkus thanks! Very helpful.
Can you try adding this to the top of the script and see if it helps to reduce the upload time:
from http.client import HTTPConnection

HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 64 * 1024
    for x in HTTPConnection.__init__.__defaults__
)
@laurynas-convertapi I have added it to the very beginning of the script, but it didn't help. Or should I add it somewhere else?
My current code snippet and results:
Yes, it should be at the top. Strange that it didn't have any effect, because I think the issue might be related to Python's httplib, as described here: https://github.com/psf/requests/issues/2181#issuecomment-713823366
Here is another suggestion, maybe this would work? https://stackoverflow.com/a/39518613
Hi @laurynas-convertapi, as far as I understand, you do not have these problems and conversion takes far less time for you? If the problem does not reproduce on your machine, I will investigate what is wrong on mine.
@paulius-petkus yes, I can't see a significant difference between uploading a file via command-line curl (tried on both Linux and Mac) and running the Python script (Python version 3.9.14).
For me a 100MB file upload takes ~15s on a 5G connection.
If you have an option to run curl, I suggest testing like this:
time curl -F 'file=@files/100mb.pdf' https://v2.convertapi.com/upload
It seems everything is fine when I run it with curl. Here is my code, run in the Windows command prompt:
@echo off
set start_time=%time%
curl -X POST https://v2.convertapi.com/upload?Secret=my_secret -F "File=@C:/Users/petku/Desktop/TEMP/large1.pdf"
set end_time=%time%
echo Start Time: %start_time%
echo End Time: %end_time%
It took a few seconds:
When I upload the document with requests directly instead of using our library, it is also fast (3-7s instead of 45-55s). My Python version: 3.12.1.
Here is full code with only secret changed:
import requests
import time
import convertapi
import tempfile

api_secret = 'XXXXXXXXX'

start_time = time.time()
file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
api_url = 'https://v2.convertapi.com/upload'
with open(file_path, 'rb') as file:
    response = requests.post(api_url, files={'file': file}, headers={'Authorization': f'Bearer {api_secret}'})
    upload_io = response.json()
print("Upload time taken: %s seconds" % (time.time() - start_time))

convertapi.api_secret = api_secret

start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io['Url']}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))
print("The PDF saved to %s" % saved_files)
Hi @paulius-petkus, very interesting that the same file upload using the requests library directly is significantly faster. Thanks for the example, I will investigate it.
@paulius-petkus could you please try comparing the following examples on your system? I don't see a difference on mine, but maybe you'll get different results:
Your adjusted version with files:
import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()
with open(file_path, 'rb') as file:
    s = requests.Session()
    response = s.post(api_url, files={'file': file})
    upload_io = response.json()
print(upload_io['FileId'])
print("Upload time taken: %s seconds" % (time.time() - start_time))
Version with data (as the library uses):
import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()
with open(file_path, 'rb') as file:
    s = requests.Session()
    headers = {'Content-Disposition': "attachment; filename*=UTF-8''test.pdf"}
    response = s.post(api_url, data=file, headers=headers)
    upload_io = response.json()
print(upload_io['FileId'])
print("Upload time taken: %s seconds" % (time.time() - start_time))
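To note the difference between the two variants (my understanding of requests' behavior): files= wraps the payload in multipart/form-data framing, while data= with a file object streams the raw bytes as the request body. This can be seen locally with prepared requests, no network needed:

```python
import requests

# files= builds a multipart/form-data body with a boundary
multipart = requests.Request(
    "POST", "http://example.invalid/upload",
    files={"file": ("test.pdf", b"%PDF-1.4 dummy")},
).prepare()
print(multipart.headers["Content-Type"])  # multipart/form-data; boundary=...

# data= sends the bytes as-is: requests adds no Content-Type
raw = requests.Request(
    "POST", "http://example.invalid/upload", data=b"%PDF-1.4 dummy",
).prepare()
print(raw.headers.get("Content-Type"))  # None
```

The slowdown discussed below turned out to be unrelated to this framing difference, but it explains why the library sets the Content-Disposition header itself in the data= variant.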
Hi, both of these solutions work fast for me. It takes ~2s to upload that 100MB file. It seems the problem occurs on Windows, because Tomas and I both use Windows and we both have this problem.
We are not that familiar with Python, but maybe the problem is with how the lib creates those chunks / headers? Maybe we could use the simplest possible upload logic? The example in my previous comment (below the curl code) works fast.
@paulius-petkus by chunks, do you mean download chunks? https://github.com/ConvertAPI/convertapi-python/blob/master/convertapi/client.py#L35
I thought the issue was with the file upload. Can you confirm this code is slow on Windows?
start_time = time.time()
upload_io = convertapi.UploadIO(open(file, 'rb'))
print(upload_io.file_id)
print("Upload time taken: %s seconds" % (time.time() - start_time))
And if yes, let's try this, which mimics the logic of convertapi.UploadIO:
import requests
import time

file_path = 'files/test100mb.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()
with open(file_path, 'rb') as file:
    s = requests.Session()
    s.headers.update({'User-Agent': 'ConvertAPI-Python/test'})
    s.verify = True
    headers = {'Content-Disposition': "attachment; filename*=UTF-8''test.pdf"}
    response = s.post(api_url, data=file, headers=headers, timeout=1800)
    upload_io = response.json()
print(upload_io['FileId'])
print("Upload time taken: %s seconds" % (time.time() - start_time))
@laurynas-convertapi Yes, you are absolutely right - the problem is with upload. I overlooked the client lib, sorry for the confusion.
Yes, I confirm that the first code block is slow: it takes ~50s to upload my 100MB test file. The second code block is just as slow. I have debugged it, and it seems that the timeout parameter is causing the issue.
Without a timeout, the upload is fast:
import requests
import time

file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
#file_path = 'C:/Users/petku/Desktop/TEMP/sdf.pdf'
api_url = 'https://v2.convertapi.com/upload'

start_time = time.time()
with open(file_path, 'rb') as file:
    s = requests.Session()
    response = s.post(api_url, files={'file': file})
    upload_io = response.json()
print(upload_io['FileId'])
print("Upload time taken: %s seconds" % (time.time() - start_time))
And the whole conversion (secret changed):
import requests
import time
import convertapi
import tempfile

api_secret = 'XXXXXXX'

start_time = time.time()
file_path = 'C:/Users/petku/Desktop/TEMP/large1.pdf'
api_url = 'https://v2.convertapi.com/upload'
with open(file_path, 'rb') as file:
    response = requests.post(api_url, files={'file': file}, headers={'Authorization': f'Bearer {api_secret}'})
    upload_io = response.json()
print("Upload time taken: %s seconds" % (time.time() - start_time))

convertapi.api_secret = api_secret

start_time = time.time()
result = convertapi.convert('squeeze', {'File': upload_io['Url']}, from_format='pdf')
print("API time taken: %s seconds" % (time.time() - start_time))

start_time = time.time()
saved_files = result.save_files(tempfile.gettempdir())
print("Download time taken: %s seconds" % (time.time() - start_time))
print("The PDF saved to %s" % saved_files)
@paulius-petkus very interesting finding that the timeout param is causing the slowdown!
Which requests library version are you using? If not the latest, maybe you could check whether you can reproduce the problem on the latest 2.32.3 version.
Also, could you please check if specifying a timeout tuple (connect timeout + read timeout) makes any difference:
response = s.post(api_url, data=file, headers=headers, timeout=(3, 1800))
And, just out of curiosity, does a smaller timeout make any difference:
response = s.post(api_url, data=file, headers=headers, timeout=60)
@laurynas-convertapi The requests version is 2.32.3.
Checked with the tuple and with the small timeout: both scenarios are equally slow. Here are the output results:
@paulius-petkus sad to see it doesn't help. Apparently there is some bug in the underlying libraries on Windows that causes the slowdown when a timeout is specified.
Could you please try one more thing - just use the standard convertapi Python library and set the upload timeout to None:
convertapi.upload_timeout = None
I guess it should help. If it works, we could consider changing the upload_timeout default to None.
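For context on why a timeout can matter at all (my working theory, not verified against CPython internals on Windows): calling settimeout() on a socket switches it to internal non-blocking mode, so every send of the upload body goes through an extra readiness wait, which seems to be where Windows loses time. A local sketch showing the mode switch, no network needed:

```python
import socket

s = socket.socket()
print(s.gettimeout())   # None: blocking mode, plain send() calls

# Setting a timeout (what happens when timeout= is passed to requests)
# puts the socket into internal non-blocking mode with a per-call wait.
s.settimeout(1800)
print(s.gettimeout())   # 1800.0
s.close()
```

With timeout=None the socket stays in plain blocking mode, which matches the fast behavior observed above.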
@laurynas-convertapi Yes, setting the timeout to None improved the results as expected.
The same 100MB file conversion results:
@laurynas-convertapi, what is the status of this issue?
@tomasr78 will change the default upload and download timeouts to None:
https://github.com/ConvertAPI/convertapi-library-python/pull/51
Released in v2.0.0: https://github.com/ConvertAPI/convertapi-library-python/releases/tag/v2.0.0
Tested PDF to Squeeze with a 90MB PDF file: https://www.convertapi.com/a/api/pdf-to-squeeze#snippet=python It took 4.5-5s to convert, plus a couple of seconds for the file upload and download, so <10s overall.
With the Python lib it took >60s: