Transkribus / TranskribusPyClient

A Pythonic API and some command line tools to access the Transkribus server via its REST API
GNU Lesser General Public License v3.0

Programmatic extraction of transcriptions #11

Open osherenko opened 1 year ago

osherenko commented 1 year ago

I wonder if extracting stored transcriptions for Transkribus images is possible using the Python client.

Is it possible to start OCR using a particular recognition model?

When running particular commands, I get an error. For example, after calling do_listHtrRnn.py, I get

raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: 404 for url: https://transkribus.eu/TrpServer/rest/recognition/htrModels. 
DRRV commented 1 year ago

It works on my side, but you need to provide some parameters for each command, such as the collection id (see --help):

do_listHtrRnn.py --colid XXX

For running an OCR model: do_htrRnn.py --docid DOCID

For downloading transcripts, you can use Transkribus_downloader.py

osherenko commented 1 year ago

If I run "do_listHtrRnn.py --colid=1219483", I get the same URL exception

Traceback (most recent call last):
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 115, in <module>
    doer.run(options.colid,options.dict)
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 75, in run
    sColModels = self.listRnns(colid)
  File "d:\python310\lib\site-packages\TranskribusPyClient\client.py", line 878, in listRnns
    resp.raise_for_status()
  File "d:\python310\lib\site-packages\requests\models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/1219483/list?prov=CITlab 

However, the command

python Transkribus_downloader.py

works just fine and downloads four files: JPG, pxml, max, and json.

DRRV commented 1 year ago

You need to log in and provide credentials (see https://github.com/Transkribus/TranskribusPyClient/wiki). If you set up a persistent login beforehand:

do_login.py --persist --login --pwd

then you need to add --persistent:

do_listHtrRnn.py --persistent --colid=1219483

osherenko commented 1 year ago

The output of do_login.py --persist --login --pwd

- Logging onto Transkribus as --pwd and making a persistent session
403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/auth/login

The output of do_listHtrRnn.py --persistent --colid=1219483

Usage: do_listHtrRnn.py

do_listHtrRnn.py: error: no such option: --persistent
DRRV commented 1 year ago

My bad, it is --persist.

By passing --help to a script you get the parameter list:

python src/TranskribusCommands/do_listHtrRnn.py --help
Usage: src/TranskribusCommands/do_listHtrRnn.py

List HTR RNN models and dictionaries available in Transkribus. Pass your login/password as options otherwise consider having a Transkribus_credential.py file, which defines a 'login' and a 'pwd' variables. If you need to use a proxy, use the --https_proxy option or set the environment variables HTTPS_PROXY. To use HTTP Basic Auth with your proxy, use the http://user:password@host/ syntax.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --colid=COLID         get models linked to the colid
  --dict                get dictionaries
  -s SERVER, --server=SERVER
                        Transkribus server URL
  -l LOGIN, --login=LOGIN
                        Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD     Transkribus password
  --persist             Try using an existing persistent session, or log-in and persists the session.
  --https_proxy=HTTPS_PROXY
                        proxy, e.g. http://cornillon:8000
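The help text mentions a credentials file that defines a 'login' and a 'pwd' variable. A minimal sketch of such a file might look like the following. Note that the help text spells the filename two ways (Transkribus_credential.py and transkribus_credentials.py), so which name the scripts actually import is not confirmed here, and the environment variable names are hypothetical.

```python
# Credentials module as described in the --help text: it only needs to
# define two module-level variables, 'login' and 'pwd'.
# The exact filename the scripts import is NOT confirmed (see note above).
import os

# Hypothetical environment variable names; reading from the environment
# avoids hard-coding a password in a file that might end up in version control.
login = os.environ.get("TRANSKRIBUS_LOGIN", "")
pwd = os.environ.get("TRANSKRIBUS_PWD", "")
```

Plain string literals would work just as well; the environment lookup is only one way to keep the password out of the repository.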

osherenko commented 1 year ago

It still doesn't work. If I call the following, I get an exception.

do_listHtrRnn.py -persist --colid=1219483
Traceback (most recent call last):
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 115, in <module>
    doer.run(options.colid,options.dict)
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 75, in run
    sColModels = self.listRnns(colid)
  File "d:\python310\lib\site-packages\TranskribusPyClient\client.py", line 878, in listRnns
    resp.raise_for_status()
  File "d:\python310\lib\site-packages\requests\models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/1219483/list?prov=CITlab

I am using transkribus_credentials.py.

DRRV commented 1 year ago

do_login.py --persist --login YOURLOGIN --pwd YOURPASSWORD
do_listHtrRnn.py --persist --colid=1219483

or

do_listHtrRnn.py --colid=1219483 --login YOURLOGIN --pwd YOURPASSWORD

osherenko commented 1 year ago

I get

D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands>d:\python39\python.exe do_login.py -persist -login xxx  -pwd xxx
- Checking Transkribus login as ogin
403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/auth/login
D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands>d:\python39\python.exe do_listHtrRnn.py -persist -colid=1219483
Usage: do_listHtrRnn.py

do_listHtrRnn.py: error: no such option: -c

D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands>d:\python39\python.exe do_listHtrRnn.py --colid=1219483 --login xxx --pwd xxx
Traceback (most recent call last):
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 115, in <module>
    doer.run(options.colid,options.dict)
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands\do_listHtrRnn.py", line 75, in run
    sColModels = self.listRnns(colid)
  File "D:\Downloads\TranskribusPyClient-master\src\TranskribusPyClient\client.py", line 878, in listRnns
    resp.raise_for_status()
  File "d:\python39\lib\site-packages\requests\models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/1219483/list?prov=CITlab
DRRV commented 1 year ago

Use --colid (two dashes), not a single '-'.

osherenko commented 1 year ago

D:\Downloads\TranskribusPyClient-master\src\TranskribusCommands>d:\python39\python.exe do_login.py -persist -login xxx -pwd xxx

DRRV commented 1 year ago

OK, a convention in many Python scripts: command line parameters with several letters need '--', and one-letter parameters need a single '-':

python.exe do_login.py --persist --login xxx --pwd xxx

or

python.exe do_login.py --persist -l xxx -p xxx

(Note: --persist takes two '-'.)

As long as you don't get the output below from do_login, you're not logged in properly:

python.exe do_login.py --persist -l xxx -p xxx
Logging onto Transkribus as xxx and making a persistent session --> .trnskrbs/session.txt Done

Again, use --help to see the command line syntax; it shows when '-' must be used and when '--' must be used:

python.exe do_login.py --help
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -s SERVER, --server=SERVER
                        Transkribus server URL
  -l LOGIN, --login=LOGIN
                        Transkribus login (consider storing your credentials in 'transkribus_credentials.py')
  -p PWD, --pwd=PWD     Transkribus password
  --persist             Try using an existing persistent session, or log-in and persists the session.
  --https_proxy=HTTPS_PROXY
                        proxy, e.g. http://cornillon:8000
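The dash convention can be reproduced with Python's standard optparse module. The sketch below is an assumption: it rebuilds only the three options listed in the help text, and that the real scripts use optparse is inferred from the "no such option" message format, not confirmed. It shows why single-dash spellings produce the odd logins seen earlier in this thread ("ogin", "--pwd").

```python
from optparse import OptionParser


def build_parser():
    # Stand-in parser with the options from the --help output above
    # (hypothetical reconstruction, not the project's actual code).
    parser = OptionParser(usage="do_login.py")
    parser.add_option("-l", "--login", dest="login", help="Transkribus login")
    parser.add_option("-p", "--pwd", dest="pwd", help="Transkribus password")
    parser.add_option("--persist", dest="persist", action="store_true", default=False)
    return parser


def parse(argv):
    options, args = build_parser().parse_args(argv)
    return options.login, options.pwd, options.persist, args


# Correct spelling: long options take two dashes.
print(parse(["--persist", "--login", "xxx", "--pwd", "xxx"]))
# ('xxx', 'xxx', True, [])

# Single dashes: '-persist' is read as '-p ersist', '-login' as '-l ogin',
# '-pwd' as '-p wd'; the two 'xxx' values become positional arguments.
print(parse(["-persist", "-login", "xxx", "-pwd", "xxx"]))
# ('ogin', 'wd', False, ['xxx', 'xxx'])

# Options without values: '--login' swallows the next token as its value.
print(parse(["--persist", "--login", "--pwd"]))
# ('--pwd', None, True, [])
```

By the same logic, -colid=1219483 is read as the short option -c, which does not exist, hence "no such option: -c".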

hackmanschorsch commented 1 year ago

Sorry for the late entry, but I only now realized that prov=CITlab may cause the problem, as it is no longer supported, and this endpoint is no longer used. Please use this instead:

https://transkribus.eu/TrpServer/rest/models/text?prov=PyLaia

You get the details for a model with:

https://transkribus.eu/TrpServer/rest/models/text/1234
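For anyone calling these endpoints directly, here is a minimal sketch that builds the two URLs and fetches them with the requests library. The helper names are hypothetical, and the code assumes you already hold an authenticated requests.Session (how the session is obtained, and the response format, are not covered in this thread).

```python
from urllib.parse import urlencode

BASE_URL = "https://transkribus.eu/TrpServer/rest"


def models_list_url(provider="PyLaia", base_url=BASE_URL):
    # URL listing the text recognition models for a given provider.
    return f"{base_url}/models/text?{urlencode({'prov': provider})}"


def model_details_url(model_id, base_url=BASE_URL):
    # URL returning the details of a single model.
    return f"{base_url}/models/text/{int(model_id)}"


def fetch(url, session):
    # 'session' is assumed to be an already authenticated requests.Session.
    # The body is returned as raw text because the response format is not
    # stated in this thread.
    resp = session.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text


print(models_list_url())
print(model_details_url(1234))
```

With a logged-in session, fetch(models_list_url(), session) would retrieve the model list; a 403 here would again point to missing credentials or rights.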

osherenko commented 1 year ago

Thanks! It really makes sense.

I am actually testing TranskribusPyClient and it calls particular REST functions. How can I switch to PyLaia in TranskribusPyClient? Or should I use PyLaia instead?

DRRV commented 1 year ago

The prov=CITlab issue was fixed in the last commit (12 days ago), when HTR+ was disabled. Just pull the latest version.

osherenko commented 1 year ago

I pulled the latest version. If I run

d:\Python310\python.exe TranskribusPyClient\src\TranskribusCommands\do_listHtrRnn.py --colid=169748

I get the correct output. If I run

d:\Python310\python.exe TranskribusPyClient\src\TranskribusCommands\do_listHtrRnn.py --colid=715112

I get an exception

Traceback (most recent call last):
  File "E:\Git\TranskribusPyClient\src\TranskribusCommands\do_listHtrRnn.py", line 117, in <module>
    doer.run(options.colid,options.dict)
  File "E:\Git\TranskribusPyClient\src\TranskribusCommands\do_listHtrRnn.py", line 75, in run
    sColModels = self.listRnns(colid)
  File "E:\Git\TranskribusPyClient\src\TranskribusPyClient\client.py", line 879, in listRnns
    resp.raise_for_status()
  File "d:\Python310\lib\site-packages\requests\models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/recognition/715112/list

As you see the only difference in both calls is the collection number.

DRRV commented 1 year ago

403 is a permission error: you must provide credentials (-l and -p) or add --persist:

do_listHtrRnn.py --colid=169748 --persist

(I assume you have access to collection 715112.)

It is strange that do_listHtrRnn.py --colid=169748 works without credentials.
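To make the status codes seen in this thread easier to tell apart (404 on the retired endpoint, 403 on the collection), a small helper like the one below can turn a code into a hint. This is only a sketch: the function name and the hint texts are my own wording, not part of TranskribusPyClient.

```python
from http import HTTPStatus

# Hints are illustrative wording, not messages produced by the server.
HINTS = {
    HTTPStatus.UNAUTHORIZED: "not authenticated: log in first",
    HTTPStatus.FORBIDDEN: "refused: credentials missing or no rights on this resource",
    HTTPStatus.NOT_FOUND: "endpoint or resource does not exist (e.g. a retired REST path)",
}


def explain_status(status_code):
    # Translate a numeric HTTP status into 'code phrase: hint'.
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        return f"{status_code}: unknown status code"
    hint = HINTS.get(status, "no specific hint")
    return f"{status.value} {status.phrase}: {hint}"


print(explain_status(403))
print(explain_status(404))
```

When catching requests.exceptions.HTTPError raised by raise_for_status(), the code is available as e.response.status_code and can be passed to explain_status.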

osherenko commented 1 year ago

Weird! This script lists the documents in the first collection, crashes when it tries to list documents in another collection, and lists documents in the first collection once more. I don't know your code, but it doesn't seem to be a problem in the client code. Could you run the script and tell me what output you get?

from TranskribusPyClient import client

tc = client.TranskribusClient()
tc.auth_login("XXX", "XXX")

# First output: documents in the collection I created
print("1: my docs in collection: %s" % [t['title'] for t in tc.listDocsByCollectionId(colId=169748)])

try:
    print("docs in collection: %s" % [t['title'] for t in tc.listDocsByCollectionId(colId=715112)])
except Exception as e:
    # Client Error: 403 for url: https://transkribus.eu/TrpServer/rest/collections/715112/list
    print("Crash!!!", e)

# Second output: the first collection again, to show the session still works
print("2: my docs in collection: %s" % [t['title'] for t in tc.listDocsByCollectionId(colId=169748)])
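The same check can be generalized into a small helper that probes several collections and records which ones fail, without stopping at the first error. This is a sketch with a hypothetical function name; it takes the listing function as a parameter so it does not depend on any particular client signature.

```python
def probe_collections(list_docs, col_ids):
    # Call list_docs(col_id) for each id and collect either the document
    # titles or the error message, in the order given.
    report = []
    for col_id in col_ids:
        try:
            docs = list_docs(col_id)
        except Exception as exc:  # keep going: we want the full picture
            report.append((col_id, f"error: {exc}"))
        else:
            report.append((col_id, [doc["title"] for doc in docs]))
    return report


# Stand-in for a real client call, so the sketch runs without a server.
def fake_list_docs(col_id):
    if col_id == 715112:
        raise PermissionError("403 Client Error")
    return [{"title": "doc A"}, {"title": "doc B"}]


for col_id, result in probe_collections(fake_list_docs, [169748, 715112, 169748]):
    print(col_id, result)
```

With the real client, the callable could be lambda c: tc.listDocsByCollectionId(colId=c), using the call from the script above.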

DRRV commented 1 year ago

I don't have access rights to these collections, so I cannot try (I will get a 403 error). Are you sure you have access rights to 715112?

osherenko commented 1 year ago

After logging in to the Transkribus Expert Client with my credentials, I can see the 715112 collection and its transcription. So I assume that I have access rights to 715112. However, I am not the creator/owner of this collection; the creator/owner just allowed me to access it. How should I proceed?

DRRV commented 1 year ago

Have you tried with a collection you created? But if you can access the collection with Transkribus, it should work with the Python API.

osherenko commented 1 year ago

> Have you tried with one collection you created?

Yes, I am the creator of collection 169748 and everything works just fine.

> But if you can have access to the collection with Transkribus it should work with the python api

It should, but it doesn't work. Everything seems to work in the Transkribus Expert Client. Unfortunately, I can't compare the REST calls made by TranskribusPyClient with those made by the Transkribus Expert Client. Is this information stored in the log file of the Java client?
