Open Risho92 opened 11 months ago
If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.
Thanks
Actually, the data is confidential, so unfortunately I cannot share it. The PDF contained tables, hyperlinks, links, and lists.
I used the command below to extract text from a PDF using textractor.
I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV data extraction example from the page below.
https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter
I tried with both a multi-page PDF and a single-page PDF, but I always get this error.
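Since the file itself cannot be shared, one way to describe it without exposing its contents is to report only its shape. The sketch below uses just the standard library and assumes the file holds either a single Textract response (a dict with a top-level "Blocks" list) or a list of such responses; the list case and the helper names summarize_textract_json and summarize_file are assumptions for illustration, not part of any Textract library.

```python
import json
from collections import Counter


def summarize_textract_json(data):
    """Count BlockType values in a Textract-style payload.

    Accepts a single response dict or a list of response dicts
    (the list form is an assumption about paginated output).
    """
    responses = data if isinstance(data, list) else [data]
    counts = Counter()
    for response in responses:
        # A raw Textract response carries its content under "Blocks".
        if not isinstance(response, dict) or "Blocks" not in response:
            raise ValueError("no top-level 'Blocks' list found")
        for block in response["Blocks"]:
            counts[block.get("BlockType", "UNKNOWN")] += 1
    return counts


def summarize_file(path):
    """Load a JSON file from disk and summarize its block types."""
    with open(path, encoding="utf-8") as input_fp:
        return summarize_textract_json(json.load(input_fp))
```

Posting the resulting counts (for example, how many PAGE, LINE, TABLE, or LAYOUT_* blocks are present) reveals nothing about the document text, but it may show whether the file has the structure the parser expects.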
I uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json', and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this in production; I only tested it.
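For anyone who wants to narrow down the "'NoneType' object has no attribute 'ids'" failure, one guess (not confirmed against the library source) is that some LAYOUT_* block has no CHILD relationship, so a lookup for its children returns None. The sketch below checks the raw JSON for such blocks using only the standard dict structure of a Textract response; the function name layout_blocks_without_children is hypothetical.

```python
def layout_blocks_without_children(response):
    """Return (Id, BlockType) pairs for LAYOUT_* blocks lacking CHILD ids.

    Works on the raw response dict, before any parser is involved.
    """
    missing = []
    for block in response.get("Blocks", []):
        block_type = block.get("BlockType", "")
        if not block_type.startswith("LAYOUT_"):
            continue
        # "Relationships" may be absent or null on blocks with no children.
        relationships = block.get("Relationships") or []
        has_child = any(
            rel.get("Type") == "CHILD" and rel.get("Ids")
            for rel in relationships
        )
        if not has_child:
            missing.append((block.get("Id"), block_type))
    return missing
```

If this returns a non-empty list for 'analyzeDocResponse.json', the block types it reports would be a useful detail to add to the issue, since they contain no document text.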
The environment details are given below.
Operating System: Windows 11 Pro
Python Version: 3.10.12
amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3
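A version list like this can be collected programmatically, which avoids transcription mistakes when reporting. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_versions is hypothetical, and the package names are the ones listed in this issue.

```python
from importlib import metadata

# Package names taken from the environment details in this issue.
PACKAGES = [
    "amazon-textract-caller",
    "amazon-textract-pipeline-pagedimensions",
    "amazon-textract-prettyprinter",
    "amazon-textract-textractor",
    "amazon-textract-response-parser",
    "marshmallow",
    "textract-trp",
]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name} (not installed)")
```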
Any help resolving this error is highly appreciated.