DS3Lab / WordScape

The WordScape repository contains code for the WordScape pipeline to create datasets to train document understanding models.
Apache License 2.0

Need help with personal docx files #1

Open mattolson93 opened 10 months ago

mattolson93 commented 10 months ago

Hello, great poster at NeurIPS and it was good to meet you all!

I have some custom docx files (PDFs that I converted to docx with Adobe) that I am trying to extract text from. I am able to get the Docker image up and running, and I've modified run_single_node.sh to run just the annotation step on my_docxs.tar.gz in the data folder. The script seems to execute, but I don't see anything in the failed or extracted text directories. What am I doing wrong? I've pasted the whole log below. I've also tried a tar containing just a simple docx with random text in it, to verify that my converted files aren't causing the issue.

Lastly, a working demo that runs on a personal set of docx files would be very helpful for debugging.

Thanks, Matt Olson

[2023-12-21 21:21:36,464]::MainProcess          ::INFO::source_tars: [PosixPath('data/paper.tar.gz'), PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:36,471]::MainProcess          ::INFO::args: {'data_dir': 'data', 'output_dir': './data/out', 'input_files': None, 'crawl_id': 'test', 'max_docs': -1, 'soffice_executable': 'soffice', 'config': 'configs/default_config.yaml', 'job_id': None}
[2023-12-21 21:21:36,475]::MainProcess          ::INFO::results_dir: data/out
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::annotations_dir: data/out/multimodal
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::meta_dir: data/out/meta
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::text_dir: data/out/text
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::failed_dir: data/out/failed
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::num_annotators: 2
[2023-12-21 21:21:36,482]::MainProcess          ::INFO::max_docs_per_process: -1
[2023-12-21 21:21:36,486]::AnnotationMonitor-2  ::INFO::Start monitoring...
[2023-12-21 21:21:41,273]::MainProcess          ::INFO::soffice(PID=104) started @ localhost:38357
[2023-12-21 21:21:41,276]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:41,277]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper.tar.gz')]
[2023-12-21 21:21:45,824]::MainProcess          ::INFO::soffice(PID=178) started @ localhost:58509
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:45,838]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 start processing data/paper2.tar.gz.
[2023-12-21 21:21:45,837]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 start processing data/paper.tar.gz.
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmpyo4oefe3'
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 finished. Shutting down.
[2023-12-21 21:21:45,860]::AnnotatorProcess-3   ::INFO::shutting down soffice process with pid 104
[2023-12-21 21:21:45,944]::AnnotatorProcess-4   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmp3veibb1j'
[2023-12-21 21:21:45,945]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 finished. Shutting down.
[2023-12-21 21:21:45,947]::AnnotatorProcess-4   ::INFO::shutting down soffice process with pid 178
[2023-12-21 21:21:46,892]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 done.
[2023-12-21 21:21:46,971]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 done.
[2023-12-21 21:21:46,980]::AnnotationMonitor-2  ::INFO::AnnotationMonitor done.
[2023-12-21 21:21:46,991]::MainProcess          ::INFO::annotation done.
[2023-12-21 21:21:46,992]::MainProcess          ::INFO::total time: 0:00:10.557422
zhangzhiyang-2020 commented 7 months ago

I ran into the same error: (self.run) FileNotFoundError: [Errno 2] No such file or directory ... ... Did you solve this problem? Thanks in advance!

mattolson93 commented 7 months ago

No, sorry :( I gave up on using docs
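
For anyone landing here with the same log: both failing paths in the log above sit under /usr/app/data/tmp, and that error shape is what Python's `tempfile.mkdtemp` produces when the parent directory passed as `dir` does not exist. Whether WordScape actually creates its scratch directories this way is an assumption drawn only from the paths in the log, not from the repository code. The sketch below reproduces the message in isolation and shows that creating the parent directory first avoids it. `make_scratch_dir` is a hypothetical helper written for this illustration, not part of WordScape.

```python
import os
import tempfile


def make_scratch_dir(parent: str, create_parent: bool = False) -> str:
    """Create a temporary directory under `parent` and return its path.

    Hypothetical helper for illustration only; not part of WordScape.
    """
    if create_parent:
        # Create the parent (and any missing ancestors) if it is absent.
        os.makedirs(parent, exist_ok=True)
    # Raises FileNotFoundError if `parent` does not exist.
    return tempfile.mkdtemp(dir=parent)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Stand-in for /usr/app/data/tmp from the log (assumed layout).
        parent = os.path.join(root, "data", "tmp")
        try:
            make_scratch_dir(parent)
        except FileNotFoundError as err:
            print("reproduced:", err)
        print("created:", make_scratch_dir(parent, create_parent=True))
```

If the same cause applies, creating a `tmp` folder inside the mounted data folder before running the script might be worth trying, though this is untested against the actual pipeline.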