Hugo-GEE closed this issue 4 years ago
Indeed, there seem to be edge cases with certain PDFs. I'll take care of it as soon as I've finished working out how to populate the SQL database.
Fixed in https://github.com/dataforgoodfr/batch7_rse/commit/2f22ba5a8912e394cadce41482b308372034ebfc : Added try except to catch invalid index in device.rows
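For readers who want to see what this kind of guard can look like, here is a minimal sketch. It is not the code from the commit above: the helper names `safe_row` and `page_ids_for_items` are hypothetical, and the six-field row layout is an assumption based only on the unpacking shown in the traceback.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical row layout: the traceback only shows that a row unpacks into six fields.
Row = Tuple[int, float, float, float, float, str]


def safe_row(rows: Sequence[Row], item_index: int) -> Optional[Row]:
    """Return rows[item_index], or None when the index is out of range."""
    try:
        return rows[item_index]
    except IndexError:
        # Edge case seen on some PDFs: the item index has no matching row.
        return None


def page_ids_for_items(rows: Sequence[Row], item_indices: Sequence[int]) -> List[int]:
    """Collect the page id of every item whose row exists, skipping invalid indices."""
    page_ids = []
    for item_index in item_indices:
        row = safe_row(rows, item_index)
        if row is None:
            continue  # skip the item instead of crashing the whole worker
        page_id, _xmin, _, _, _, _ = row
        page_ids.append(page_id)
    return page_ids
```

Skipping the item keeps the worker process alive, so one malformed PDF no longer aborts the whole multiprocessing run. Whether skipping, logging, or substituting a default is right depends on how the parser uses the row afterwards.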
Hello everyone. I ran into a problem when running the command python3 main.py on the latest version of the master branch. All my installations are fine, but @CharlesGaydon and I both hit more or less the same error. The first commit at which I get a parsing problem is 469de4fe56396153ea12e05418269e7b0a0808a3 (master branch).
`python main.py
Begin Initialization. Multiprocessing with 3 cores 0%| | 0/16 [00:00<?, ?it/s]Start for total [total_2018_ddr.pdf] Start for orano [orano_2018_ddr.pdf] Start for edf [edf_2018_ddr.pdf]
orano
orano End for orano [orano_2018_ddr.pdf] - took -5 seconds Start for engie [engie_2018_ddr.pdf]
engie End for engie [engie_2018_ddr.pdf] - took -21 seconds Start for casino [casino_2018_dpef.pdf]
edf End for edf [edf_2018_ddr.pdf] - took -71 seconds End for total [total_2018_ddr.pdf] - took -72 seconds
casino
casino End for casino [casino_2018_dpef.pdf] - took -19 seconds 6%|█████▌ | 1/16 [01:45<26:16, 105.12s/it]Start for carrefour [carrefour_2018_ddr.pdf] Start for auchanholding [auchanholding_2018_ddr.pdf] 12%|███████████▎ | 2/16 [02:02<18:21, 78.65s/it]Start for scaouest [scaouest_2018_dpef.pdf]
auchanholding End for auchanholding [auchanholding_2018_ddr.pdf] - took -23 seconds
scaouest
scaouest End for scaouest [scaouest_2018_dpef.pdf] - took -23 seconds Start for vinci [vinci_2018_ddr.pdf] Start for bouyguesconstruction [bouyguesconstruction_2018_ddr.pdf]
bouyguesconstruction End for bouyguesconstruction [bouyguesconstruction_2018_ddr.pdf] - took -9 seconds Start for saintgobain [saintgobain_2018_ddr.pdf]
saintgobain End for saintgobain [saintgobain_2018_ddr.pdf] - took -14 seconds
vinci End for vinci [vinci_2018_ddr.pdf] - took -42 seconds Start for eiffage [eiffage_2018_ddr.pdf]
carrefour End for carrefour [carrefour_2018_ddr.pdf] - took -95 seconds Start for lvmh [lvmh_2018_rse.pdf] 38%|█████████████████████████████████▊ | 6/16 [03:50<10:31, 63.16s/it]Start for total [total_2018_debug.pdf] Start for edf [edf_2018_debug.pdf] Start for michelin [michelin_2018_ddr.pdf]
lvmh
lvmh End for lvmh [lvmh_2018_rse.pdf] - took -11 seconds
eiffage End for eiffage [eiffage_2018_ddr.pdf] - took -38 seconds 81%|████████████████████████████████████████████████████████████████████████▎ | 13/16 [04:27<01:01, 20.55s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 447, in get_sentences_dataframe_from_pdf
df_par = get_paragraphs_dataframe_from_pdf(dpef_path, companies_metadata_dict)
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 428, in get_paragraphs_dataframe_from_pdf
df_par = parse_paragraphs_from_pdf(dpef_path, rse_ranges=rse_ranges)
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 298, in parse_paragraphs_from_pdf
df_par = get_paragraphs_from_raw_content(df_par, idx_first_page)
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 168, in get_paragraphs_from_raw_content
(page_id, xmin, , , , _) = device.rows[item_index]
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "main.py", line 46, in
main()
File "main.py", line 38, in main
run_parser(config)
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 496, in run
df_sents = get_sentences_from_all_pdfs(config)
File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 474, in get_sentences_from_all_pdfs
total=len(all_input_files)
File "/Users/hugomuselli/.virtualenvs/rse_watch/lib/python3.7/site-packages/tqdm/std.py", line 1127, in iter
for obj in iterable:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
IndexError: list index out of range`