dataforgoodfr / batch7_rse

A search engine for French corporate societal and environmental commitments and actions.
http://dataforgood.fr/batch7_rse/
MIT License

PDF parsing problem #28

Closed Hugo-GEE closed 4 years ago

Hugo-GEE commented 4 years ago

Hello everyone. I ran into a problem when running `python3 main.py` on the latest version of the master branch. All my installations are fine, but @CharlesGaydon and I hit almost exactly the same error. The first commit at which I see the parsing problem is 469de4fe56396153ea12e05418269e7b0a0808a3 (master branch).

`python main.py

Begin Initialization. Multiprocessing with 3 cores 0%| | 0/16 [00:00<?, ?it/s]Start for total [total_2018_ddr.pdf] Start for orano [orano_2018_ddr.pdf] Start for edf [edf_2018_ddr.pdf]

orano

orano End for orano [orano_2018_ddr.pdf] - took -5 seconds Start for engie [engie_2018_ddr.pdf]

engie End for engie [engie_2018_ddr.pdf] - took -21 seconds Start for casino [casino_2018_dpef.pdf]

edf End for edf [edf_2018_ddr.pdf] - took -71 seconds End for total [total_2018_ddr.pdf] - took -72 seconds

casino

casino End for casino [casino_2018_dpef.pdf] - took -19 seconds 6%|█████▌ | 1/16 [01:45<26:16, 105.12s/it]Start for carrefour [carrefour_2018_ddr.pdf] Start for auchanholding [auchanholding_2018_ddr.pdf] 12%|███████████▎ | 2/16 [02:02<18:21, 78.65s/it]Start for scaouest [scaouest_2018_dpef.pdf]

auchanholding End for auchanholding [auchanholding_2018_ddr.pdf] - took -23 seconds

scaouest

scaouest End for scaouest [scaouest_2018_dpef.pdf] - took -23 seconds Start for vinci [vinci_2018_ddr.pdf] Start for bouyguesconstruction [bouyguesconstruction_2018_ddr.pdf]

bouyguesconstruction End for bouyguesconstruction [bouyguesconstruction_2018_ddr.pdf] - took -9 seconds Start for saintgobain [saintgobain_2018_ddr.pdf]

saintgobain End for saintgobain [saintgobain_2018_ddr.pdf] - took -14 seconds

vinci End for vinci [vinci_2018_ddr.pdf] - took -42 seconds Start for eiffage [eiffage_2018_ddr.pdf]

carrefour End for carrefour [carrefour_2018_ddr.pdf] - took -95 seconds Start for lvmh [lvmh_2018_rse.pdf] 38%|█████████████████████████████████▊ | 6/16 [03:50<10:31, 63.16s/it]Start for total [total_2018_debug.pdf] Start for edf [edf_2018_debug.pdf] Start for michelin [michelin_2018_ddr.pdf]

lvmh

lvmh End for lvmh [lvmh_2018_rse.pdf] - took -11 seconds

eiffage End for eiffage [eiffage_2018_ddr.pdf] - took -38 seconds
81%|████████████████████████████████████████████████████████████████████████▎ | 13/16 [04:27<01:01, 20.55s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 447, in get_sentences_dataframe_from_pdf
    df_par = get_paragraphs_dataframe_from_pdf(dpef_path, companies_metadata_dict)
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 428, in get_paragraphs_dataframe_from_pdf
    df_par = parse_paragraphs_from_pdf(dpef_path, rse_ranges=rse_ranges)
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 298, in parse_paragraphs_from_pdf
    df_par = get_paragraphs_from_raw_content(df_par, idx_first_page)
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 168, in get_paragraphs_from_raw_content
    (page_id, xmin, , , , _) = device.rows[item_index]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "main.py", line 46, in <module>
    main()
  File "main.py", line 38, in main
    run_parser(config)
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 496, in run
    df_sents = get_sentences_from_all_pdfs(config)
  File "/Users/hugomuselli/DataForGood/batch7_rse/webapp/polls/rse_model/rse_watch/pdf_parser.py", line 474, in get_sentences_from_all_pdfs
    total=len(all_input_files)
  File "/Users/hugomuselli/.virtualenvs/rse_watch/lib/python3.7/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
IndexError: list index out of range`

CharlesGaydon commented 4 years ago

Indeed, it looks like some PDFs trigger edge cases. I'll take care of it as soon as I have finished working out how to populate the SQL database.

CharlesGaydon commented 4 years ago

Fixed in https://github.com/dataforgoodfr/batch7_rse/commit/2f22ba5a8912e394cadce41482b308372034ebfc : Added try except to catch invalid index in device.rows
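
For readers who land here with the same `IndexError`, here is a minimal sketch of that kind of guard. It is not the code from the commit above: `safe_get_row`, `collect_rows` and the six-field row layout are hypothetical names and assumptions chosen for illustration, and only the `device.rows[item_index]` lookup comes from the traceback.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed row layout, for illustration only: the traceback shows a 6-field
# unpacking of device.rows[item_index], but the real field names are unknown.
Row = Tuple[int, float, float, float, float, str]


def safe_get_row(rows: Sequence[Row], item_index: int) -> Optional[Row]:
    """Return rows[item_index], or None when the index is out of range."""
    try:
        return rows[item_index]
    except IndexError:
        # Edge case seen on some PDFs: the index points past the parsed rows.
        # Note that a negative index is still valid in Python and is not caught.
        return None


def collect_rows(
    rows: Sequence[Row], item_indices: Sequence[int]
) -> Tuple[List[Row], List[int]]:
    """Split the requested indices into rows found and indices skipped."""
    kept: List[Row] = []
    skipped: List[int] = []
    for item_index in item_indices:
        row = safe_get_row(rows, item_index)
        if row is None:
            skipped.append(item_index)  # keep a trace instead of crashing
            continue
        kept.append(row)
    return kept, skipped
```

Returning the skipped indices, rather than silently ignoring them, makes it possible to log which items of which PDF were dropped, so the underlying edge case can still be investigated later.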