clowder-framework / extractors-s2orc-pdf2text

Extractor to convert pdf to text
Apache License 2.0

19 sentence coordinates json file #21

Closed. minump closed this 5 months ago

minump commented 8 months ago

tei_to_json.py now includes sentence coordinates in the JSON file. The main change is in the process_paragraph method, which now returns a text dictionary of the form {text: [{sentence: str, coords: str}], cite_spans: List, ref_spans: List, eq_spans: List, section: List}.
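For illustration, the returned dictionary might be assembled as in the sketch below. The helper name `build_paragraph`, the sample sentences, and the coordinate strings are assumptions made for this example; they are not taken from tei_to_json.py.

```python
from typing import Dict, List


def build_paragraph(sentences: List[str], coords: List[str], section: List[str]) -> Dict:
    """Hypothetical helper: pair each sentence with its coordinate string."""
    if len(sentences) != len(coords):
        raise ValueError("each sentence needs exactly one coords string")
    return {
        # One {sentence, coords} entry per sentence, in document order.
        "text": [{"sentence": s, "coords": c} for s, c in zip(sentences, coords)],
        "cite_spans": [],
        "ref_spans": [],
        "eq_spans": [],
        "section": section,
    }


paragraph = build_paragraph(
    ["First sentence.", "Second sentence."],
    ["1,72.0,100.0,200.0,10.0", "1,72.0,112.0,180.0,10.0"],  # made-up placeholder values
    ["Introduction"],
)
```

Keeping `coords` next to each `sentence` means downstream code can read both from the same entry without a separate lookup.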

The JSON file is then read and converted to a df with one row per sentence, new_row = {'file': input_file, 'section': para['section'], 'sentence': s['sentence'], 'prev_sentence': '', 'next_sentence': '', 'tokenized_sentence': tokenized_sentence, 'coordinates': s['coords']}, and the df is written to a CSV file.
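A rough sketch of that JSON-to-CSV step follows. It uses the standard-library `csv` module instead of a DataFrame to stay dependency-free. It also assumes the JSON holds a top-level list of paragraph dictionaries and that tokenization is a plain whitespace split. The function names `paragraphs_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import json

COLUMNS = ["file", "section", "sentence", "prev_sentence",
           "next_sentence", "tokenized_sentence", "coordinates"]


def paragraphs_to_rows(input_file, paragraphs):
    """Flatten paragraph dicts into one row per sentence."""
    rows = []
    for para in paragraphs:
        for s in para["text"]:
            rows.append({
                "file": input_file,
                "section": para["section"],
                "sentence": s["sentence"],
                "prev_sentence": "",
                "next_sentence": "",
                # Assumption: whitespace tokenization stands in for the real tokenizer.
                "tokenized_sentence": s["sentence"].split(),
                "coordinates": s["coords"],
            })
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Assumed JSON layout: a list of paragraph dicts (sample data is made up).
sample_json = json.dumps([{
    "text": [
        {"sentence": "First sentence.", "coords": "1,72.0,100.0,200.0,10.0"},
        {"sentence": "Second sentence.", "coords": "1,72.0,112.0,180.0,10.0"},
    ],
    "cite_spans": [], "ref_spans": [], "eq_spans": [],
    "section": ["Introduction"],
}])

rows = paragraphs_to_rows("paper.pdf", json.loads(sample_json))
csv_text = rows_to_csv(rows)
```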

Fixes #19 #20

minump commented 8 months ago

Published to hub.ncsa.illinois.edu/clowder/extractors-pdf2text:0.8.0. Works fine in the deployed instance.

minump commented 7 months ago

Drop the columns 'prev_sentence', 'next_sentence', and 'tokenized_sentence' from the df. There is no need to write them to the CSV file.

minump commented 7 months ago

Change the extractor info to reflect that it uploads a CSV file to Clowder.

minump commented 6 months ago

> Drop the columns 'prev_sentence', 'next_sentence', and 'tokenized_sentence' from the df. There is no need to write them to the CSV file.

Done. The only columns are now 'file', 'section', 'sentence', and 'coordinates'.
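As a hedged illustration of the trimmed output, the sketch below keeps only the four remaining columns before writing. It works on plain dictionaries with the standard `csv` module rather than a DataFrame. The helper name `keep_columns` and the sample row are assumptions.

```python
import csv
import io

KEPT_COLUMNS = ["file", "section", "sentence", "coordinates"]


def keep_columns(rows, columns=KEPT_COLUMNS):
    """Return copies of the rows restricted to the given columns."""
    return [{name: row[name] for name in columns} for row in rows]


# Made-up row in the earlier seven-column shape.
full_rows = [{
    "file": "paper.pdf",
    "section": "Introduction",
    "sentence": "First sentence.",
    "prev_sentence": "",
    "next_sentence": "",
    "tokenized_sentence": ["First", "sentence."],
    "coordinates": "1,72.0,100.0,200.0,10.0",
}]

trimmed = keep_columns(full_rows)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=KEPT_COLUMNS)
writer.writeheader()
writer.writerows(trimmed)
csv_text = buffer.getvalue()
```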

minump commented 6 months ago

> Change the extractor info to reflect that it uploads a CSV file to Clowder.

Done. The extractor info has been changed.

minump commented 6 months ago

Pushed and deployed as extractors-pdf2text:0.8.2. Works fine.

minump commented 5 months ago

Merging to main.