minump closed 5 months ago
Published to hub.ncsa.illinois.edu/clowder/extractors-pdf2text:0.8.0. Works fine in the deployed instance.
Drop the columns 'prev_sentence', 'next_sentence', and 'tokenized_sentence'
from the df. There is no need to write these columns to the csv file.
Change extractor info to reflect it uploads a csv file to clowder.
Drop the columns
'prev_sentence', 'next_sentence', and 'tokenized_sentence'
from the df. There is no need to write these columns to the csv file.
Done. The only columns are now 'file', 'section', 'sentence', and 'coordinates'.
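As a rough illustration of the column drop described above, here is a minimal sketch. The extractor itself works on a df; this sketch instead uses csv.DictWriter from the Python standard library to stay self-contained, so it is not the extractor's actual code. The names CSV_COLUMNS, write_rows, and sample_row, and all sample values, are hypothetical.

```python
import csv
import io

# The four columns kept in the csv, as listed in the reply above.
CSV_COLUMNS = ["file", "section", "sentence", "coordinates"]


def write_rows(rows, out):
    """Write row dicts to `out`, skipping any key not in CSV_COLUMNS."""
    # extrasaction="ignore" makes DictWriter drop extra keys such as
    # 'prev_sentence' instead of raising ValueError.
    writer = csv.DictWriter(out, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample row that still carries the three unwanted keys.
sample_row = {
    "file": "paper.pdf",
    "section": "Introduction",
    "sentence": "This is a sentence.",
    "prev_sentence": "",
    "next_sentence": "",
    "tokenized_sentence": "this is a sentence .",
    "coordinates": "1,50.0,100.0,200.0,10.0",
}

buffer = io.StringIO()
write_rows([sample_row], buffer)
csv_text = buffer.getvalue()
```

With a df, the equivalent step would presumably be a column drop before writing; the effect on the csv file is the same either way.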
Change extractor info to reflect it uploads a csv file to clowder.
Done. The extractor info is changed.
Pushed and deployed extractors-pdf2text:0.8.2. Works fine.
Merging to main.
The tei_to_json.py file is changed to include sentence coordinates in the json file. The main change is in the process_paragraph method, where the returned text dictionary has the format
{text: [ {sentence: str, coords: str} ], cite_spans: List, ref_spans: List, eq_spans: List, section: List}
The json file is then read and converted to a df with columns
new_row = {'file': input_file, 'section': para['section'], 'sentence': s['sentence'], 'prev_sentence': '', 'next_sentence': '', 'tokenized_sentence': tokenized_sentence, 'coordinates': s['coords']}
and written to a csv file.
Fixes #19 #20
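The json-to-rows step described above could look roughly like the sketch below. It assumes the json file holds a list of paragraph dictionaries in the format given above; the real top-level layout is not shown in this thread. The function name paragraphs_to_rows and the sample data are hypothetical, and the rows keep only the four columns that remain after the review discussion.

```python
import json


def paragraphs_to_rows(paragraphs, input_file):
    """Flatten paragraph dicts into one row per sentence."""
    rows = []
    for para in paragraphs:
        # Each entry of para["text"] pairs a sentence with its coordinates.
        for s in para["text"]:
            rows.append({
                "file": input_file,
                "section": para["section"],
                "sentence": s["sentence"],
                "coordinates": s["coords"],
            })
    return rows


# Made-up sample in the paragraph format described above.
sample_json = """
[
  {
    "text": [
      {"sentence": "First sentence.", "coords": "1,50.0,100.0,200.0,10.0"},
      {"sentence": "Second sentence.", "coords": "1,50.0,112.0,180.0,10.0"}
    ],
    "cite_spans": [],
    "ref_spans": [],
    "eq_spans": [],
    "section": ["Introduction"]
  }
]
"""

rows = paragraphs_to_rows(json.loads(sample_json), "paper.pdf")
```

Each resulting row corresponds to one line of the csv file, so a paragraph with two sentences yields two rows that share the same file and section values.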