Open darrayes opened 3 months ago
Hi @darrayes
Can you provide the code snippet that produces this behavior?
import json
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

os.environ["UNSTRUCTURED_API_KEY"] = "*****"
os.environ["UNSTRUCTURED_API_URL"] = "https://api.unstructured.io/general/v0/general"

client = UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

input_filepath = "test-arabic-llm-cleaned.pdf"
output_filepath = "text-file.txt"

# Read the PDF as bytes and wrap it for upload.
with open(input_filepath, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=input_filepath,
    )

# OCR-only partitioning with Arabic as the OCR language, chunked by title.
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=files,
        strategy=shared.Strategy.OCR_ONLY,
        languages=["ara"],
        chunking_strategy=shared.ChunkingStrategy.BY_TITLE,
        overlap=100,
    )
)

try:
    res = client.general.partition(request=req)
    element_dicts = [element for element in res.elements]
    json_elements = json.dumps(element_dicts, indent=2)
    # Print the processed data.
    print(json_elements)
    # Write the processed data to a local file.
    with open(output_filepath, "w") as file:
        file.write(json_elements)
except Exception as e:
    print(e)
I am trying to parse an Arabic PDF; however, I observed that it skips certain lines of the text. I am attaching a page from the PDF, with the text lines that were missed highlighted in boxes.
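To narrow down which lines are dropped, it may help to compare the returned elements against a few lines typed by hand from the page. The following is only a rough sketch: it assumes each element is a dict with a `text` key and a `metadata.page_number` key, as in the JSON printed by the snippet above, and the helper names `text_by_page` and `missing_lines` are hypothetical, not part of `unstructured_client`. The match is a plain substring check after whitespace normalization, so OCR differences in diacritics or letter forms will also show up as "missing".

```python
from collections import defaultdict


def text_by_page(elements):
    """Group non-empty element text by metadata.page_number (None if absent)."""
    pages = defaultdict(list)
    for el in elements:
        page = (el.get("metadata") or {}).get("page_number")
        text = (el.get("text") or "").strip()
        if text:
            pages[page].append(text)
    return dict(pages)


def missing_lines(expected_lines, elements):
    """Return the expected lines that do not occur in any extracted text.

    Whitespace is collapsed on both sides before the substring check.
    """
    extracted = " ".join(
        " ".join((el.get("text") or "").split()) for el in elements
    )
    return [
        line for line in expected_lines
        if " ".join(line.split()) not in extracted
    ]


# Hypothetical sample standing in for `element_dicts` from the API response.
sample_elements = [
    {"text": "السطر الأول", "metadata": {"page_number": 1}},
    {"text": "السطر  الثالث", "metadata": {"page_number": 1}},
]
# Lines typed by hand from the PDF page (also hypothetical).
expected = ["السطر الأول", "السطر الثاني", "السطر الثالث"]

for line in missing_lines(expected, sample_elements):
    print("missing:", line)
```

Running the same check on the output of a second request without `chunking_strategy` could show whether the lines are lost during OCR or during chunking, though that is a guess about where the problem lies rather than a confirmed cause.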