Open touma-I opened 2 months ago
@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2 .
This was fixed just yesterday. A new install should use deepsearch-toolkit 1.0.1 directly, which fixes it.
@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0
Yes, I think it should be good to go with docling>=1.7.0,<2.0.0.
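To illustrate why the two pins clash and why the relaxed range works, here is a minimal sketch of a specifier check. It is a simplified stand-in, not how pip resolves dependencies: it assumes plain dotted numeric versions and only the operators that appear in this thread, and the helper names `parse` and `satisfies` are hypothetical.

```python
import operator

# Operators seen in this thread; two-character ones are listed first so that
# ">=" is not mistaken for ">".
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse(version):
    """Turn '1.8.2' into (1, 8, 2). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Return True if version meets every comma-separated clause in spec."""
    candidate = parse(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in OPS.items():
            if clause.startswith(symbol):
                if not compare(candidate, parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


# docling==1.7.0 (pdf2parquet) cannot coexist with >=1.8.2,<2.0.0 (doc_chunk).
pinned_conflicts = not satisfies("1.7.0", ">=1.8.2,<2.0.0")

# With pdf2parquet relaxed to >=1.7.0,<2.0.0, a version such as 1.8.2 fits both.
relaxed_ok = satisfies("1.8.2", ">=1.7.0,<2.0.0") and satisfies("1.8.2", ">=1.8.2,<2.0.0")

# platformdirs 4.3.2 falls outside the <4.0.0,>=3.5.1 range of deepsearch-toolkit 1.0.0.
platformdirs_conflicts = not satisfies("4.3.2", "<4.0.0,>=3.5.1")
```

Real version strings can carry pre-release and local segments that this sketch does not handle.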
Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?
@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how it would have addressed this problem. Please advise. Thanks
- num_tables = len(doc.output.tables if doc.output.tables is not None else 0)
- num_doc_elements = len(
- doc.output.main_text if doc.output.main_text is not None else 0
- )
+ num_tables = len(doc.output.tables) if doc.output.tables is not None else 0
+ num_doc_elements = len(doc.output.main_text) if doc.output.main_text is not None else 0
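The diff above moves the None check outside the `len()` call. A minimal sketch of the difference, using hypothetical helper names and a plain list in place of `doc.output.tables`:

```python
def count_buggy(items):
    # The conditional sits inside len(), so a None input becomes len(0),
    # which raises TypeError because an int has no length.
    return len(items if items is not None else 0)


def count_fixed(items):
    # The conditional wraps the whole expression, so None maps to 0.
    return len(items) if items is not None else 0


# Both versions agree when the value is present.
both_agree = count_buggy(["t1", "t2"]) == count_fixed(["t1", "t2"]) == 2

# Only the fixed version handles a missing value.
fixed_on_none = count_fixed(None)

try:
    count_buggy(None)
    buggy_raised = False
except TypeError:
    buggy_raised = True
```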
https://github.com/sujee/data-prep-kit/commit/08024dc3b049ca69bf4ffa84352754867dbd3f79
makes the required changes.
Related : #585
@sujee @touma-I I think this is now resolved, can you please confirm?
I have made the necessary changes on my branch. Will submit a PR soon
Search before asking
Component
Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other
What happened + What you expected to happen
@sujee There are a few changes that need to be made to the notebook for it to work with the new release. Primarily:
replace launcher = RayTransformLauncher(EdedupRayTransformConfiguration())
with launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
replace launcher = RayTransformLauncher(DocIDRayTransformConfiguration())
with launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
replace output_df.sample(3)
with output_df.sample(len(output_df))
For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.
Reproduction script
data-prep-kit/examples/notebooks/rag/requirement.txt in the rag folder was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:
From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb
cc: @Shahrok
Anything else
@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0
@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2 .
OS
MacOS (limited support)
Python
3.11.x
Are you willing to submit a PR?