IBM / data-prep-kit

Open source project for data preparation for LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] Testing Rag notebook with latest release of pdf2Parquet, eDedup and DocID #583

Open touma-I opened 2 months ago

touma-I commented 2 months ago

Component

Transforms/universal/doc_id, Transforms/universal/ededup, Transforms/Other, Other

What happened + What you expected to happen

  1. @dolfim-ibm When running the RAG notebook with the latest release of pdf2Parquet, the notebook crashes while downloading the model for the first time. Re-running the cell does not reproduce the error: if the model is already in the .EasyOCR folder, the error does not appear. Details of the error are in cell 6 of this notebook: https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.error.ipynb
  2. @sujee A few changes are needed for the notebook to work with the new release. Primarily:
       - replace launcher = RayTransformLauncher(EdedupRayTransformConfiguration()) with launcher = RayTransformLauncher(EdedupRayTransformRuntimeConfiguration())
       - replace launcher = RayTransformLauncher(DocIDRayTransformConfiguration()) with launcher = RayTransformLauncher(DocIDRayTransformRuntimeConfiguration())
       - replace output_df.sample(3) with output_df.sample(len(output_df))

    For a complete reference on the required changes, please see https://github.com/IBM/data-prep-kit/blob/t2/examples/notebooks/rag/rag_1A_dpk_process_ray.dev3.ipynb.
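
The substitutions listed above can be applied mechanically to a cell's source text. The following is a minimal sketch only: the old and new snippets come from this issue, while the `RENAMES` table and the `apply_renames` helper are hypothetical names introduced for illustration and are not part of data-prep-kit.

```python
# Old-to-new snippets copied from the list in this issue.
# RENAMES and apply_renames are hypothetical helpers, not a data-prep-kit API.
RENAMES = {
    "EdedupRayTransformConfiguration()": "EdedupRayTransformRuntimeConfiguration()",
    "DocIDRayTransformConfiguration()": "DocIDRayTransformRuntimeConfiguration()",
    "output_df.sample(3)": "output_df.sample(len(output_df))",
}


def apply_renames(source: str, renames: dict[str, str] = RENAMES) -> str:
    """Return source with every old snippet replaced by its new form."""
    for old, new in renames.items():
        source = source.replace(old, new)
    return source


cell = "launcher = RayTransformLauncher(EdedupRayTransformConfiguration())"
print(apply_renames(cell))
```

None of the old snippets occurs inside its replacement, so running the helper twice on the same text leaves the result unchanged.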

Reproduction script

data-prep-kit/examples/notebooks/rag/requirements.txt in the rag folder was modified to temporarily load the various modules from git. Once this issue is resolved or a workaround has been identified, I will create a dev3 release. For now, please use the git repo as follows:

git clone https://github.com/IBM/data-prep-kit.git t2
cd t2/examples/notebooks/rag && git checkout t2
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
./venv/bin/jupyter lab

From the browser, select and run the notebook rag_1A_dpk_process_ray.dev3.ipynb.

cc: @Shahrok

Anything else

@dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed the pdf2parquet dependency to docling>=1.7.0.

@dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.

OS

MacOS (limited support)

Python

3.11.x

dolfim-ibm commented 2 months ago

> @dolfim-ibm : deepsearch-toolkit 1.0.0 requires platformdirs<4.0.0,>=3.5.1, but the ray runtime prefers 4.3.2.

This was fixed yesterday. A new install should pick up deepsearch-toolkit 1.0.1 directly, which fixes it.

> @dolfim-ibm : I noticed that pdf2parquet depends on docling==1.7.0 and doc_chunk depends on docling>=1.8.2,<2.0.0. In the requirements for the notebook, I changed pdf2parquet dependency to docling>=1.7.0

Yes, I think it should be good to go with docling>=1.7.0,<2.0.0.
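
To make the version conflicts in this thread concrete, here is a small sketch that checks a version against a comma-separated requirement string. It is a simplified stand-in written for illustration, not pip's resolver: the `parse` and `satisfies` helpers are hypothetical, handle only plain numeric versions with the same number of parts, and support only the operators that appear in this issue.

```python
import operator
import re

# Comparison operators used by the requirement strings quoted in this issue.
OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def parse(version: str) -> tuple[int, ...]:
    """Turn '1.8.2' into (1, 8, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Return True if version meets every clause of a spec like '>=1.8.2,<2.0.0'."""
    candidate = parse(version)
    for clause in spec.split(","):
        match = re.fullmatch(r"\s*(==|>=|<=|>|<)\s*([\d.]+)\s*", clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        op, bound = match.groups()
        if not OPS[op](candidate, parse(bound)):
            return False
    return True


# docling: the only version allowed by pdf2parquet's pin is rejected by doc_chunk.
print(satisfies("1.7.0", "==1.7.0"), satisfies("1.7.0", ">=1.8.2,<2.0.0"))
# With the relaxed range, a version such as 1.8.2 meets both requirements.
print(satisfies("1.8.2", ">=1.7.0,<2.0.0"), satisfies("1.8.2", ">=1.8.2,<2.0.0"))
# platformdirs 4.3.2 falls outside what deepsearch-toolkit 1.0.0 accepts.
print(satisfies("4.3.2", "<4.0.0,>=3.5.1"))
```

For real dependency checks, a full PEP 440 implementation is the safer choice; this sketch only mirrors the three comparisons discussed above.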

dolfim-ibm commented 2 months ago

Regarding the models download, I'm able to reproduce it. Can you please try again with the latest version of the branch?

touma-I commented 2 months ago

@dolfim-ibm We still have the same problem even when using the latest release. Looking at the changes, I don't see how it would have addressed this problem. Please advise. Thanks

-        num_tables = len(doc.output.tables if doc.output.tables is not None else 0)
-        num_doc_elements = len(
-            doc.output.main_text if doc.output.main_text is not None else 0
-        )
+        num_tables = len(doc.output.tables) if doc.output.tables is not None else 0
+        num_doc_elements = len(doc.output.main_text) if doc.output.main_text is not None else 0
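
For readers following the diff above, the change only moves the None check outside the call to len(). A minimal sketch of the two forms, using hypothetical function names and plain lists in place of doc.output.tables:

```python
def count_before_fix(items):
    # Removed form: the conditional sits inside len(), so a None input
    # turns into len(0), which raises TypeError.
    return len(items if items is not None else 0)


def count_after_fix(items):
    # Added form: len() is only called when items is not None.
    return len(items) if items is not None else 0


print(count_after_fix(["table_a", "table_b"]))  # 2
print(count_after_fix(None))                    # 0

try:
    count_before_fix(None)
except TypeError as exc:
    print("removed form fails on None:", exc)
```

Both forms return the same count for a real list; they differ only when the value is None.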

sujee commented 2 months ago

This commit makes the required changes: https://github.com/sujee/data-prep-kit/commit/08024dc3b049ca69bf4ffa84352754867dbd3f79

Related : #585

dolfim-ibm commented 2 months ago

@sujee @touma-I I think this is now resolved, can you please confirm?

sujee commented 2 months ago

I have made the necessary changes on my branch and will submit a PR soon.