IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

Intro example 1 #718

Closed sujee closed 1 month ago

sujee commented 1 month ago

Why are these changes needed?

This example showcases some of the useful transforms of DPK.

PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings
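
To make the two dedupe stages in that pipeline concrete, here is a minimal pure-Python sketch. It does not use DPK's transforms or reproduce their implementation: the function names (`exact_dedupe`, `fuzzy_dedupe`, `shingles`, `jaccard`), the 3-word shingles, and the 0.6 threshold are all illustrative assumptions. Exact dedupe is modeled as a hash comparison, and fuzzy dedupe as a Jaccard similarity check over word shingles.

```python
import hashlib
import re


def exact_dedupe(chunks):
    """Keep the first occurrence of each chunk, comparing by SHA-256 of the text."""
    seen, kept = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedupe(chunks, threshold=0.6):
    """Drop a chunk if it is at least `threshold` similar to one already kept."""
    kept, kept_shingles = [], []
    for chunk in chunks:
        current = shingles(chunk)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(current)
    return kept


# Hypothetical chunks: one exact duplicate and one near duplicate.
chunks = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "Embeddings map each chunk of text to a dense vector.",
]
after_exact = exact_dedupe(chunks)
after_fuzzy = fuzzy_dedupe(after_exact)
print(len(chunks), len(after_exact), len(after_fuzzy))  # 4 3 2
```

Exact dedupe only catches the byte-identical copy; the chunk that differs by one trailing word survives it and is removed only by the fuzzy stage. The pairwise comparison here is quadratic and is meant for illustration only.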

For reviewers

Related issue number (if any).

shahrokhDaijavad commented 1 month ago

@sujee This is a nice introductory example. I was able to run the Python version on Colab, but because fuzzy dedup is only available with Ray, I could not see whether fuzzy dedup actually reduces the number of chunks. Testing the Ray version, on the other hand, gives a "ray job failed" error in pdf2parquet (before it even reaches the doc id error in issue #719), so let's wait and see whether PR #721 fixes the Ray issues.

sujee commented 1 month ago

Thanks for reviewing @shahrokhDaijavad

1 - The Ray version erroring on the pdf2pq step is due to downloaded model cleanup, I believe: #667. Is there a fix in the works for this?

2 - Are we good on the location of this example: examples/notebooks/intro?

3 - Yes, fuzzy dedupe will remove a similar chunk. So that's nice to see :-)

shahrokhDaijavad commented 1 month ago

@sujee

  1. For the ray version, I see 3 errors:
     - #667: ERROR - Exception creating transform [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
     - ERROR - Exception during execution out of 2 created actors only 1 alive
     - ERROR - Exception during execution 'processing_time'
  2. The location of this intro example is good.
  3. Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

sujee commented 1 month ago

> 1. For the ray version, I see 3 errors. One is [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno

confirming:

1A. [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667. Possible fix: add MultiLock class #693

1B. [Bug] one of the created Ray actors die during docid transform #722

1C. Related to #722 above: [Bug] docid ray transformation errors when running on colab (release 0.2.2dev1) #719. Possible fix: Fix metadata logging even when actors crash #721
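
For context on 1A, the sketch below shows one general way to serialize a first-time model download across processes with a lock file. It is not the MultiLock class from #693, and the root cause of #667 is not confirmed in this thread; the sketch simply assumes several workers may try to download into the same directory at once. The names `ensure_downloaded`, `fake_fetch`, `.lock`, and `.complete` are hypothetical, and `fcntl` is POSIX-only.

```python
import fcntl
import os
import tempfile


def ensure_downloaded(model_dir, fetch):
    """Run fetch(model_dir) at most once, even with concurrent callers.

    Returns True if this call performed the download, False if it was
    already complete.
    """
    os.makedirs(model_dir, exist_ok=True)
    marker = os.path.join(model_dir, ".complete")
    lock_path = os.path.join(model_dir, ".lock")
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker):
                return False
            fetch(model_dir)
            # Write the marker only after the download has fully finished.
            open(marker, "w").close()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


# Hypothetical stand-in for a real model download.
calls = []


def fake_fetch(path):
    calls.append(path)
    with open(os.path.join(path, "model.bin"), "w") as f:
        f.write("weights")


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "models")
    first = ensure_downloaded(model_dir, fake_fetch)
    second = ensure_downloaded(model_dir, fake_fetch)

print(first, second, len(calls))  # True False 1
```

The second caller waits on the lock, sees the marker, and skips the download, so no process reads or deletes a temporary file that another process is still writing.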

> 3. Good to see the effectiveness of Fuzzy dedup. Does the ray version run successfully on the local machine?

Yes, it completes on my local dev environment :white_check_mark:

sujee commented 1 month ago

Updated using DPK release 0.2.1

Note: Once this is merged, I will do a follow-up PR to update the URLs so they point to the main repo.