Closed sujee closed 1 month ago
@sujee This is a nice introductory example. I was able to run the Python version on Colab, but because fuzzy dedup is only available with Ray, I cannot see whether fuzzy dedup actually reduces the number of chunks. Testing the Ray version, on the other hand, gives a "ray job failed" error in pdf2parquet (before it gets to the doc id error in issue #719), so let's wait and see if PR #721 fixes the Ray issues.
Thanks for reviewing @shahrokhDaijavad
1 - The Ray version erroring on the pdf2pq step is due to downloaded model cleanup, I believe: #667. Is there a fix in the works for this?
2 - Are we good on the location of this example: examples/notebooks/intro?
3 - Yes, fuzzy dedupe will remove a similar chunk. So that's nice to see :-)
@sujee
- For the Ray version, I see 3 errors. One is [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667: ERROR - Exception creating transform [Errno
Confirming:
1A. [Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667. Possible fix: add MultiLock class #693
1B. [Bug] one of the created Ray actors die during docid transform #722
1C. Related to #722 above: [Bug] docid ray transformation errors when running on colab (release 0.2.2dev1) #719. Possible fix: Fix metadata logging even when actors crash #721
- Good to see the effectiveness of fuzzy dedup. Does the Ray version run successfully on your local machine?
Yes, completes on local dev env :white_check_mark:
Updated using DPK release 0.2.1
Note: Once merged, I will do a follow-up PR to update the URLs to reflect the main repo.
Why are these changes needed?
This example showcases some of the useful transforms of DPK.
PDFs ---> text ---> chunks ---> exact dedupe ---> fuzzy dedupe ---> embeddings
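To make the two dedupe stages in the pipeline above concrete, here is a minimal sketch in plain Python. It is not the DPK API and does not reflect how the DPK transforms are implemented: the function names (`exact_dedupe`, `fuzzy_dedupe`, `word_shingles`, `jaccard`), the 3-word shingle size, and the 0.5 similarity threshold are all assumptions chosen for illustration. Exact dedupe drops chunks whose text is byte-for-byte identical; fuzzy dedupe additionally drops chunks that are only near-identical to one already kept.

```python
import hashlib
import re


def exact_dedupe(chunks):
    """Keep the first occurrence of each chunk, compared by SHA-256 of its text."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def word_shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedupe(chunks, threshold=0.5):
    """Drop any chunk whose shingle set is too similar to an already kept chunk.

    The threshold is an illustrative assumption, not a DPK default.
    """
    kept = []
    kept_shingles = []
    for chunk in chunks:
        shingles = word_shingles(chunk)
        if all(jaccard(shingles, other) < threshold for other in kept_shingles):
            kept.append(chunk)
            kept_shingles.append(shingles)
    return kept
```

Running `fuzzy_dedupe(exact_dedupe(chunks))` mirrors the order of the two stages in the pipeline. The pairwise comparison here is quadratic in the number of chunks, which is fine for a toy example but is the reason large-scale fuzzy dedup is typically done with hashing-based approximations instead.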
For reviewers
examples/notebooks/intro as of now
input/solar-system. I hope this is ok
Related issue number (if any).