IBM / data-prep-kit

Open source project for data preparation of LLM application builders
https://ibm.github.io/data-prep-kit/
Apache License 2.0

[Bug] dpk-connector silently fails if download destination directory does not exist #778

Closed sujee closed 2 weeks ago

sujee commented 2 weeks ago

Component

Other

What happened + What you expected to happen

When downloading files, if the download directory does not exist, the crawl fails silently. I expected either an error message or the directory to be created.

Recommendations:

Reproduction script

https://github.com/sujee/data-prep-kit/blob/html-processing-1/examples/notebooks/html-processing/1_download_site.ipynb

Anything else

data_prep_connector 0.2.2

OS

Ubuntu

Python

3.11.x

sujee commented 2 weeks ago

CC: @Qiragg

touma-I commented 2 weeks ago

@Qiragg @hmtbr is this really a bug, or is @sujee trying to use it in a way it is not intended for?

hmtbr commented 2 weeks ago

@touma-I It's not a bug in the connector library. The data-prep-connector has no access to storage, including local directories. If the description is accurate, the notebook itself has to be fixed.
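
As a rough sketch of what such a notebook fix could look like, the callback can create its destination directory before writing anything. The callback name `on_downloaded`, its `(url, body, headers)` parameters, the `make_saver` helper, and the file-naming scheme are all assumptions made for this illustration; they are not confirmed anywhere in this thread.

```python
import hashlib
from pathlib import Path


def make_saver(output_dir: str):
    """Return a callback that writes each downloaded body under output_dir.

    Hypothetical helper: the (url, body, headers) callback shape is assumed.
    """
    out = Path(output_dir)
    # Create the directory (and any missing parents) up front.
    # exist_ok=True makes this a no-op if the directory is already there.
    out.mkdir(parents=True, exist_ok=True)

    def on_downloaded(url: str, body: bytes, headers: dict) -> None:
        # Assumed naming scheme: hash the URL to get a filesystem-safe name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".html"
        (out / name).write_bytes(body)

    return on_downloaded
```

Creating the directory once, when the callback is built, means a bad path surfaces immediately instead of on every downloaded page.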

Qiragg commented 2 weeks ago

@touma-I It is not a bug; the connector is working as intended.

We leave the responsibility for processing and storing the acquired content to the user. The user has to design that logic and include whatever error handling it needs, which is what @sujee ran into in his example.

In a separate issue that @sujee raised (https://github.com/IBM/data-prep-kit/issues/777), I added an example of how to catch errors raised inside the user-defined callback function, which is the case we are dealing with here. I designed the original example, so I felt it necessary to add the error handling as well, so that the user can notice errors in the callback function. These errors do not arise from the core connector logic.
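
For readers who want a feel for this kind of error handling, here is a minimal sketch. It is not necessarily the example added in #777. The wrapper name `log_callback_errors`, the logger name, and the `(url, body, headers)` callback shape are assumptions made for illustration.

```python
import functools
import logging

logger = logging.getLogger("crawl_callback")  # hypothetical logger name


def log_callback_errors(callback):
    """Wrap a user callback so exceptions are logged and recorded, not lost.

    Assumes the callback takes (url, body, headers); adjust to the real shape.
    """
    errors = []

    @functools.wraps(callback)
    def wrapper(url, body, headers):
        try:
            callback(url, body, headers)
        except Exception as exc:
            # Log with traceback so a missing directory, permission problem,
            # or similar failure is visible to the user.
            logger.exception("callback failed for %s", url)
            errors.append((url, exc))

    # Expose collected failures so they can be inspected after the crawl.
    wrapper.errors = errors
    return wrapper
```

After the crawl, checking `wrapper.errors` shows which URLs failed in the callback and why, for example a `FileNotFoundError` when the output directory is missing.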

Happy to chat further if there are any unanswered questions or confusion.

touma-I commented 2 weeks ago

Closing this issue as the code is working as expected and a new transform is being developed to handle most of the I/O.