data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Prefix crawling is crawling complete bucket instead of specific folder #428

Closed mvidhu closed 7 months ago

mvidhu commented 1 year ago

Describe the bug

When we enable the prefix option in the start crawl section and provide a specific folder name to crawl, the crawler still creates many tables for the whole bucket instead of only the tables within the prefix.

E.g.: I want to crawl a specific folder `sample` inside my bucket, so I specify the prefix as `sample/`. I also tried multiple combinations like `/sample`, `sample`, and `/sample/`. None of the combinations worked; the crawler always creates tables for the entire bucket structure.

How to Reproduce

*P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.*

1. Import an S3 bucket containing many folders into data.all.
2. Start a crawler with the prefix toggle enabled and provide a folder name as the prefix.
3. Refresh after a few minutes: either the crawler creates tables for the whole bucket, or sometimes none at all.

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.11

AWS data.all version

0.5.0

Additional context

No response

dlpzx commented 1 year ago

@mvidhu Thanks for raising the issue, we are looking into it :)

dlpzx commented 1 year ago

After May 8th we will pick it up

zsaltys commented 8 months ago

@dlpzx I think the crawler needs a bit of a rethink... When creating crawlers there are a lot more options to specify in AWS. And they can be hard to get right so that your tables are detected correctly etc. For this reason alone we have them disabled completely on our end (though users still managed to find them deployed and tried to run them manually /facepalm)

I think what is needed:

1) Add more options for how the crawler should be set up.
2) Allow users to see logs of what the last crawler run did or what error it hit.
3) (OPTIONAL) Allow me to recrawl my dataset completely and remove any tables created before, because a crawler can be run incorrectly and not create what you expect. We need to be very careful here not to let users do that without double confirmation, and certainly not if they have existing shares on tables.
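On surfacing the last run's outcome, Glue already exposes it through boto3's `get_crawler` response (`response["Crawler"]["LastCrawl"]` carries `Status`, `ErrorMessage`, `LogGroup`, and `LogStream`). A minimal sketch, where the summarizing helper is hypothetical but the response shape follows the boto3 Glue API:

```python
def summarize_last_crawl(crawler: dict) -> str:
    """One-line summary of a crawler's last run, given
    crawler = glue_client.get_crawler(Name=...)["Crawler"]."""
    last = crawler.get("LastCrawl")
    if not last:
        return "never ran"
    status = last.get("Status", "UNKNOWN")
    logs = f"{last.get('LogGroup', '?')}/{last.get('LogStream', '?')}"
    summary = f"last run: {status} (logs: {logs})"
    if last.get("ErrorMessage"):
        summary += f", error: {last['ErrorMessage']}"
    return summary


# Example input shaped like a boto3 get_crawler response:
crawler = {
    "Name": "dataset-crawler",
    "LastCrawl": {
        "Status": "FAILED",
        "ErrorMessage": "Internal Service Exception",
        "LogGroup": "/aws-glue/crawlers",
        "LogStream": "dataset-crawler",
    },
}
print(summarize_last_crawl(crawler))
```

Crawler run logs themselves land in the `/aws-glue/crawlers` CloudWatch log group, so a UI could link there directly rather than re-ingesting logs.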

dlpzx commented 7 months ago

@mvidhu this item will be released in v2.4.0