Closed: mvidhu closed this issue 7 months ago
@mvidhu Thanks for raising the issue, we are looking into it :)
After May 8th we will pick it up
@dlpzx I think the crawler needs a bit of a rethink... When creating crawlers there are many more options to specify in AWS, and they can be hard to get right so that your tables are detected correctly. For this reason alone we have them disabled completely on our end (though users still managed to find them deployed and tried to run them manually /facepalm).
I think what is needed:
1) Add more options for how the crawler should be set up (a sketch of the kind of options meant follows this list).
2) Allow users to see logs of what the last crawler run did, or what error it got.
3) (OPTIONAL) Allow me to recrawl my dataset completely and remove any tables created before. This is because you can run a crawler with the wrong settings and it won't create what you expect. We need to be very careful here not to let users do that without double confirmation, and certainly not if they have existing shares on tables.
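For context on point 1, the underlying Glue API already exposes several of these options; a minimal boto3 sketch of creating a crawler scoped to a prefix, with exclusions and schema-change behaviour made explicit (the crawler name, role ARN, database, and bucket below are placeholders, not data.all's actual naming):

```python
import boto3

glue = boto3.client("glue")

# All names/ARNs here are hypothetical, for illustration only.
glue.create_crawler(
    Name="dataall-example-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="example_database",
    Targets={
        "S3Targets": [
            {
                # Scope the crawl to one prefix, not the whole bucket.
                "Path": "s3://example-bucket/sample/",
                # Skip temp/log objects that would otherwise become tables.
                "Exclusions": ["**/_temporary/**", "**/*.log"],
            }
        ]
    },
    # Options like these are what point 1 is asking to surface in the UI.
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"},
)
```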
@mvidhu this item will be released in v2.4.0
Describe the bug
When we enable the prefix in the start-crawl section and provide a specific folder name to crawl, the crawler still creates many tables instead of only the tables within the prefix.
E.g.: I want to crawl a specific folder sample inside my bucket, so I specify the prefix as sample/. I also tried multiple combinations like /sample, sample, and /sample/. None of the combinations worked, and the crawler always creates tables for the entire bucket structure.
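One plausible explanation (an assumption about data.all's internals, not confirmed here) is that the crawler's S3 target path is built from the bucket alone and the prefix is dropped. In the Glue API the prefix has to be part of the target Path itself, e.g.:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler name for illustration. The whole-bucket behaviour
# described above corresponds to a target like {"Path": "s3://example-bucket/"};
# to crawl only the sample/ folder, the prefix must be part of the Path:
glue.update_crawler(
    Name="dataall-example-crawler",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/sample/"}]},
)
```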
How to Reproduce
1) Import an S3 bucket containing many folders into data.all.
2) Start a crawler with the prefix toggle enabled and provide a folder name for the prefix.
3) Refresh after a few minutes; either the crawler creates tables for everything, or sometimes none at all.
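While reproducing, you can confirm what path the deployed crawler actually targeted, and what its last run reported, by inspecting it directly (the crawler name is a placeholder; data.all derives the real one from the dataset):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical crawler name for illustration.
crawler = glue.get_crawler(Name="dataall-example-crawler")["Crawler"]

# If the Path here is the bare bucket, the prefix never reached Glue.
print(crawler["Targets"]["S3Targets"])

# Status and error message of the most recent run (see point 2 above).
print(crawler.get("LastCrawl", {}))
```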
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
Mac
Python version
3.11
AWS data.all version
0.5.0
Additional context
No response