AWS crawler usage - Githubissues

ForeverAngry commented 7 months ago

When it comes to the documentation and running the sync process, what is the role of the Glue crawler? Is it a required step? I can create metadata without running it, so I'm wondering if I'm missing something. Can someone help fill in the gaps for me? Thanks!

sagarlakshmipathy commented 7 months ago

@ForeverAngry The Glue crawler is one way to catalog your data in the Glue Data Catalog, which lets you query the tables from external engines such as Athena, EMR Spark, or EMR Trino.

One other way is to catalog the tables during the sync process: https://medium.com/@sagarlakshmipathy/using-onetable-to-translate-a-hudi-table-to-iceberg-format-and-sync-with-glue-catalog-8c3071f08877

If you follow the above blog, you'll no longer need to run the crawler process again.

I'll add the steps directly to the XTable docs later.

ForeverAngry commented 7 months ago

Hi! Thanks for the response! Let me know if I understand this correctly. So after we run the sync process, if we use crawlers, we don't have to keep running the sync process.

Or you can skip the crawler and just run the sync process whenever you need to update the metadata. Do I understand this correctly?

sagarlakshmipathy commented 7 months ago

Getting a bit more specific here: the Glue crawler is responsible for cataloging the data into Glue databases, while XTable sync translates the table metadata to another table format. They exist for different reasons.

> So after we run the sync process, if we use crawlers, we don't have to keep running the sync process.

For example, say your pipeline has source Hudi tables and target Iceberg tables, and you're in the AWS ecosystem using Athena for querying. The first time, you run XTable sync to translate the Hudi table to an Iceberg table. At this point you have to catalog the table in a Glue database so you can query it from Athena. You can either use a Glue crawler for this, or use catalog.yaml as shown in the Medium example.

If there are changes in the upstream source, say your Hudi table gets updated, you need to run the XTable sync process again to translate the incremental changes to Iceberg. At this point, you don't need to run the crawler again unless there are schema changes.
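
To make the decision above concrete, here is a rough Python sketch of which step to run when. It is only an illustration of the reasoning, not anything XTable or AWS ships: `PipelineState`, `plan_steps`, and the step names `xtable_sync` and `glue_crawler` are all hypothetical, and the sketch assumes that a schema change always counts as a source change.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineState:
    """Hypothetical snapshot of the pipeline (names are illustrative only)."""
    first_run: bool            # target table has never been synced or cataloged
    source_changed: bool       # new data landed in the source Hudi table
    schema_changed: bool       # source schema changed since the table was cataloged
    catalog_during_sync: bool  # True if the catalog.yaml approach replaces the crawler


def plan_steps(state: PipelineState) -> List[str]:
    """Return the ordered steps to run for the workflow described above."""
    steps: List[str] = []

    # XTable sync translates table metadata (for example Hudi to Iceberg).
    # It is needed on the first run and whenever the source changes.
    # Assumption: a schema change is treated as a source change.
    if state.first_run or state.source_changed or state.schema_changed:
        steps.append("xtable_sync")

    # The crawler only catalogs the table in Glue. It is needed once up front,
    # and again only on schema changes, unless the sync itself registers
    # the table in the catalog.
    if not state.catalog_during_sync and (state.first_run or state.schema_changed):
        steps.append("glue_crawler")

    return steps


if __name__ == "__main__":
    # Incremental data change with a crawler-based setup: only the sync runs.
    print(plan_steps(PipelineState(False, True, False, False)))
```

So with a crawler-based setup, a plain data update only needs the sync, while a schema change needs the sync followed by another crawl.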

Am I making sense?

ForeverAngry commented 7 months ago

Ahhh I see. That makes complete sense, thank you!

the-other-tim-brown commented 7 months ago

@ForeverAngry can this issue be closed?

ForeverAngry commented 7 months ago

Yes, thank you!

apache / incubator-xtable

AWS crawler usage #414