vinothchandar opened 5 years ago
Not using it in prod yet, but very early stages of investigating its usage for ingestion use cases at Intuit.
At EMIS Health (https://www.emishealth.com/) we're using HUDI in production on AWS. We're the largest provider of Primary Care IT software in the UK and store over 500Bn healthcare records.
We use HUDI as a way to keep our analytics platform up-to-date from the source. We use Presto to query the data that is written.
We've only been using it for about 6 months - so far so good though!
We are exploring incremental data updates with HUDI and have rewritten our data-consumption code. We hope HUDI will help us increase incremental write speed and save computing resources. We are still in the exploration stage; it seems to fit our needs well.
Machine learning (multi-dimensional feature store, JVM-to-Python data fabric, etc.). HudiLink: a way to use Hudi as the data layer from Apache Flink (this is not about mixing the Spark and Flink programming models, but more about integration of the runtimes - Flink checkpointing and Hudi commits). Planning to keep vision & use case material here: https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics.
Please capture your own use cases here for consideration. :-)
I am doing a POC on the MapR platform to build a data fabric in an incremental fashion and for ETL offload needs.
@leilinen @amaranathv do you mind providing details of your organizations/teams to add to the blurb?
Hey, we are using Hudi at Yotpo for several usages. Firstly, we integrated Hudi as a writer in our open source ETL framework: https://github.com/YotpoLtd/metorikku. We currently use it as an output writer for a CDC pipeline - events generated from a database binlog are streamed to Kafka and then written to S3. The cool thing with Hudi is that it helps maintain a consistent view of the data in our data lake, as we don't need to merge the deletes and upserts ourselves. A blog post on that is being written and I'll share it here once it's done :)
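For readers unfamiliar with that merge, here is a minimal plain-Python sketch of the work Hudi spares a CDC pipeline from doing itself: folding binlog-style upsert and delete events, keyed by record key, into the current view of a table. The record and event shapes are hypothetical, chosen only for illustration; this is not Hudi's internal implementation.

```python
# Sketch: applying CDC events (upserts and deletes) to a keyed snapshot.
# Record/event shapes are hypothetical; Hudi performs this kind of merge
# internally during ingestion, so pipelines don't have to.

def apply_cdc(snapshot, events):
    """snapshot: dict of record_key -> row; events: CDC events in log order."""
    view = dict(snapshot)
    for event in events:
        key = event["key"]
        if event["op"] == "delete":
            view.pop(key, None)        # drop the record if present
        else:                          # "insert" and "update" both upsert
            view[key] = event["row"]
    return view

snapshot = {1: {"id": 1, "name": "old"}, 2: {"id": 2, "name": "keep"}}
events = [
    {"op": "update", "key": 1, "row": {"id": 1, "name": "new"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"id": 3, "name": "added"}},
]
print(apply_cdc(snapshot, events))
```

Because events are applied in log order, a later delete correctly removes a record an earlier event inserted, which is exactly the consistency guarantee the comment above relies on.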
More usages are coming. 10x, Ron
Trying to build a CDC pipeline to capture SQL changes and prepare data for analytics use cases. Not in production yet; right now we're adapting it to our use case and organisation-specific challenges. We are using Hudi 0.4.7 with Spark 2.3.2, Hadoop 3.1.0, Hive 3.1.0, and the spark-streaming-kafka-0-10_2.11 library.
We have used it in Investment Banking area from dev all the way into production. Hudi has been mainly used to build upsertable gas, energy and oil markets data where deals often need to get updated.
We're currently deploying HUDI 0.5.0 at an EU bank. Not in prod yet. HUDI will be used to provide ACID capabilities for data ingestion batches and streams that need them (sourcing update files & CDC streams to HDFS).
We have been using HUDI to manage a data lake with 500+TB of manufacturing data for almost a year now. In the IoT world, late arrivals and updates are a very common scenario and HUDI handles them perfectly for us. We use Impala to query the data. The small file handling and easy partitioning features of HUDI let us build an efficient structure for querying on the fly. In addition, incremental pulling, combined with the custom merging feature between historical data and change data, makes expensive batch jobs such as aggregating BI dashboards and maintaining a large graph database much more efficient.
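To make the incremental-pull benefit concrete, here is a rough plain-Python illustration (field names are made up; this is not Hudi's API): instead of re-scanning the whole table, a batch job asks only for records committed after its last checkpoint and folds just those into its existing aggregate.

```python
# Illustrative sketch of an incremental-pull style batch job (not Hudi's
# API). Each record carries the commit time at which it was written.

def incremental_pull(records, last_checkpoint):
    """Return only the records committed after the job's last checkpoint."""
    return [r for r in records if r["commit_time"] > last_checkpoint]

def merge_counts(running_counts, changes):
    """Fold the change records into an existing per-key aggregate."""
    counts = dict(running_counts)
    for r in changes:
        counts[r["key"]] = counts.get(r["key"], 0) + r["value"]
    return counts

records = [
    {"commit_time": 100, "key": "a", "value": 1},
    {"commit_time": 200, "key": "a", "value": 2},
    {"commit_time": 300, "key": "b", "value": 5},
]
changes = incremental_pull(records, last_checkpoint=150)  # skips the first record
print(merge_counts({"a": 1}, changes))
```

The cost of the job now scales with the volume of change data since the checkpoint rather than with the full table size, which is why the dashboard and graph-maintenance jobs above get cheaper.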
We are trying to use the HUDI DeltaStreamer to read from a compacted Kafka topic in a production environment, pull the messages incrementally, and persist the data to the MapR platform at regular intervals. We initially tried MOR with the streamer's continuous mode and faced a small file issue; we now plan to run it in mini-batches (every 2 hours) with COPY_ON_WRITE to avoid compaction, etc. We are facing an OffsetOutOfRange exception. We tried both auto.offset.reset=earliest and latest and encountered the same exception. We notice in the logs that auto.offset.reset is being overridden to none: WARN kafka010.KafkaUtils: overriding enable.auto.commit to false for executor / WARN kafka010.KafkaUtils: overriding auto.offset.reset to none for executor
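For context on that symptom: those WARN lines are expected from spark-streaming-kafka-0-10 (the driver resolves offsets, so executors are forced to auto.offset.reset=none), and an OffsetOutOfRange on a compacted/retained topic typically means a stored checkpoint points at offsets the broker no longer serves. A plain-Python sketch of the general remedy - clamping a stale checkpoint into the broker's currently available range before resuming - with a hypothetical helper, not DeltaStreamer code:

```python
# Sketch (hypothetical helper, not DeltaStreamer code): clamp a saved
# checkpoint offset into the range the broker currently serves, so a
# resumed job doesn't request offsets that compaction/retention removed.

def clamp_offset(checkpoint_offset, earliest, latest):
    """Return a valid offset to resume from for one partition."""
    if checkpoint_offset < earliest:
        return earliest            # data was aged out; skip ahead
    if checkpoint_offset > latest:
        return latest              # checkpoint is ahead of the log; rewind
    return checkpoint_offset       # still valid; resume exactly here

# A checkpoint taken before retention removed offsets below 500:
print(clamp_offset(120, earliest=500, latest=9000))  # resumes at 500
```

The same clamping would be applied per partition using the broker's beginning/end offsets before handing the resume positions to the consumer.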
@garyli1019 do you mind sharing your company name/logo and is it okay to list this on powered_by?
@prashanthpdesai Let's take the small file issue offline. Interested to understand why it does not work for you in MOR. As I understand it, you are still in pre-prod?
@vinothchandar: sure, yes, we are still in pre-prod.
@vinothchandar Will do once I clear some internal process
We are using it at an online casino based in Malta. It's in production, but only for a small part of our dataset: a large table with frequent updates. With Hudi we are able to update it frequently and still not face any issues while querying. My remaining challenge is making it available to the analysts in Jupyter via Livy. I was able to add the Hudi jar to Livy, but the httpclient lib required by Hudi breaks Livy, and the one provided by Livy breaks Hudi compatibility. Thank you very much for open sourcing it; I personally think it's great!
At Udemy (https://www.udemy.com/) we're using Apache Hudi (0.5.0) on AWS EMR (5.29.0) to ingest MySQL change data capture. Thank you for open sourcing a great project, and congratulations on becoming a TLP.
@maduxi do you mind sharing the company name, so we can add this to the site?
@sungjuly same question. Can we add this to the site?
Please let me know
Hudi has been integrated into Data Lake Analytics (DLA) at Aliyun to provide a data lake solution for users on OSS. First, you write data to Hudi on OSS, then sync it to DLA; after that you can query the Hudi dataset via DLA.
@vinothchandar yes, please do, thank you!
@Akshay2Agarwal Thanks! Updating here: #2191. The blog is great; let me tweet/share it as well. Do you have a Twitter handle? (nvm, found it)
apna is India's largest professional networking & job opportunities platform for the rising workforce. We chose Hudi as our underlying data foundation to build a Lakehouse for apna and unleash the power of data. CC: @vinothchandar
Our data engineering team at Penn Interactive/TheScore is currently developing a new data platform following the combination of our two companies. We are in the online and retail sports betting and sports media industry, based in the US and Canada.
We are implementing a Hudi datalake as the foundational data layer of our analytics and reporting platform, using DeltaStreamer and other Hudi Spark jobs to ingest data. We are streaming CDC logs from approximately 1200 tables across 75 Postgres databases within the company, using PostgresDebeziumSource with Confluent Cloud Kafka topics. We are also using DeltaStreamer for multiple batch ingestion jobs to further enrich the datalake.
The new data platform will power the business intelligence and compliance/reporting operations for TheScoreBet and Barstool Sportsbook, subsidiary companies of Penn Entertainment.
If you are using Hudi, it would be awesome to hear from you and have you share this with the community. This way we can keep investing more in making Hudi better as a holistic open source big data storage solution.
Your report will be added to the powered-by page here: https://hudi.apache.org/docs/powered_by.html