apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

Tracking ticket for reporting Hudi usages from the community #661

Open vinothchandar opened 5 years ago

vinothchandar commented 5 years ago

If you are using Hudi, it would be awesome to hear from you and have you share this with the community. This way we can keep investing in making Hudi better as a holistic open source big data storage solution.

Your report will be added to the powered-by page here. https://hudi.apache.org/docs/powered_by.html

tcbakes commented 5 years ago

Not using it in prod yet, but we're in the very early stages of investigating it for ingestion use cases at Intuit.

rtjarvis commented 5 years ago

At EMIS Health (https://www.emishealth.com/) we're using HUDI in production on AWS. We're the largest provider of Primary Care IT software in the UK and store over 500Bn healthcare records.

We use HUDI as a way to keep our analytics platform up to date with the source. We use Presto to query the data that is written.

We've only been using it for about 6 months - so far so good though!

leilinen commented 5 years ago

We are exploring incremental data updates with HUDI and have rewritten our data-consumption code. We hope HUDI will help us increase incremental write speed and save computing resources. We are still in the exploration stage, but so far it seems to fit our needs.

SemanticBeeng commented 5 years ago

Planning to keep vision & use case material here https://cwiki.apache.org/confluence/display/HUDI/Hudi+for+Continuous+Deep+Analytics.

Please capture your own use cases here for consideration. :-)

amarnathv9 commented 5 years ago

I am doing a POC on the MapR platform to build a data fabric in an incremental fashion and for ETL offload needs.

vinothchandar commented 5 years ago

@leilinen @amarnathv9 do you mind providing details of your organizations/teams to add to the blurb?

RonBarabash commented 5 years ago

Hey, we are using Hudi at Yotpo for several use cases. Firstly, we integrated Hudi as a writer in our open source ETL framework: https://github.com/YotpoLtd/metorikku. We currently use it as the output writer for a CDC pipeline: events generated from a database binlog stream to Kafka and are then written to S3. The cool thing with Hudi is that it helps us maintain a consistent view of the data in our data lake, since we don't need to merge the deletes and upserts ourselves. A blog post on that is being written and I'll share it here once it's done :)

More use cases are coming, 10x Ron
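
For readers who want a concrete picture, a minimal sketch of this kind of CDC upsert with the Hudi Spark datasource might look like the following; the table name, key fields and S3 paths are illustrative, not Yotpo's actual setup:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    // Minimal sketch: upsert parsed binlog change events into a Hudi table on S3.
    val spark = SparkSession.builder().appName("cdc-upsert-sketch").getOrCreate()

    // Assume the change events have already been consumed from Kafka and staged as JSON.
    val changes = spark.read.json("s3a://example-bucket/staging/binlog-batch/")

    changes.write
      .format("hudi")
      .option("hoodie.table.name", "orders_cdc")
      .option("hoodie.datasource.write.recordkey.field", "order_id")    // primary key of the source table
      .option("hoodie.datasource.write.precombine.field", "updated_at") // latest change wins on merge
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save("s3a://example-bucket/lake/orders_cdc/")

The precombine field is what lets Hudi pick the newest event when several changes arrive for the same key, which is the "no manual merging" property described above.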

pratyakshsharma commented 4 years ago

We are trying to build a CDC pipeline for capturing SQL changes and preparing data for analytics use cases. It is not in production yet; right now we are modifying it for our use case and organisation-specific challenges. We are using Hudi 0.4.7 with Spark 2.3.2, Hadoop 3.1.0, Hive 3.1.0 and the spark-streaming-kafka-0-10_2.11 library.
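
With the stack listed above (spark-streaming-kafka-0-10 DStreams feeding Hudi), the skeleton of such a pipeline might look roughly like this. The topic, key fields and paths are placeholders, and note that on the 0.4.x line the datasource was registered as com.uber.hoodie rather than hudi:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val spark = SparkSession.builder().appName("cdc-stream-sketch").getOrCreate()
    import spark.implicits._
    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092", // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "cdc-to-hudi",
      "auto.offset.reset" -> "earliest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("sql.changes"), kafkaParams))

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Assume each Kafka record value is one JSON-encoded change event.
        val df = spark.read.json(rdd.map(_.value()).toDS())
        df.write
          .format("hudi") // "com.uber.hoodie" on 0.4.x
          .option("hoodie.table.name", "sql_changes")
          .option("hoodie.datasource.write.recordkey.field", "id")
          .option("hoodie.datasource.write.precombine.field", "updated_at")
          .option("hoodie.datasource.write.operation", "upsert")
          .mode(SaveMode.Append)
          .save("hdfs:///data/hudi/sql_changes")
      }
    }

    ssc.start()
    ssc.awaitTermination()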

smdahmed commented 4 years ago

We have used it in the investment banking area, from dev all the way into production. Hudi has mainly been used to build upsertable gas, energy and oil markets data, where deals often need to be updated.

broussea1901 commented 4 years ago

We're currently deploying HUDI 0.5.0 in an EU bank. Not in prod yet. HUDI will be used to provide ACID capabilities for data ingestion batches and streams that need them (sourcing update files & CDC streams to HDFS).

garyli1019 commented 4 years ago

We have been using HUDI to manage a data lake with 500+ TB of manufacturing data for almost a year now. In the IoT world, late arrivals and updates are a very common scenario, and HUDI handles them perfectly for us. We use Impala to query the data. HUDI's small-file handling and easy partitioning let us build an efficient structure for querying on the fly. In addition, incremental pulling makes expensive batch jobs, like aggregating BI dashboards and maintaining a large graph database, much more efficient thanks to the custom merging between historical data and change data.
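
As an illustration of the incremental pulling mentioned above, a minimal sketch with the Spark datasource follows; the option names are those of recent Hudi releases, and the path and instant time are placeholders:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch of an incremental pull: read only the records that changed
    // after a given commit instant, instead of rescanning the whole table.
    val spark = SparkSession.builder().appName("incremental-pull-sketch").getOrCreate()

    val changed = spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20200315000000") // last processed commit
      .load("/data/hudi/sensor_readings")

    // Feed only the changed rows into the expensive downstream aggregation.
    changed.createOrReplaceTempView("sensor_changes")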

prashanthpdesai commented 4 years ago

We are trying to use the HUDI DeltaStreamer to read from a compacted Kafka topic in a production environment, pull messages incrementally and persist the data to the MapR platform at regular intervals. We initially tried MOR with the streamer's continuous mode and faced a small-file issue, so we now plan to run it as a mini-batch (every 2 hours) with COPY_ON_WRITE to avoid compaction etc. We are facing an OffsetOutOfRange exception; we tried both auto.offset.reset=earliest and latest and hit the same exception. We notice in the log that auto.offset.reset is being overridden to none:

WARN kafka010.KafkaUtils: overriding enable.auto.commit to false for executor
WARN kafka010.KafkaUtils: overriding auto.offset.reset to none for executor
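
For context, a mini-batch COPY_ON_WRITE DeltaStreamer run of the kind described might be launched roughly as follows; the class and flag names come from hudi-utilities (older releases spell --table-type as --storage-type), while the jar, paths, topic and ordering field are placeholders:

    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      hudi-utilities-bundle.jar \
      --table-type COPY_ON_WRITE \
      --op UPSERT \
      --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
      --source-ordering-field ts \
      --target-base-path /data/hudi/events \
      --target-table events \
      --props kafka-source.properties

Here kafka-source.properties would carry the topic (hoodie.deltastreamer.source.kafka.topic) and the pass-through Kafka consumer settings such as auto.offset.reset. Note that the two WARN lines above are standard spark-streaming-kafka-0-10 behavior: executors are forced to auto.offset.reset=none because they must consume exactly the offset ranges the driver assigns, so that override is expected rather than the root cause.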

vinothchandar commented 4 years ago

@garyli1019 do you mind sharing your company name/logo and is it okay to list this on powered_by?

@prashanthpdesai Let's take the small file issue offline. Interested to understand why it does not work for you with MOR. As I understand it, you are still in pre-prod?

prashanthpdesai commented 4 years ago

@vinothchandar: sure, yes we are still in pre-prod.

garyli1019 commented 4 years ago

@vinothchandar Will do once I clear some internal process

maduxi commented 4 years ago

We are using it at an online casino based in Malta. We are using it in production, but only for a small part of our dataset: a large table with frequent updates, which Hudi lets us update frequently without facing any issues while querying. My one remaining challenge is to make it available to the analysts in Jupyter using Livy. I was able to add the Hudi jar to Livy, but the httpclient lib required by Hudi breaks Livy, and the one provided by Livy breaks Hudi compatibility. Thank you very much for open sourcing it, I personally think it's great!
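
One possible way to attack this kind of classpath clash, offered purely as an untested sketch: attach the Hudi bundle per Livy session and ask Spark to prefer the user classpath, so Hudi resolves its own httpclient while the Livy server's classpath stays untouched (server URL, jar path and versions are placeholders):

    curl -X POST http://livy-server:8998/sessions \
      -H 'Content-Type: application/json' \
      -d '{
            "kind": "spark",
            "conf": {
              "spark.jars": "hdfs:///libs/hudi-spark-bundle_2.11-0.5.0.jar",
              "spark.driver.userClassPathFirst": "true",
              "spark.executor.userClassPathFirst": "true"
            }
          }'

The userClassPathFirst flags are marked experimental in Spark and can introduce conflicts of their own, so shading httpclient inside the Hudi bundle would be the more surgical alternative.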

sungjuly commented 4 years ago

At Udemy (https://www.udemy.com/) we're using Apache Hudi (0.5.0) on AWS EMR (5.29.0) to ingest MySQL change data capture. Thank you for open sourcing a great project. Congratulations on becoming a TLP.

vinothchandar commented 4 years ago

@maduxi do you mind sharing the company name or can we add this to the site?

@sungjuly same question. Can we add this to the site?

Please let me know

leesf commented 4 years ago

Hudi has been integrated into Data Lake Analytics (DLA) at Aliyun to provide a data lake solution for users on OSS. First you write data to Hudi on OSS, then sync it to DLA; after that you can query the Hudi dataset via DLA.

zhengcanbin commented 4 years ago

EMR from Tencent Cloud has integrated Hudi as one of its BigData components since V2.2.0. Using Hudi, the end-users can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on HDFS/COS/CHDFS using Apache Parquet and Apache Avro.

sungjuly commented 4 years ago

@vinothchandar yes, please do, thank you!

Akshay2Agarwal commented 3 years ago

Grofers has integrated Hudi into its central pipelines for replicating backend database CDC into the warehouse. We have also published a blog post about its integration into our data platform.

vinothchandar commented 3 years ago

@Akshay2Agarwal Thanks! Updating here: #2191. The blog is great, let me tweet/share it as well. Do you have a twitter handle? (nvm, found it)

Sarfaraz-214 commented 1 year ago

apna is India's largest professional networking & job opportunities platform for the rising workforce. We chose Hudi as our underlying data foundation to build a Lakehouse for apna and unleash the power of data. CC: @vinothchandar

sydneyhoran commented 1 year ago

Our data engineering team at Penn Interactive/TheScore is currently developing a new data platform following the combination of our two companies. We are in the online and retail sports betting and sports media industry, based in the US and Canada.

We are implementing a Hudi data lake as the foundational data layer of our analytics and reporting platform, using Deltastreamer and other Hudi Spark jobs to ingest data. We are streaming CDC logs from approximately 1,200 tables across 75 Postgres databases within the company, using PostgresDebeziumSource to read from Confluent Cloud Kafka topics. We are also using Deltastreamer for multiple batch ingestion jobs to further enrich the data lake.

The new data platform will power the business intelligence and compliance/reporting operations for TheScoreBet and Barstool Sportsbook, subsidiary companies of Penn Entertainment.
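
A rough sketch of the kind of Debezium-sourced DeltaStreamer invocation described above, with class names from hudi-utilities and the jar, paths and table names as placeholders to adapt per table:

    spark-submit \
      --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      hudi-utilities-bundle.jar \
      --table-type MERGE_ON_READ \
      --source-class org.apache.hudi.utilities.sources.debezium.PostgresDebeziumSource \
      --payload-class org.apache.hudi.common.model.debezium.PostgresDebeziumAvroPayload \
      --source-ordering-field _event_lsn \
      --target-base-path s3a://example-lake/users \
      --target-table users \
      --props debezium-source.properties

Here debezium-source.properties would point at the Confluent Cloud brokers, the schema registry and the per-table topic; _event_lsn is the LSN meta-column that Hudi's Postgres Debezium payload uses for ordering in recent releases.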