datahub-project / datahub

The Metadata Platform for your Data Stack
https://datahubproject.io
Apache License 2.0

HDFS and HIVE Data Set Loading #326

Closed sunny1978 closed 4 years ago

sunny1978 commented 7 years ago

Hi - I am looking for ways to load HDFS and Hive datasets and have the ETL jobs pick up and run.

My cluster: HDP 2.5, 1 node (node09.example.com); it's an all-in-one node. WhereHows: got the latest, built it, and started web and backend_services. I am able to access the UI just fine.

What am I looking for? HDFS, Hive, and HBase to start with, and then Kafka later on.

Based on what I have read so far, I prepared and loaded the following; however, no ETL jobs are running.

Kindly help me. I desperately need this service up and running and monitoring the data flowing into the cluster.
====== My Inserts ======

-- For HIVE
insert into cfg_application (app_id,app_code,description,parent_app_id,app_status,uri) values (101,'HIVE WH','HIVE WH',0,'A','jdbc:mysql://node09.example.com/hive?createDatabaseIfNotExist=true');

insert into cfg_database (db_id,db_code,primary_dataset_type,description,is_logical,associated_dc_num,cluster,cluster_size,jdbc_url,uri) values (101,'HIVE WH','HIVE','HIVE WH','1',1,'N109HDP24',1,'jdbc:mysql://node09.example.com/hive?createDatabaseIfNotExist=true','jdbc:mysql://node09.example.com/hive?createDatabaseIfNotExist=true');

INSERT INTO wh_etl_job (wh_etl_job_id,wh_etl_job_name,wh_etl_type,cron_expr,ref_id,ref_id_type,is_active) VALUES (11,'HIVE_DATASET_METADATA_ETL','DATASET','5 ?',101,'DB','Y');
INSERT INTO wh_etl_job_property VALUES (110,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.metastore.jdbc.url','jdbc:mysql://node09.example.com/hive','N','url to connect to hive metastore');
INSERT INTO wh_etl_job_property VALUES (111,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.metastore.jdbc.driver','com.mysql.jdbc.Driver','N',NULL);
INSERT INTO wh_etl_job_property VALUES (112,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.metastore.password','hive','N',NULL);
INSERT INTO wh_etl_job_property VALUES (113,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.metastore.username','hive','N',NULL);
INSERT INTO wh_etl_job_property VALUES (114,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.schema_json_file','/var/tmp/wherehows/hive_schema.json','N',NULL);
INSERT INTO wh_etl_job_property VALUES (115,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.schema_csv_file','/var/tmp/wherehows/hive_schema.csv','N',NULL);
INSERT INTO wh_etl_job_property VALUES (116,'HIVE_DATASET_METADATA_ETL',101,'DB','hive.field_metadata','/var/tmp/wherehows/hive_field_metadata.csv','N',NULL);
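
To read back what was just loaded for the Hive job, a simple sanity query like the following can be used; it only uses tables and columns that appear in the INSERT statements above, nothing WhereHows-specific is assumed beyond that:

-- Sanity check: the Hive ETL job row, joined to the cfg_database row its ref_id points at.
SELECT j.wh_etl_job_id, j.wh_etl_job_name, j.cron_expr, j.ref_id, j.ref_id_type, j.is_active,
       d.db_code, d.jdbc_url
FROM wh_etl_job j
LEFT JOIN cfg_database d ON j.ref_id = d.db_id
WHERE j.wh_etl_job_id = 11;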

-- FOR HDFS
insert into cfg_application (app_id,app_code,description,parent_app_id,app_status,uri) values (100,'HDFS WH','HDFS WH',0,'A','hdfs://node09.example.com:8020');

insert into cfg_database (db_id,db_code,primary_dataset_type,description,is_logical,associated_dc_num,cluster,cluster_size,jdbc_url,uri) values (100,'HDFS WH','HDFS','HDFS WH','1',1,'N109HDP24',1,'hdfs://node09.example.com:8020','hdfs://node09.example.com:8020');

INSERT INTO wh_etl_job (wh_etl_job_id,wh_etl_job_name,wh_etl_type,cron_expr,ref_id,ref_id_type,is_active) VALUES (10,'HADOOP_DATASET_METADATA_ETL','DATASET','5 ?',100,'APP','Y');

INSERT INTO wh_etl_job_property VALUES (151,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.cluster','N109HDP24','N','');
INSERT INTO wh_etl_job_property VALUES (152,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.remote.machine','node09.example.com','N','');
INSERT INTO wh_etl_job_property VALUES (153,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.private_key_location','/root/.ssh/id_rsa','N','');
INSERT INTO wh_etl_job_property VALUES (154,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.remote.jar','','N','');
INSERT INTO wh_etl_job_property VALUES (155,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.remote.user','root','N','');
INSERT INTO wh_etl_job_property VALUES (156,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.remote.raw_metadata','','N','');
INSERT INTO wh_etl_job_property VALUES (157,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.remote.sample','','N','');
INSERT INTO wh_etl_job_property VALUES (158,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.local.field_metadata','/var/tmp/wherehows/hdfs_field_meta','N','');
INSERT INTO wh_etl_job_property VALUES (159,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.local.metadata','/var/tmp/wherehows/hdfs_meta','N','');
INSERT INTO wh_etl_job_property VALUES (160,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.local.raw_metadata','/var/tmp/wherehows/hdfs_raw_meta','N','');
INSERT INTO wh_etl_job_property VALUES (161,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.local.sample','/var/tmp/wherehows/hdfs_sample','N','');
INSERT INTO wh_etl_job_property VALUES (162,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.white_list','/user,/hive,/apps,/hdp,/mapred','N','');
INSERT INTO wh_etl_job_property VALUES (163,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.num_of_thread','2','N','');
INSERT INTO wh_etl_job_property VALUES (164,'HADOOP_DATASET_METADATA_ETL',100,'DB','hdfs.file_path_regex_source_map','','N','');
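
One thing that stands out in both wh_etl_job rows above is cron_expr = '5 ?'. If the backend parses this as a Quartz-style cron expression (the trailing '?' is a Quartz convention), '5 ?' looks incomplete. Purely as an illustration, not the documented format, a full six-field expression that fires at five minutes past every hour could be set like this; the schedule value is a hypothetical example:

-- Hypothetical example only: a complete Quartz-style expression
-- (seconds minutes hours day-of-month month day-of-week) for both jobs.
UPDATE wh_etl_job SET cron_expr = '0 5 * * * ?' WHERE wh_etl_job_id IN (10, 11);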

INSERT INTO wh_property VALUES
  ('wherehows.app_folder','/var/tmp/wherehows','N',NULL),
  ('wherehows.db.driver','com.mysql.jdbc.Driver','N',NULL),
  ('wherehows.db.jdbc.url','jdbc:mysql://localhost/wherehows','N',NULL),
  ('wherehows.db.password','wherehows','N',NULL),
  ('wherehows.db.username','wherehows','N',NULL),
  ('wherehows.encrypt.master.key.loc','/var/tmp/wherehows/.wherehows/master_key','N',NULL),
  ('wherehows.ui.tree.dataset.file','/var/tmp/wherehows/resource/dataset.json','N',NULL),
  ('wherehows.ui.tree.flow.file','/var/tmp/wherehows/resource/flow.json','N',NULL);
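
After loading everything, a couple of read-back queries can confirm what the backend should see. These again only use tables and columns named in the statements above (wh_property is selected with * because its column names are not listed here):

-- Both ETL jobs, with the cfg_application / cfg_database row each ref_id resolves to.
SELECT j.wh_etl_job_name, j.ref_id_type, j.ref_id, j.is_active,
       a.app_code AS app_match, d.db_code AS db_match
FROM wh_etl_job j
LEFT JOIN cfg_application a ON j.ref_id = a.app_id
LEFT JOIN cfg_database d ON j.ref_id = d.db_id
WHERE j.wh_etl_job_id IN (10, 11);

-- Global WhereHows properties as loaded.
SELECT * FROM wh_property;

If a job row comes back with is_active not equal to 'Y', or with both app_match and db_match NULL, then presumably there is nothing for the scheduler to pick up.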

sunny1978 commented 7 years ago

Where can I find the "wherehows.dump" mentioned on the backend-service page? Is it just those 3 table inserts? Am I missing any more?

keremsahin1 commented 4 years ago

Dear issue owner,

Thanks for your interest in WhereHows. We have recently announced DataHub, which is the rebranding of WhereHows: LinkedIn improved the architecture of WhereHows, replaced its metadata infrastructure, and rebranded it as DataHub. DataHub is a more advanced and improved metadata management product compared to WhereHows.

Unfortunately, we have to stop supporting WhereHows to better focus on DataHub and offer more help to DataHub users. Therefore, we will drop all issues related to WhereHows and will not accept any contributions for it. Active development of DataHub has already started on the datahub branch and will continue there until it is finally merged to master and the project is renamed to DataHub.

Please check the datahub branch to get familiar with DataHub.

Best,
DataHub team