apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Integrate WebHdfs/Hdfs/s3 as controller storage solution #95

Open xiangfu0 opened 8 years ago

xiangfu0 commented 8 years ago

Currently Pinot relies on NFS as the storage layer on the Pinot controller and uses a VIP to handle controller failover, which is difficult or impossible for many users to set up. It would be better to leverage HDFS/S3 as the storage layer and allow servers to download segments directly from HDFS/S3.

kishoreg commented 8 years ago

If we change the segment upload command to take in a download URL, this should just work. Note that the controller will still have to download the segment to look at the metadata, but it can download it locally and delete it after assigning the segments to servers.
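A minimal sketch of the flow described here, assuming a hypothetical handler class (none of these names are actual Pinot code): the controller fetches the segment from the supplied URL into a temp file, reads the metadata, assigns the segment, then deletes the local copy.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical controller-side flow (not the actual Pinot API): fetch the
// segment from the caller-supplied download URL, read the metadata locally,
// assign the segment to servers, then delete the local copy.
public class SegmentUrlUploadHandler {

  // Placeholder for the metadata the controller extracts from the tar.
  static class SegmentMetadata {
    final String segmentName;
    SegmentMetadata(String segmentName) { this.segmentName = segmentName; }
  }

  public void handleUpload(URI downloadUri) throws IOException {
    Path tempSegment = Files.createTempFile("segment-", ".tar.gz");
    try (InputStream in = downloadUri.toURL().openStream()) {
      Files.copy(in, tempSegment, StandardCopyOption.REPLACE_EXISTING);
      SegmentMetadata metadata = readMetadata(tempSegment);
      // Servers are told to pull from downloadUri, not from the controller.
      assignSegment(metadata, downloadUri);
    } finally {
      // Nothing is retained on the controller after assignment.
      Files.deleteIfExists(tempSegment);
    }
  }

  private SegmentMetadata readMetadata(Path segmentTar) {
    // Stub: a real implementation would untar and parse the metadata files.
    return new SegmentMetadata(segmentTar.getFileName().toString());
  }

  private void assignSegment(SegmentMetadata metadata, URI downloadUri) {
    // Stub: a real implementation would update the cluster's segment state.
    System.out.println("Assigned " + metadata.segmentName + " -> " + downloadUri);
  }
}
```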

dhaval2025 commented 8 years ago

Let's add another segment upload endpoint where, instead of sending the entire tar, we send the segment metadata only.

Now we can either use an hdfs:// prefix in the download URL and allow servers to download the segments directly from WebHDFS, or let servers call the controller and have the controller get the segment from HDFS.

The first option is better, but there may be firewall restrictions between the Hadoop DC and the Pinot DC, so controller nodes fetching segments from HDFS on behalf of servers might be more scalable.
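To make the two options concrete, here is an illustrative server-side dispatch on the download URI scheme; the class and method names are hypothetical, not Pinot code.

```java
import java.net.URI;

// Illustrative server-side choice between the two options above:
// dispatch on the URI scheme of the segment's download location.
public class SegmentDownloadDispatcher {

  public void download(URI downloadUri) {
    String scheme = downloadUri.getScheme();
    if ("hdfs".equals(scheme) || "webhdfs".equals(scheme)) {
      // Option 1: servers pull straight from (web-)HDFS. Requires every
      // server to be whitelisted through the firewall to the Hadoop DC.
      downloadDirectFromHdfs(downloadUri);
    } else {
      // Option 2: servers go through the controller, which fetches the
      // segment from HDFS on their behalf. Only controllers need access.
      downloadViaController(downloadUri);
    }
  }

  private void downloadDirectFromHdfs(URI uri) {
    System.out.println("Pulling directly from HDFS: " + uri);
  }

  private void downloadViaController(URI uri) {
    System.out.println("Asking controller to proxy: " + uri);
  }
}
```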

xiangfu0 commented 8 years ago

As you said, let the controller take the URL only; the controller would then handle downloading the data and the assignment. This is the simplest way given the current code structure.

We may also think of modifying the segment push job so that we POST the segment metadata plus a download URL to the controller; then the controller won't need to handle any data transmission.
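As an illustration of that push-job change, a hypothetical client could POST a small JSON body instead of the tar; the endpoint path (/segments/metadata) and field names here are invented for the sketch.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical push-job client illustrating the metadata-only upload: the
// controller receives JSON plus a download URL and never sees segment data.
public class MetadataPushClient {

  public static void pushMetadata(String controllerUrl, String segmentName,
      String downloadUrl) throws IOException {
    String body = "{"
        + "\"segmentName\":\"" + segmentName + "\","
        + "\"downloadUrl\":\"" + downloadUrl + "\""
        + "}";
    HttpURLConnection conn = (HttpURLConnection)
        new URL(controllerUrl + "/segments/metadata").openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("Controller responded: " + conn.getResponseCode());
  }

  public static void main(String[] args) throws IOException {
    pushMetadata("http://controller-host:9000", "mySegment_0",
        "hdfs://namenode:8020/pinot/segments/mySegment_0.tar.gz");
  }
}
```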

xiangfu0 commented 8 years ago

@dhaval2025 If a firewall is set up between the Hadoop cluster and the Pinot cluster, then we still need to push the segment tar to the controller, store it in NFS, and let servers download it from the controller nodes.

dhaval2025 commented 8 years ago

@fx19880617 Correct, but with controller nodes the whitelist will be only 3 to 5 nodes; if you allow servers to pull too, the whitelist will grow to hundreds of nodes.

Let's have controllers act as a proxy: servers talk to the controller, and the controller pulls from HDFS or serves a local copy.
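A rough sketch of that proxy behavior, with illustrative names only: the controller serves a cached local copy when it has one and otherwise pulls the segment from HDFS first, so only the few controller nodes need to be whitelisted.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the proxy idea: a server asks the controller for a segment;
// the controller serves a local copy if present, else pulls from HDFS.
public class ControllerSegmentProxy {

  private final Path localSegmentDir = Paths.get("/var/pinot/controller/segments");

  public void serveSegment(String segmentName, OutputStream responseBody)
      throws IOException {
    Path localCopy = localSegmentDir.resolve(segmentName + ".tar.gz");
    if (!Files.exists(localCopy)) {
      // Only the handful of controller nodes need firewall access to HDFS.
      fetchFromHdfs(segmentName, localCopy);
    }
    Files.copy(localCopy, responseBody);
  }

  private void fetchFromHdfs(String segmentName, Path dest) throws IOException {
    // Stub: a real implementation would use an HDFS/WebHDFS client here.
    throw new UnsupportedOperationException("HDFS fetch not implemented in sketch");
  }
}
```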

kishoreg commented 8 years ago

We should strive to keep the logic on the servers as it is. All the heavy lifting needs to be done by the controller. I like the idea of supporting multiple types of pushes.

dhaval2025 commented 8 years ago

agreed

xiangfu0 commented 8 years ago

Yes, I would like to keep the segment assignment logic in the controller, and just give servers flexible options for downloading the data, either through or not through the controller.

xiangfu0 commented 8 years ago

@icefury71

Breaking down to tasks

  1. Abstract the segment fetcher layer to be pluggable, to support different segment fetchers (http/s3/hdfs/webhdfs); a sketch of this follows the list.
  2. Implement a WebHDFS segment fetcher.
  3. Add a new segment push protocol on the controller that accepts segment metadata JSON only. Need to double-check against the existing validation logic to make sure it won't break.
  4. Modify the existing segment push job to POST the segment metadata JSON with a corresponding download URL.
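For task 1, here is a minimal sketch of what a pluggable fetcher layer could look like, assuming one fetcher is registered per URI scheme. The interface and class names are illustrative, not actual Pinot code; the real change is in the PR linked below.

```java
import java.io.File;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Illustrative pluggable fetcher layer for task 1: one fetcher per URI
// scheme, looked up at download time. All names are hypothetical.
public class SegmentFetcherRegistry {

  public interface SegmentFetcher {
    void fetchSegmentToLocal(URI segmentUri, File destination) throws Exception;
  }

  private final Map<String, SegmentFetcher> fetchersByScheme = new HashMap<>();

  public void register(String scheme, SegmentFetcher fetcher) {
    fetchersByScheme.put(scheme, fetcher);
  }

  public void fetch(URI segmentUri, File destination) throws Exception {
    SegmentFetcher fetcher = fetchersByScheme.get(segmentUri.getScheme());
    if (fetcher == null) {
      throw new IllegalArgumentException(
          "No fetcher registered for scheme: " + segmentUri.getScheme());
    }
    fetcher.fetchSegmentToLocal(segmentUri, destination);
  }
}
```

An HTTP fetcher, a WebHDFS fetcher (task 2), and so on would each be registered under its own scheme, so adding a new storage backend would not touch the download path.
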
xiangfu0 commented 8 years ago

This is the PR for task 1, abstracting the segment fetcher layer to be pluggable to support different segment fetchers (http/s3/hdfs/webhdfs): https://github.com/linkedin/pinot/pull/105