xiangfu0 opened 8 years ago
If we change the segment upload command to take in a download URL, this should just work. Note that the controller will still have to download the segment to look at the metadata, but it can download it locally and delete it after assigning the segments to servers.
let's add another segment upload endpoint ... where instead of sending the entire tar, we send only the segment metadata ...
Now we can either use the hdfs:// prefix in the download URL and allow servers to download the segments directly from WebHDFS, or have servers call the controller and let the controller get the segment from HDFS...
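The two options above boil down to a routing decision based on the scheme of the download URI. A minimal sketch of that decision, with illustrative names (this is not Pinot's actual API):

```java
import java.net.URI;

// Hypothetical sketch of the routing choice discussed above: inspect the
// scheme of the segment's download URI and pick a transport. Names here
// are made up for illustration.
public class SegmentUriRouter {

    public enum FetchRoute { DIRECT_HTTP, DIRECT_HDFS, VIA_CONTROLLER }

    public static FetchRoute route(String downloadUri) {
        String scheme = URI.create(downloadUri).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("download URI has no scheme: " + downloadUri);
        }
        switch (scheme.toLowerCase()) {
            case "http":
            case "https":
                return FetchRoute.DIRECT_HTTP;    // plain HTTP(S) pull, e.g. WebHDFS
            case "hdfs":
                return FetchRoute.DIRECT_HDFS;    // server pulls straight from HDFS
            default:
                return FetchRoute.VIA_CONTROLLER; // fall back to proxying through the controller
        }
    }
}
```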
the first option is better ... but there may be firewall restrictions between the Hadoop DC and the Pinot DC... so controller nodes fetching segments from HDFS on behalf of servers might be more scalable...
As you said, let the controller take only the URL; the controller would then handle the download and the assignment. This is the simplest way given the current code structure.
We may also think of modifying the segment push job, so we can POST the segment metadata + download URL to the controller; then the controller won't need to handle any data transmission.
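A metadata-only push like the one suggested above would send a small body instead of the segment tar. A dependency-free sketch of building such a body; the field names are invented for illustration and are not Pinot's wire format:

```java
// Sketch of the metadata-only push body: instead of the segment tar,
// the push job POSTs just the metadata plus a download URL. Field names
// are hypothetical.
public class MetadataPush {
    public static String buildPushBody(String segmentName, String tableName,
                                       String crc, String downloadUrl) {
        // Minimal hand-rolled JSON to keep the sketch self-contained.
        return String.format(
            "{\"segmentName\":\"%s\",\"tableName\":\"%s\",\"crc\":\"%s\",\"downloadUrl\":\"%s\"}",
            segmentName, tableName, crc, downloadUrl);
    }
}
```

The controller would validate the metadata and record the download URL, leaving the actual bytes in the deep store.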
@dhaval2025 if a firewall is set up between the Hadoop cluster and the Pinot cluster, then we still need to push the segment tar to the controller, store it on NFS, and let servers download it from the controller nodes.
@fx19880617 correct, but with controller nodes the whitelist will be only 3 to 5 entries; if you allow servers to pull too, the whitelist will grow to hundreds of nodes...
let's have controllers act as a proxy: servers talk to the controller, and the controller pulls from HDFS or serves a local copy...
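The proxy idea can be sketched as follows: only the controller talks to the deep store, and it caches a local copy so repeated server requests for the same segment hit the store once. This is an illustrative sketch, not Pinot's real classes:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the controller-as-proxy idea: servers ask the
// controller for a segment; the controller returns a cached local copy if
// it has one, otherwise pulls from the deep store (e.g. HDFS) once.
public class ControllerSegmentProxy {

    /** Hypothetical deep-store client; in reality this would wrap an HDFS/S3 client. */
    public interface DeepStore {
        byte[] download(String segmentName);
    }

    private final DeepStore deepStore;
    private final Map<String, byte[]> localCache = new HashMap<>();

    public ControllerSegmentProxy(DeepStore deepStore) {
        this.deepStore = deepStore;
    }

    /** Called by servers; only the controller ever talks to the deep store. */
    public synchronized byte[] serveSegment(String segmentName) {
        return localCache.computeIfAbsent(segmentName, deepStore::download);
    }
}
```

With this shape, the firewall whitelist only needs the handful of controller hosts, matching the 3-to-5-node argument above.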
we should strive to keep the logic on the servers as it is; all the heavy lifting should be done by the controller. I like the idea of supporting multiple types of pushes.
agreed
yes, I would like to keep the segment assignment logic in the controller, and just give servers flexible options to download the data, either through the controller or not.
@icefury71
Breaking this down into tasks:
This is the PR for 1. Abstract the segment fetcher layer to be pluggable, so it can support different segment fetchers (http/s3/hdfs/webhdfs): https://github.com/linkedin/pinot/pull/105
Currently Pinot relies on NFS as the storage on the Pinot controller and uses a VIP to handle controller failover, which is not easy (or even possible) for many users to set up. It would be better to leverage HDFS/S3 as the storage layer and allow servers to download segments directly from HDFS/S3.
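The pluggable fetcher layer from the linked PR can be pictured as a registry keyed by URI scheme. This is a rough sketch under assumed names (the real interface lives in the PR above):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Rough sketch of a pluggable segment-fetcher layer: fetchers are registered
// per URI scheme (http/s3/hdfs/webhdfs), and the caller looks one up by the
// scheme of the download URI. All names here are illustrative.
public class SegmentFetcherFactory {

    public interface SegmentFetcher {
        void fetchSegmentToLocal(URI uri, java.io.File dest) throws Exception;
    }

    private static final Map<String, SegmentFetcher> FETCHERS = new HashMap<>();

    public static void register(String scheme, SegmentFetcher fetcher) {
        FETCHERS.put(scheme.toLowerCase(), fetcher);
    }

    public static SegmentFetcher getFetcher(URI uri) {
        SegmentFetcher fetcher = FETCHERS.get(uri.getScheme().toLowerCase());
        if (fetcher == null) {
            throw new IllegalArgumentException("No fetcher registered for scheme: " + uri.getScheme());
        }
        return fetcher;
    }
}
```

A new backend (say, a different deep store) then only needs to register one more fetcher; neither the controller's assignment logic nor the servers' download path changes.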