Normalize hdfs-site.xml across HA and non-HA cases

apache / fluo-muchos

Apache Fluo Muchos

https://fluo.apache.org

Apache License 2.0


Normalize hdfs-site.xml across HA and non-HA cases #356

Closed arvindshmicrosoft closed 4 years ago

arvindshmicrosoft commented 4 years ago

Explicitly specify the data directory for the non-HA checkpoint (secondary namenode) to use (in a non-HA config)
Rearrange some elements in the hdfs-site.xml file in a more logical order
Use an HDFS nameservice (identified by nameservice_id) in non-HA cases as well (the HA config already uses it). This change allows using the nameservice_id as a stable and simple way to reference namenodes regardless of whether HA is used
Configure the use of ConfiguredFailoverProxyProvider in non-HA cases as well (HA config already used it), so that namespace can be resolved to physical namenodes in all cases

Reference: the Hadoop docs. Although that document covers HDFS federation, the change in this PR is orthogonal to federation and merely aims to normalize hdfs-site.xml across HA and standalone configurations.
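To make the normalization concrete, here is a minimal Python sketch of the properties such a normalized hdfs-site.xml might carry. It is not the Muchos template itself: the function name, the `nn1`/`nn2` namenode IDs, the hostnames, and the checkpoint path are assumptions for illustration. The property names and the ConfiguredFailoverProxyProvider class name are standard Hadoop ones.

```python
# Illustrative sketch only; this is NOT the actual Muchos hdfs-site.xml template.
FAILOVER_PROVIDER = (
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)


def hdfs_site_properties(nameservice_id, namenodes, checkpoint_dir=None, rpc_port=8020):
    """Build hdfs-site.xml properties for a single nameservice.

    namenodes: list of namenode hostnames; a single entry models the non-HA case.
    checkpoint_dir: hypothetical local directory for the secondary namenode,
    only set when there is exactly one namenode (non-HA).
    """
    # Hypothetical logical namenode IDs: nn1, nn2, ...
    nn_ids = [f"nn{i + 1}" for i in range(len(namenodes))]
    props = {
        "dfs.nameservices": nameservice_id,
        f"dfs.ha.namenodes.{nameservice_id}": ",".join(nn_ids),
        # Same proxy provider in both cases, so clients can resolve the
        # nameservice to physical namenodes whether or not HA is enabled.
        f"dfs.client.failover.proxy.provider.{nameservice_id}": FAILOVER_PROVIDER,
    }
    for nn_id, host in zip(nn_ids, namenodes):
        props[f"dfs.namenode.rpc-address.{nameservice_id}.{nn_id}"] = f"{host}:{rpc_port}"
    if len(namenodes) == 1 and checkpoint_dir:
        # Non-HA only: pin the checkpoint directory instead of relying on
        # the default under hadoop.tmp.dir.
        props["dfs.namenode.checkpoint.dir"] = checkpoint_dir
    return props


if __name__ == "__main__":
    # Hostname, nameservice ID, and path are made up for the example.
    for key, value in hdfs_site_properties(
        "muchos", ["leader1"], checkpoint_dir="/data0/hadoop/namesecondary"
    ).items():
        print(f"{key} = {value}")
```

The point of the sketch is that the HA and non-HA outputs differ only in the number of namenode entries and in the checkpoint directory, which is what lets clients address the cluster by nameservice in both cases.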

keith-turner commented 4 years ago

This change allows using the nameservice_id as a stable and simple way to reference namenodes regardless of whether HA is used

@arvindshmicrosoft with this change would it be ok to always use the nameservice id for hdfs_root? I looked around for where it was set and found the following in ./lib/muchos/config/base.py.

    "hdfs_root": (
        "{% if hdfs_ha %}hdfs://{{ nameservice_id }}{% else %}"
        "hdfs://{{ groups['namenode'][0] }}:8020{% endif %}"
    ),
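For readers less familiar with Jinja, the branching in that template can be restated in plain Python as below, next to the simplified form being proposed. This is a sketch: the function names and the example hostname and nameservice ID are hypothetical, and the real value is rendered by the templating engine, not by code like this.

```python
def hdfs_root_current(hdfs_ha, nameservice_id, namenodes):
    """Plain-Python restatement of the existing hdfs_root template logic."""
    if hdfs_ha:
        # HA: address the cluster by its logical nameservice.
        return f"hdfs://{nameservice_id}"
    # Non-HA: address the first namenode directly on port 8020.
    return f"hdfs://{namenodes[0]}:8020"


def hdfs_root_proposed(nameservice_id):
    """Proposed form: always use the nameservice, HA or not."""
    return f"hdfs://{nameservice_id}"


if __name__ == "__main__":
    # "muchos" and "leader1" are made-up example values.
    print(hdfs_root_current(False, "muchos", ["leader1"]))
    print(hdfs_root_current(True, "muchos", ["leader1", "leader2"]))
    print(hdfs_root_proposed("muchos"))
```

With the proposed form, the root URI no longer depends on which host happens to be first in the namenode group.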

keith-turner commented 4 years ago

Explicitly specify the data directory for the non-HA checkpoint (secondary namenode) to use (in a non-HA config)

Was the secondary NN not working before this change?

arvindshmicrosoft commented 4 years ago

Thank you @keith-turner, appreciate your review. Re: the 2 points you mentioned:

Agreed, it should now be possible to just use the nameservice_id instead of the namenode[0]. I will push up an update.
The secondary namenode was working fine previously; it just placed its data in a location under /tmp (the default specified in hdfs-default.xml). Consequently, muchos wipe would not clean up this folder, and I had noticed some weird problems between setup / wipe / setup cycles. Now that the data is placed under the worker_data_dirs, the wipe command does clean up its contents.
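The wipe behaviour described here can be modelled with a small path check. This is only a sketch under the assumption that wipe cleans what lies under the configured data directories; the helper name, the user name in the /tmp path, and the data directory names are hypothetical. The default location follows from Hadoop's defaults, where dfs.namenode.checkpoint.dir resolves under hadoop.tmp.dir, which itself defaults to /tmp/hadoop-${user.name}.

```python
from pathlib import PurePosixPath


def is_under_any(path, roots):
    """Return True if path equals, or is nested under, one of roots."""
    p = PurePosixPath(path)
    return any(PurePosixPath(r) == p or PurePosixPath(r) in p.parents for r in roots)


if __name__ == "__main__":
    # Hypothetical data directories and user name, for illustration only.
    worker_data_dirs = ["/data0", "/data1"]
    default_checkpoint = "/tmp/hadoop-centos/dfs/namesecondary"
    explicit_checkpoint = "/data0/hadoop/namesecondary"

    # Before the change: checkpoint data sits outside the data dirs.
    print(is_under_any(default_checkpoint, worker_data_dirs))
    # After the change: checkpoint data sits inside a data dir.
    print(is_under_any(explicit_checkpoint, worker_data_dirs))
```

Under that assumption, stale checkpoint data left in /tmp is the kind of leftover state that could survive a wipe and interfere with the next setup.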

arvindshmicrosoft commented 4 years ago

Agreed, it should now be possible to just use the nameservice_id instead of the namenode[0]. I will push up an update.

@keith-turner I realized an issue which will prevent changing hdfs_root to use the nameservice_id right away: Since Fluo 1.2.0 does not have the recent fix to ensure hdfs-site.xml is loaded, switching hdfs_root to the nameservice will cause problems for running Fluo using Muchos. So here is what I propose:

We merge this PR in as it is. There will be no breaking changes as a result of just this change.
In a subsequent PR, I will add a task to check if Fluo 1.x is being run, and in that case, patch fluo-env.sh to the equivalent of the other fluo-env.sh fix
In a third PR after the above one merges, I will change hdfs_root to use the nameservice_id in all cases
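The version check in the second step could look roughly like the sketch below. It only illustrates the gating decision, assuming the Fluo version is available as a dotted string; the function name is hypothetical, and the actual fluo-env.sh patch is not reproduced here.

```python
def needs_fluo_env_patch(fluo_version):
    """Return True when the configured Fluo release is a 1.x version.

    Assumption: fluo_version is a dotted version string such as "1.2.0".
    """
    major = fluo_version.strip().split(".", 1)[0]
    return major == "1"


if __name__ == "__main__":
    for version in ("1.2.0", "2.0.0"):
        print(version, needs_fluo_env_patch(version))
```

Gating on the major version keeps the patch away from releases that already load hdfs-site.xml correctly.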

I believe this will be a clean and safe way to proceed. Please let me know if you foresee any issues. If you are good, I will merge this PR and create tracking issues for the 2 follow-ups as mentioned above.

keith-turner commented 4 years ago

I believe this will be a clean and safe way to proceed. Please let me know if you foresee any issues. If you are good

I like that plan and I think this is ready to be merged as is.

arvindshmicrosoft commented 4 years ago

Merging this as per above discussion

#357 is the first of the 2 follow-up PRs; it fixes fluo-env.sh so that Fluo runs with an HDFS nameservice-based DFS root.
#358 simplifies hdfs_root as per the above discussion. It should be merged only after #357 is merged.
