apache / fluo-muchos

Apache Fluo Muchos
https://fluo.apache.org
Apache License 2.0

Normalize hdfs-site.xml across HA and non-HA cases #356

Closed: arvindshmicrosoft closed this pull request 4 years ago

arvindshmicrosoft commented 4 years ago

Reference from the Hadoop docs. Although that doc covers HDFS federation, the specific change in this PR is orthogonal to federation and merely aims to normalize hdfs-site.xml across both HA and standalone configurations.
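For illustration only, the Python sketch below shows the kind of nameservice-keyed properties the HDFS federation docs describe for a non-HA namenode, so that the same keys appear in hdfs-site.xml whether or not HA is enabled. It is not the PR's actual template: the function names, the nameservice ID `mycluster`, the host names, and the ports (assumed Hadoop 3 defaults: 8020 for namenode RPC, 9868 for the secondary namenode HTTP endpoint) are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def non_ha_nameservice_properties(
    nameservice_id: str, namenode_host: str, secondary_host: str
) -> Dict[str, str]:
    """Nameservice-keyed hdfs-site.xml properties for a single, non-HA namenode.

    Hypothetical helper: property names follow the HDFS federation docs,
    ports are assumed Hadoop 3 defaults.
    """
    return {
        # Declare one logical nameservice, just as an HA config would.
        "dfs.nameservices": nameservice_id,
        # Key the namenode RPC address by the nameservice ID.
        f"dfs.namenode.rpc-address.{nameservice_id}": f"{namenode_host}:8020",
        # Key the secondary namenode (checkpointer) HTTP address the same way.
        f"dfs.namenode.secondary.http-address.{nameservice_id}": f"{secondary_host}:9868",
    }


def to_configuration_xml(props: Dict[str, str]) -> str:
    """Render properties as a Hadoop-style <configuration> element."""
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")


if __name__ == "__main__":
    print(to_configuration_xml(
        non_ha_nameservice_properties("mycluster", "leader1", "leader2")))
```

In an HA config the same `dfs.nameservices` entry would instead be paired with per-namenode keys (`dfs.ha.namenodes.<nameservice_id>` and friends), which is what makes the nameservice ID a common anchor for both cases.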

keith-turner commented 4 years ago

This change allows using the nameservice_id as a stable and simple way to reference namenodes regardless of whether HA is used or not.

@arvindshmicrosoft, with this change would it be OK to always use the nameservice id for hdfs_root? I looked around for where it is set and found the following in ./lib/muchos/config/base.py.

    "hdfs_root": (
        "{% if hdfs_ha %}hdfs://{{ nameservice_id }}{% else %}"
        "hdfs://{{ groups['namenode'][0] }}:8020{% endif %}"
    ),
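A plain-Python mirror of that Jinja expression makes the two branches easier to compare with the proposal to always use the nameservice ID. This is a hypothetical helper, not code from Muchos; the argument names mirror the template variables, and `mycluster` / `leader1` are made-up values.

```python
from typing import List


def current_hdfs_root(hdfs_ha: bool, nameservice_id: str, namenodes: List[str]) -> str:
    """Mirror of the hdfs_root template in ./lib/muchos/config/base.py."""
    if hdfs_ha:
        # HA: the logical nameservice is resolved through hdfs-site.xml.
        return f"hdfs://{nameservice_id}"
    # Non-HA: address the first host in the 'namenode' group directly on port 8020.
    return f"hdfs://{namenodes[0]}:8020"


def proposed_hdfs_root(nameservice_id: str) -> str:
    """Proposed: one stable URI for both HA and non-HA clusters."""
    return f"hdfs://{nameservice_id}"
```

For a non-HA cluster, `current_hdfs_root(False, "mycluster", ["leader1"])` returns `"hdfs://leader1:8020"`, while `proposed_hdfs_root("mycluster")` returns `"hdfs://mycluster"` in both cases. The proposed URI presumably only resolves for clients that actually load hdfs-site.xml.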
keith-turner commented 4 years ago

Explicitly specify the data directory for the checkpoint (secondary namenode) to use in a non-HA config

Was the secondary NN not working before this change?

arvindshmicrosoft commented 4 years ago

Thank you @keith-turner, appreciate your review. Re: the 2 points you mentioned:

  1. Agreed, it should now be possible to just use the nameservice_id instead of the namenode[0]. I will push up an update.
  2. The secondary namenode was working fine previously, but it placed its data in a location under /tmp (as specified in hdfs-default.xml). Consequently, muchos wipe would not clean up this folder, and I had noticed some weird problems across setup / wipe / setup cycles. Now that the data is placed under the worker_data_dirs, the wipe command does clean it up.
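As a rough illustration of point 2: `dfs.namenode.checkpoint.dir` defaults to `file://${hadoop.tmp.dir}/dfs/namesecondary` in hdfs-default.xml, and `hadoop.tmp.dir` defaults to `/tmp/hadoop-${user.name}` in core-default.xml, which is how the checkpoint ends up under /tmp. The sketch below builds an override that points at the worker data directories instead. The `hadoop/dfs/namesecondary` sub-path, the `/data0`-style mount points, and listing every data directory (the property accepts a comma-separated list and replicates the image to each entry) are assumptions, not necessarily what the PR does.

```python
import xml.etree.ElementTree as ET
from typing import List


def checkpoint_dir_property(worker_data_dirs: List[str]) -> str:
    """Build a <property> overriding dfs.namenode.checkpoint.dir.

    Hypothetical layout: one 'hadoop/dfs/namesecondary' directory under each
    worker data dir, so wiping those dirs also removes the checkpoint data.
    """
    value = ",".join(
        f"file://{d.rstrip('/')}/hadoop/dfs/namesecondary" for d in worker_data_dirs
    )
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "dfs.namenode.checkpoint.dir"
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")
```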
arvindshmicrosoft commented 4 years ago
  1. Agreed, it should now be possible to just use the nameservice_id instead of the namenode[0]. I will push up an update.

@keith-turner I realized there is an issue that prevents changing hdfs_root to use the nameservice_id right away: since Fluo 1.2.0 does not have the recent fix to ensure hdfs-site.xml is loaded, switching hdfs_root to the nameservice will cause problems when running Fluo using Muchos. So here is what I propose:

  1. We merge this PR in as it is. There will be no breaking changes as a result of just this change.
  2. In a subsequent PR, I will add a task to check whether Fluo 1.x is being run and, in that case, patch fluo-env.sh with the equivalent of the other fluo-env.sh fix.
  3. In a third PR, after the above one merges, I will change hdfs_root to use the nameservice_id in all cases.
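Step 2 of this plan hinges on detecting a Fluo 1.x install. In Muchos the check would presumably live in an Ansible task; the plain-Python sketch below shows only the gating logic. The function name and the version-string format are assumptions, and the fluo-env.sh patch itself is deliberately not sketched.

```python
def is_fluo_1x(fluo_version: str) -> bool:
    """Return True for Fluo 1.x releases such as '1.2.0'.

    Hypothetical helper: only the major version component is inspected, so
    '10.0.0' or '2.0.0-SNAPSHOT' are not treated as 1.x.
    """
    major = fluo_version.strip().split(".", 1)[0]
    return major == "1"
```

Comparing the parsed major component, rather than testing `startswith("1")`, avoids misclassifying a hypothetical 10.x release.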

I believe this will be a clean and safe way to proceed. Please let me know if you foresee any issues. If you are good, I will merge this PR and create tracking issues for the 2 follow-ups as mentioned above.

keith-turner commented 4 years ago

I believe this will be a clean and safe way to proceed. Please let me know if you foresee any issues. If you are good

I like that plan, and I think this is ready to be merged as is.

arvindshmicrosoft commented 4 years ago

Merging this as per the above discussion.