comp413-2017 / RDFS

The Rice Comp413 2017 class's continuation of the 2016 RDFS work.

RandomWriter HDFS Example Fails #3

Open cannontwo opened 7 years ago

cannontwo commented 7 years ago

Background: RandomWriter

RandomWriter is one of the simplest programs that RDFS should be able to support: it simply writes 10 GB of random data to each DataNode. However, attempting to run this Hadoop example program produces opaque errors about failing to get blocks for writing, along with an array indexing error that probably should not be exposed to users. Fixing this issue may bring us closer to understanding why more complicated example programs such as Teragen do not currently work.

Sample Output

See out.txt.

Reproducing

Inside the Vagrant machine for RDFS, first start ZooKeeper, a Rice-NameNode, and a Rice-DataNode. Then, in a separate terminal window also within the Vagrant machine, change directory to /home/vagrant/hadoop/share/hadoop/mapreduce. Running yarn jar hadoop-mapreduce-examples-2.8.1.jar randomwriter /random should then reproduce this error.
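For convenience, the steps above collapsed into one shell session. The daemon startup lines are placeholders, since the exact commands depend on how ZooKeeper and the Rice-NameNode/Rice-DataNode are launched in your checkout:

    # Terminal 1 (and friends): start ZooKeeper, a Rice-NameNode, and a
    # Rice-DataNode using whatever startup commands your RDFS setup provides.

    # Separate terminal, still inside the Vagrant machine:
    cd /home/vagrant/hadoop/share/hadoop/mapreduce
    yarn jar hadoop-mapreduce-examples-2.8.1.jar randomwriter /random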

Resolution Criteria

Resolving this issue consists of identifying the problem causing this error, submitting a pull request that fixes it or explaining in a comment how to fix it, and demonstrating a successful run of RandomWriter.

cannontwo commented 7 years ago

We (cmj2) believe that this bug arises because the NameNode does not recover when the only DataNode holding a block goes down.

cannontwo commented 7 years ago

Upon further investigation, running RandomWriter seems to leave RDFS in a persistently broken state. After running RandomWriter (or Teragen), all subsequent attempts to read or write data in RDFS throw ArrayIndexOutOfBoundsExceptions, and only vagrant destroy; vagrant up seems to clear the problem. See randomwriter_before.txt, randomwriter_after.txt, and this output from copyFromLocal (which worked before running RandomWriter): copyFromLocal_out.txt. I hypothesize that this error stems from DataNode reads timing out during RandomWriter, and from the NameNode not recovering gracefully from the disappearance of the only DataNode holding a block.
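One way to probe that hypothesis is to ask the NameNode, after a failed run, where it thinks the blocks under /random live. Assuming the Rice-NameNode answers the standard HDFS fsck request (which RDFS may well not implement, so treat this as a sketch rather than a guaranteed diagnostic), the stock Hadoop tooling would be:

    # Ask the NameNode which DataNodes it reports for each block under /random.
    # If the hypothesis is right, blocks should show zero or stale locations
    # after the failed RandomWriter run.
    hdfs fsck /random -files -blocks -locations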

ghost commented 7 years ago

I can confirm that last comment, and it makes debugging this pretty hard. Sources from last year have admitted there are likely NameNode problems when there are zero DataNodes (not just when there are zero DataNodes holding a specific block). I also checked the "backing store" of the target DataNode. It does hold a ton of 'random' data, despite the application technically failing and introducing a persistent bug.
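To check the zero-DataNodes angle after a failed run, one option is the standard Hadoop admin report, again assuming the Rice-NameNode speaks enough of the stock HDFS client protocol for it to work; the DataNode storage path below is a placeholder, not the actual RDFS location:

    # How many DataNodes does the NameNode currently consider live or dead?
    hdfs dfsadmin -report

    # Rough size of the DataNode's on-disk backing store (placeholder path;
    # substitute wherever the Rice-DataNode actually keeps its block data).
    du -sh /path/to/rice-datanode/storage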

Next steps:

Suggestions from last year's team:

adamnsm1 commented 7 years ago

Also managed to successfully reproduce the bug a couple of times. I followed up on Eddie's next steps, and from reading the Vagrantfile it looks to me like the default VM is set up with only 1 GB of RAM and a 2 GB data disk:

        # Give 1 gb of ram to the vm, may change if it's not enough
        v.customize ["modifyvm", :id, "--memory", 1024]
        v.customize ["setextradata", :id,
            "VBoxInternal2/SharedFoldersEnableSymlinksCreate//home/vagrant/rdfs", "1"]

        unless File.exist?(file_to_disk)
            v.customize ['createhd', '--filename', file_to_disk, '--size', 2 * 1024]
        end
        v.customize ['storageattach', :id, '--storagectl', 'SATA Controller',
            '--port', 1, '--device', 0, '--type', 'hdd', '--medium', file_to_disk]

I attempted to grow that disk to 20 GB by changing the constant value in the line "v.customize ['createhd', '--filename', file_to_disk, '--size', 2 * 1024]", then halting my VM and rebooting it. I still encountered the same bug. In hindsight, the createhd call is guarded by unless File.exist?(file_to_disk), so changing the size constant does nothing while the existing disk file is still present; I will now try destroying and re-creating the virtual machine entirely with the same settings.

Additionally, I think raw_storage.vdi holds the VM's persistent disk storage, so it's possible that resetting it to zeroes could fix our persistent error without having to re-create the virtual machine. Next time I reproduce the bug I'll try that instead.

ghost commented 7 years ago

Do we know what raw_storage.vdi actually is? I could only find it in the .gitignore.

adamnsm1 commented 7 years ago

I know it goes from all 0's to some data followed by all 0's after I run the test, which is why I'm guessing it's the VM's disk storage. I haven't read about it in the documentation, though.
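For what it's worth, the Vagrantfile snippet quoted above creates file_to_disk with createhd and attaches it as a second SATA drive, so raw_storage.vdi is most likely that extra virtual disk (assuming file_to_disk resolves to raw_storage.vdi, which the .gitignore entry suggests but the snippet doesn't show). A way to confirm from the host with standard VBoxManage commands:

    # List the virtual disks VirtualBox knows about and check whether
    # raw_storage.vdi is attached to the RDFS VM.
    VBoxManage list hdds

    # Show size, variant, and attachment details for the file itself;
    # adjust the path to wherever the Vagrantfile actually puts it.
    VBoxManage showmediuminfo disk raw_storage.vdi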