cannontwo opened this issue 7 years ago
We (cmj2) believe that this bug arises because the NameNode does not recover when the only DataNode holding a block goes down.
Upon further investigation, running RandomWriter seems to introduce a persistent bug into RDFS that affects all subsequent calls. After running RandomWriter (or Teragen), subsequent attempts to read or write data in RDFS persistently throw ArrayIndexOutOfBoundsExceptions, and only vagrant destroy; vagrant up seems to clear the bad state. See randomwriter_before.txt, randomwriter_after.txt, and this output from copyFromLocal (which worked before running RandomWriter): copyFromLocal_out.txt. I hypothesize that this error stems from DataNode reads timing out during RandomWriter, and the NameNode not recovering gracefully from the disappearance of the only DataNode holding a block.
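If that hypothesis holds, the ArrayIndexOutOfBoundsException is probably the Hadoop client indexing into a block-location response whose replica list is empty or stale. Below is a minimal sketch, in the spirit of the C++ NameNode code, of the kind of guard that could sit in the block-lookup path; the names (get_live_block_locations, BlockLocation, is_alive) are hypothetical and not actual RDFS identifiers.

// Hypothetical sketch only: when every replica of a block is on a dead
// DataNode, report "no locations" explicitly instead of returning an
// empty or stale replica list that the client may index out of bounds.
// All names here are illustrative, not actual RDFS code.
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct BlockLocation {
    uint64_t block_id;
    std::vector<std::string> datanode_ids;  // DataNodes believed to hold the block
};

// Assumed helper: true if the DataNode has sent a recent heartbeat.
bool is_alive(const std::string& datanode_id);

std::optional<BlockLocation> get_live_block_locations(const BlockLocation& recorded) {
    BlockLocation live{recorded.block_id, {}};
    for (const auto& dn : recorded.datanode_ids) {
        if (is_alive(dn)) {
            live.datanode_ids.push_back(dn);
        }
    }
    if (live.datanode_ids.empty()) {
        return std::nullopt;  // caller should surface a clean "block unavailable" error
    }
    return live;
}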
I can verify that last comment. It makes debugging this pretty hard. Sources from last year have admitted there are likely NameNode problems when there are zero DataNodes (not just when there are zero DataNodes holding a specific block). I also checked the "backing store" of the target DataNode. It does hold a ton of 'random' data, despite the application technically failing and introducing a persistent bug.
Next steps:
- Figure out how to recover from this state without a full vagrant destroy; vagrant up. This will speed up development time for the rest of the semester.

Suggestions from last year's team:
- NativeFS::NativeFS(std::string fname)
Also managed to successfully reproduce the bug a couple of times. I followed up on Eddie's next steps, and from reading the Vagrantfile it looks to me like the default VM is set up with only 1 GB of RAM and a 2 GB extra virtual disk:
# Give 1 gb of ram to the vm, may change if it's not enough
v.customize ["modifyvm", :id, "--memory", 1024]
v.customize ["setextradata", :id,
"VBoxInternal2/SharedFoldersEnableSymlinksCreate//home/vagrant/rdfs", "1"]
unless File.exist?(file_to_disk)
  v.customize ['createhd', '--filename', file_to_disk, '--size', 2 * 1024]
end
v.customize ['storageattach', :id, '--storagectl', 'SATA Controller', '--port', 1, '--device', 0, '--type', 'hdd', '--medium', file_to_disk]
I attempted to grow the extra disk to 20 GB by changing the size constant in the line "v.customize ['createhd', '--filename', file_to_disk, '--size', 2 * 1024]", then halting my VM and rebooting it. Still encountered the same bug. (One likely reason the change had no effect: the createhd line is guarded by unless File.exist?(file_to_disk), so the size constant only matters the first time the disk file is created.) Now I will try destroying and re-creating the virtual machine entirely with the same settings.
Additionally, I think raw_storage.vdi contains the persistent storage of the virtual machine, so it's possible that zeroing it out could clear our persistent error without having to re-create the virtual machine. Next time I reproduce the bug I'll try doing that instead.
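If zeroing the backing store turns out to be the fix, a small helper like the one below could do it from inside the VM without re-creating the machine. This is just a sketch; the default path (/dev/sdb) is a guess at where the extra disk gets attached, so pass the real device path explicitly.

// Sketch: overwrite the attached block device with zeros so the DataNode
// starts from a clean state. The default path is an assumption; run inside
// the VM and pass the actual device as argv[1].
// Intended for a block device; on a regular file this would keep growing it
// until the disk is full.
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char** argv) {
    const char* path = (argc > 1) ? argv[1] : "/dev/sdb";  // assumed attach point
    // Open without truncation so this also works on a block device.
    std::ofstream out(path, std::ios::binary | std::ios::in | std::ios::out);
    if (!out) {
        std::fprintf(stderr, "could not open %s for writing\n", path);
        return 1;
    }
    std::vector<char> zeros(1 << 20, 0);  // 1 MiB of zeros per write
    while (out.write(zeros.data(), zeros.size())) {
        // keep writing until the device is full
    }
    std::printf("finished zeroing %s\n", path);
    return 0;
}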
Do we know what raw_storage.vdi actually is? I could only find it in the .gitignore.
I know it goes from all 0's to some data followed by all 0's after I run the test, which is why I'm guessing it's the VM's disk. I haven't read about it in the documentation, though.
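One way to confirm that guess next time: scan the attached device (or the backing file) and report where the nonzero data ends. A rough sketch is below; the default path is again just an assumption.

// Sketch: scan a file or block device and report the offset of the last
// nonzero byte, to check whether RandomWriter actually wrote data into
// the backing store. The default path is an assumption.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main(int argc, char** argv) {
    const char* path = (argc > 1) ? argv[1] : "/dev/sdb";  // assumed attach point
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        std::fprintf(stderr, "could not open %s\n", path);
        return 1;
    }
    std::vector<char> buf(1 << 20);
    uint64_t offset = 0;
    uint64_t last_nonzero = 0;
    bool any_nonzero = false;
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        std::streamsize n = in.gcount();
        for (std::streamsize i = 0; i < n; ++i) {
            if (buf[i] != 0) {
                last_nonzero = offset + static_cast<uint64_t>(i);
                any_nonzero = true;
            }
        }
        offset += static_cast<uint64_t>(n);
    }
    if (any_nonzero) {
        std::printf("scanned %llu bytes; last nonzero byte at offset %llu\n",
                    (unsigned long long)offset, (unsigned long long)last_nonzero);
    } else {
        std::printf("scanned %llu bytes; all zeros\n", (unsigned long long)offset);
    }
    return 0;
}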
Background: RandomWriter
RandomWriter is one of the simplest programs that RDFS should be able to support: it simply writes 10 GB of random data per DataNode. However, attempting to run this Hadoop example program produces opaque errors about failing to get blocks for writing, as well as an array indexing error that probably should not be exposed to users. Fixing this issue may get us closer to understanding why more complicated example programs such as Teragen do not currently work.
Sample Output
See out.txt.
Reproducing
Inside the Vagrant machine for RDFS, first start ZooKeeper, a Rice-NameNode, and a Rice-DataNode. Then, in a separate terminal window (also inside the Vagrant machine), change directory to /home/vagrant/hadoop/share/hadoop/mapreduce. Running

yarn jar hadoop-mapreduce-examples-2.8.1.jar randomwriter /random

should then reproduce this error.

Resolution Criteria
Resolving this issue consists of identifying the problem causing this error, submitting a pull request that fixes it or explaining in a comment how to fix it, and demonstrating a successful run of RandomWriter.