NICTA / scoobi

A Scala productivity framework for Hadoop.
http://nicta.github.com/scoobi/

DataSink.outputSetup not called by inmemorymode #217

Closed espringe closed 11 years ago

espringe commented 11 years ago

This is actually quite dangerous, since outputSetup is what deletes the old output directory (when you have overwrite=true). If in-memory mode writes without deleting the old directory, it writes over the old files. That on its own is fine. But if the file names don't match exactly, you end up with more files than you expect in the directory (and thus more items when you go to read it), which is nasty. [In my case, I was like "wtf? Why am I training with 2x the amount of data in my training set?", and I only caught it by luck.]
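To make the failure mode concrete, here is a minimal sketch that does not use Scoobi at all. The names `StaleOutputDemo`, `outputSetup`, `writeRun` and the part-file names are hypothetical stand-ins I made up for illustration; only the behaviour (skipping the cleanup step leaves stale files behind) mirrors what is described above.

```scala
import java.io.File
import java.nio.file.Files

object StaleOutputDemo {
  // Hypothetical stand-in for a sink's setup step:
  // when overwrite is set, clear whatever an earlier run left behind.
  def outputSetup(dir: File, overwrite: Boolean): Unit =
    if (overwrite && dir.exists()) {
      Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach(_.delete())
    }

  // Hypothetical stand-in for one run writing its output files.
  def writeRun(dir: File, fileNames: Seq[String], callSetup: Boolean): Unit = {
    if (callSetup) outputSetup(dir, overwrite = true)
    dir.mkdirs()
    fileNames.foreach { name =>
      Files.write(new File(dir, name).toPath, "record\n".getBytes("UTF-8"))
    }
  }

  def fileCount(dir: File): Int =
    Option(dir.listFiles()).map(_.length).getOrElse(0)

  // Two runs into the same directory; the second uses different file names.
  def demo(callSetup: Boolean): Int = {
    val dir = Files.createTempDirectory("sink-demo").toFile
    writeRun(dir, Seq("part-00000", "part-00001"), callSetup = true)
    writeRun(dir, Seq("out-r-00000", "out-r-00001"), callSetup)
    fileCount(dir)
  }
}
```

With the setup step skipped on the second run, the directory holds the files from both runs, so a reader that globs the directory sees twice the data; with the setup step called, only the second run's files remain.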

Also, instead of having outputCheck, outputConfigure and outputSetup that all get called, perhaps it's better to have the whole thing in just one function? That would make it simpler and less prone to mistakes like this.
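Roughly what I have in mind, as a sketch only. `SingleHookSink` and `prepare` are hypothetical names, not Scoobi's DataSink API; the point is that an execution mode has exactly one method to call, so it cannot forget one of the three steps.

```scala
// Hypothetical sketch, not Scoobi's DataSink API.
trait SingleHookSink {
  protected def check(): Unit      // e.g. fail if the output exists and overwrite is off
  protected def configure(): Unit  // e.g. set job configuration
  protected def setup(): Unit      // e.g. delete the old output directory

  // The only method an execution mode needs to call.
  final def prepare(): Unit = { check(); configure(); setup() }
}

object SingleHookDemo {
  // Records the order in which the steps run when prepare() is called once.
  def run(): List[String] = {
    val calls = scala.collection.mutable.ListBuffer.empty[String]
    val sink = new SingleHookSink {
      protected def check(): Unit = { calls += "check"; () }
      protected def configure(): Unit = { calls += "configure"; () }
      protected def setup(): Unit = { calls += "setup"; () }
    }
    sink.prepare()
    calls.toList
  }
}
```

Sinks could still implement the three steps separately; the difference is that every mode (in-memory, local, cluster) goes through the same single entry point.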