deanwampler / spark-scala-tutorial

A free tutorial for Apache Spark.

Files do not get processed in SparkStreaming11Main when a file is dropped manually. #21

Closed wpoosanguansit closed 8 years ago

wpoosanguansit commented 8 years ago

Hi,

I am testing out the code on a Windows 7 machine, and the example for SparkStreaming11Main seems to work fine. However, when I comment out line 93:

startDirectoryDataThread(in, data)

and drop files manually into the streaming-input folder, nothing gets processed. Do we have to set anything up to make it work manually? Thanks for your help.
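
For reference, here is a minimal sketch of the kind of directory-watching stream the example sets up (not the tutorial's exact code; the path and batch interval are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingSketch")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // textFileStream only picks up files whose timestamps fall inside the
    // current batch window, i.e., files that look "new" to Spark.
    val lines = ssc.textFileStream("tmp/streaming-input")
    lines.foreachRDD(rdd => rdd.take(10).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```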

deanwampler commented 8 years ago

Check the timestamps of the files. They need to be newer than the timestamp of the last Spark minibatch job, or Spark will ignore them. I don't know Windows, but if you move the file, it might keep its old timestamp. Copying the file should work better; that's essentially what the startDirectoryDataThread code does. (I'm assuming that's working for you.)
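
If a plain copy still keeps the old timestamp on your platform, a workaround sketch is to stamp the copied file with the current time explicitly, using java.nio (the paths here are illustrative, not part of the tutorial):

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.nio.file.attribute.FileTime

object FreshCopy {
  def main(args: Array[String]): Unit = {
    val source = Paths.get("data/kjvdat.txt")
    val target = Paths.get("tmp/streaming-input/1.txt")
    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING)
    // Force a fresh modification time, regardless of what the OS copy did,
    // so Spark's next minibatch scan treats the file as new.
    Files.setLastModifiedTime(target, FileTime.fromMillis(System.currentTimeMillis()))
  }
}
```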

wpoosanguansit commented 8 years ago

Thanks for your help. That doesn't seem to be the case, because I do copy the files over. Furthermore, I tested it on Ubuntu and it worked fine there. So it looks like something specific to Windows, but I'm not sure how to resolve it at the moment.

deanwampler commented 8 years ago

Thanks for investigating further. I'm not sure what to suggest at the moment.

deanwampler commented 8 years ago

I looked into this in a Windows 8 environment. If you use copy data\kjvdat.txt tmp\streaming-input\1.txt (for example), the target file keeps the same creation and modification times as the source, so Spark doesn't recognize it as new. However, if you use more < data\kjvdat.txt > tmp\streaming-input\1.txt, the target gets new creation and modification times, so Spark does consider it new. This is effectively what DataDirectoryServer does as well. So, I'm going to close this one.
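
If anyone hits this again, a quick way to confirm which behavior your copy command gave you is to print the file's attributes from the JVM. A small diagnostic sketch (the path is illustrative):

```scala
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributes

object StampCheck {
  def main(args: Array[String]): Unit = {
    val path  = Paths.get("tmp/streaming-input/1.txt")
    // Read the basic attributes to see whether the copy produced fresh times.
    val attrs = Files.readAttributes(path, classOf[BasicFileAttributes])
    println(s"created:  ${attrs.creationTime}")
    println(s"modified: ${attrs.lastModifiedTime}")
  }
}
```

If both times predate the last minibatch, Spark will skip the file, which matches the copy-vs-more behavior described above.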