CachedFaucet Maintenance

mshahriarinia commented 11 years ago

CachedFaucet is overly confusing to debug. I changed the URL in getRDDZ to http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/%s/%s to work with some files from the new corpus, and set the range to the first date-hour, which contains only a single file:

sr.addFromDate("2011-10-05")
sr.addFromHour(00)
sr.addToHour(00)

but for some reason it fetches a file from a different date-hour (2011-10-07-14):

Fetching, decrypting with GrabGPG(2011-10-07-14,social.06613a7d494eec94f8ee2aa583107a32.xz.gpg)

It took far too long to figure out what was wrong with this very simple modification. We need a central place to get parameters from, to avoid duplicating settings such as where files come from. I was also trying to modify it to work with local files, but that's a whole other story.
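
One way to picture the "central place for parameters" idea is a single small configuration value that every fetch path reads from. The sketch below is only an illustration: `FaucetConfig`, `urlTemplate`, `localRoot`, and `locationOf` are hypothetical names that do not exist in the project, and it assumes the two `%s` slots in the URL are the date-hour directory and the file name.

```scala
// Hypothetical sketch: none of these names exist in the project.
final case class FaucetConfig(
    urlTemplate: String,              // two %s slots: date-hour directory, file name (assumed)
    localRoot: Option[String] = None  // when set, files are resolved on local disk instead
) {
  /** Resolves one corpus file to either a local path or a remote URL. */
  def locationOf(dateHour: String, fileName: String): String =
    localRoot match {
      case Some(root) => s"$root/$dateHour/$fileName"
      case None       => urlTemplate.format(dateHour, fileName)
    }
}

object FaucetConfig {
  // The URL template quoted in this issue, kept in exactly one place.
  val default: FaucetConfig = FaucetConfig(
    "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/" +
      "kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/%s/%s"
  )
}
```

With something like this, switching to local files would be `FaucetConfig.default.copy(localRoot = Some("/data/kba"))` (the path is made up) rather than an edit to each place that builds a URL.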

cegme commented 11 years ago

The commit I made yesterday removed the getRDDZ function. It was not used anywhere in the code. Since nothing called getRDDZ, modifying it had no effect.

CachedFaucet calls a method in its superclass, grabGPGCompressed (described below), which does the fetching over the network. CachedFaucet makes this call from getStreamCompressed.

The StreamRange class produces all the date/filename pairs. CachedFaucet uses methods in the Faucet.scala superclass to fetch the corresponding files. There are two methods in Faucet.scala that can do this grabbing:

grabGPG: This function grabs the named file from the neo server, decrypts it, decompresses it, and returns a ByteArrayOutputStream.
grabGPGCompressed: This function grabs the named file and decrypts it, but does not decompress it. It returns the compressed stream.

I tried both of these methods because, if you decompress a file using the file system, the JVM doesn't get a chance to prepare for the large size. If we instead return the compressed file and decompress it with Java libraries, the JVM has a chance to allocate space. I thought this would help with some of the Java heap errors. (It probably slows down the system too.)
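
As a rough illustration of the "return the compressed bytes and decompress with Java libraries" route, here is a minimal sketch. It is not the code in Faucet.scala: `StreamingDecompress` and its methods are hypothetical, and GZIP from the JDK stands in for xz, which needs a third-party library. The point is only the shape: the compressed bytes are read through a small fixed buffer inside the JVM.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper, not part of the project. GZIP is a stand-in for xz.
object StreamingDecompress {
  /** Copies `in` into a ByteArrayOutputStream through a small fixed buffer. */
  def drain(in: InputStream, bufferSize: Int = 8192): ByteArrayOutputStream = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](bufferSize)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out
  }

  /** Decompresses already-fetched, already-decrypted bytes inside the JVM. */
  def decompress(compressed: Array[Byte]): ByteArrayOutputStream = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    try drain(in) finally in.close()
  }

  /** Only used to build input for the usage example. */
  def compress(raw: Array[Byte]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bytes)
    try gz.write(raw) finally gz.close()
    bytes.toByteArray
  }
}
```

Usage would look like `StreamingDecompress.decompress(StreamingDecompress.compress(data))`. Note that the decompressed bytes still end up on the heap in the returned ByteArrayOutputStream, so this alone may not remove heap errors for very large files.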

cegme commented 11 years ago

Also, if you want just one hour, you should specify the end date as well. For example:

sr.addFromDate("2011-10-05")
sr.addFromHour(00)
sr.addToHour(00)
sr.addToDate("2011-10-05")
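
To show why the end date matters, here is a small sketch of how an inclusive date-hour range expands into keys. `HourRange` is a hypothetical stand-in, not the real StreamRange, whose defaults are not shown in this thread. The assumption is simply that a range runs from a start date-hour to an end date-hour, so an end date later than the start date pulls in every hour in between.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical stand-in for StreamRange; the real class may behave differently.
final case class HourRange(fromDate: String, fromHour: Int, toDate: String, toHour: Int) {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH")

  // Dates are expected in ISO form, e.g. "2011-10-05".
  private def at(date: String, hour: Int): LocalDateTime =
    LocalDate.parse(date).atTime(hour, 0)

  /** Every date-hour key from start to end, inclusive. */
  def dateHours: List[String] = {
    val start = at(fromDate, fromHour)
    val end = at(toDate, toHour)
    Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(t => !t.isAfter(end))
      .map(_.format(fmt))
      .toList
  }
}
```

With both ends pinned, `HourRange("2011-10-05", 0, "2011-10-05", 0).dateHours` yields only `2011-10-05-00`. If the end date were instead some later date, for example `2011-10-07`, the expansion would also cover later hours.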

cegme / gatordsr

CachedFaucet Maintenance #43