Open mshahriarinia opened 11 years ago
My commit yesterday removed the getRDDZ function. It was not used anywhere in the code, and since nothing called getRDDZ, modifying it would have been a waste of time.
CachedFaucet calls getGPGCompressed, a method in its superclass, which does the fetching over the network. CachedFaucet makes this call from getStreamCompressed.
The StreamRange class gets all the date/fileName pairs. CachedFaucet uses methods in its superclass, defined in Faucet.scala, to fetch these values. There are two methods in Faucet.scala that could do the fetching.
I tried both of these methods because, if you decompress a file using the file system, the JVM doesn't get a chance to prepare for the large decompressed size. If we instead return the compressed file and decompress it with Java libraries, the JVM has a chance to allocate space. I thought this would help with some of the Java heap errors (though it probably slows the system down too).
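As a rough illustration of the second approach (keeping the bytes compressed and decompressing inside the JVM through a stream), here is a minimal sketch. It is not the code in Faucet.scala. The corpus files are `.xz.gpg`, and the JDK has no built-in xz support, so gzip from `java.util.zip` is swapped in purely as a stand-in. The names `StreamingDecompress`, `copy`, `gunzip`, and `gzip` are hypothetical.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical helper, not part of the project. Gzip stands in for xz here.
object StreamingDecompress {
  /** Copy `in` to `out` through a fixed-size buffer; returns the bytes copied. */
  def copy(in: InputStream, out: OutputStream, bufferSize: Int = 8192): Long = {
    val buffer = new Array[Byte](bufferSize)
    var total = 0L
    var n = in.read(buffer)
    while (n != -1) {
      out.write(buffer, 0, n)
      total += n
      n = in.read(buffer)
    }
    total
  }

  /** Decompress gzip bytes held in memory by reading them as a stream. */
  def gunzip(compressed: Array[Byte]): Array[Byte] = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val out = new ByteArrayOutputStream()
    try copy(in, out) finally in.close()
    out.toByteArray
  }

  /** Only used to build input for the usage example below. */
  def gzip(raw: Array[Byte]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bytes)
    try gz.write(raw) finally gz.close()
    bytes.toByteArray
  }
}
```

A round trip such as `StreamingDecompress.gunzip(StreamingDecompress.gzip(data))` should give back `data`. Whether this actually reduces heap errors would still need to be measured. If the consumer can read from the `InputStream` directly, the final `ByteArrayOutputStream` could be skipped so the full decompressed content is never held at once.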
Also, if you want to use just one hour, you should specify the end date as well. For example:
sr.addFromDate("2011-10-05")
sr.addFromHour(0)
sr.addToHour(0)
sr.addToDate("2011-10-05")
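To make the "one hour" case concrete, here is a small sketch of how an inclusive from/to range could expand into date-hour keys. This is not StreamRange's actual implementation. `DateHours.between` is a hypothetical helper, the `yyyy-MM-dd-HH` key format is inferred from the log line quoted in this thread, and `java.time` assumes Java 8 or later.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper showing inclusive date-hour expansion.
object DateHours {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH")

  /** All date-hour keys from (fromDate, fromHour) to (toDate, toHour), inclusive. */
  def between(fromDate: String, fromHour: Int, toDate: String, toHour: Int): Vector[String] = {
    val start = LocalDate.parse(fromDate).atTime(fromHour, 0)
    val end = LocalDate.parse(toDate).atTime(toHour, 0)
    Iterator.iterate(start)(_.plusHours(1))
      .takeWhile(t => !t.isAfter(end))
      .map(_.format(fmt))
      .toVector
  }
}
```

With both ends set to the same date and hour, `DateHours.between("2011-10-05", 0, "2011-10-05", 0)` yields a single key. Under this sketch, a later end date would pull in every hour in between.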
CachedFaucet is overly confusing to debug. I modified the URL in getRDDZ to
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/%s/%s
so that it works with some files from the new corpus, and in the range I added the first date-hour, which contains only a single file, but for some reason it gets
Fetching, decrypting with GrabGPG(2011-10-07-14,social.06613a7d494eec94f8ee2aa583107a32.xz.gpg)
This very simple modification took a long time, mostly spent figuring out what was wrong. We need a central place to get parameters from, to avoid duplicating settings such as where files come from. I was also trying to modify it to work with local files, but that's a whole other story.
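One possible shape for such a central place, as a sketch only: a single object that owns the file location, so the URL template is defined once and a local directory can be swapped in. `CorpusConfig`, `Source`, `Remote`, `Local`, and `locate` are hypothetical names, not existing code. The sketch also assumes the two `%s` slots in the URL are the date-hour and the file name.

```scala
// Hypothetical central configuration; none of these names exist in the project.
object CorpusConfig {
  /** Where corpus files come from. */
  sealed trait Source {
    def locate(dateHour: String, fileName: String): String
  }

  /** Remote files addressed by a template with two %s slots (assumed: date-hour, file name). */
  final case class Remote(urlTemplate: String) extends Source {
    def locate(dateHour: String, fileName: String): String =
      urlTemplate.format(dateHour, fileName)
  }

  /** Local files laid out as rootDir/dateHour/fileName (an assumed layout). */
  final case class Local(rootDir: String) extends Source {
    def locate(dateHour: String, fileName: String): String =
      s"$rootDir/$dateHour/$fileName"
  }

  // The URL template from this thread, defined in exactly one place.
  val kba2013: Source = Remote(
    "http://s3.amazonaws.com/aws-publicdatasets/trec/kba/" +
      "kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/%s/%s")
}
```

Callers would then ask `CorpusConfig.kba2013.locate(dateHour, fileName)` instead of formatting the URL themselves. Switching to local files would mean changing one value rather than hunting for duplicated strings.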