datasalt / pangool

Tuple MapReduce for Hadoop: Hadoop API made easy
http://datasalt.github.io/pangool/
Apache License 2.0

Instance files should be deleted after Job completion #17

Closed pereferrera closed 11 years ago

pereferrera commented 11 years ago

Pangool serializes everything (InputFormat, Combiner, Mapper, Reducer, OutputFormat...) so that we can use instances instead of classes. We try to use the DistributedCache when possible (it is not possible for InputFormat and OutputFormat). The current approach works fine; however, instance files may accumulate in "hadoop.tmp.dir" and, if they are not deleted, the cost of loading one increases. There are three things that must be revised:

1) Deleting instance files after Job finalization, if possible.
2) DCUtils: if(path.toString().endsWith(filePostFix)) -> it doesn't look like we need a postfix check here. Getting the file directly should be possible, so the cost would be constant (we don't need to iterate over all instance files).
3) Using the local files from the DistributedCache when possible, through a boolean parameter: InputFormat and OutputFormat would pass "false", and Combiner, Mapper and Reducer "true".

Currently the main bottleneck happens in the Combiner, if there are a lot of files. The cost of loading the instance for everything else is negligible. Which makes me think of another idea:

  • We could implement a JVM-local cache for Combiner instances, a static HashMap for instance. Because Combiners are executed in the same JVM as Mappers, caching the object instances would allow us to avoid looking up the instance file for every Combiner call. Does it make sense?

In any case, the other things should be done as well.
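
A minimal sketch of what such a JVM-local cache could look like. The class and interface names here are hypothetical, not actual Pangool code:

import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the JVM-local Combiner instance cache discussed above.
public final class CombinerInstanceCache {

    // Combiners run in the same JVM as their Mappers, so a static map lets an
    // instance be deserialized once and reused by later Combiner calls.
    private static final ConcurrentHashMap<String, Object> CACHE =
            new ConcurrentHashMap<String, Object>();

    // Abstraction over the existing deserialization logic (e.g. what DCUtils does).
    public interface InstanceLoader {
        Object load(String instanceFileName) throws IOException;
    }

    public static Object getOrLoad(String instanceFileName, InstanceLoader loader)
            throws IOException {
        Object instance = CACHE.get(instanceFileName);
        if (instance == null) {
            // Only touch HDFS / the local file system on a cache miss.
            instance = loader.load(instanceFileName);
            CACHE.put(instanceFileName, instance);
        }
        return instance;
    }
}

Note that, as pointed out in the next comment, reusing instances this way is only safe if the cached objects are stateless or have their state reset between uses.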

ivanprado commented 11 years ago

Instances could keep state, so reusing instances could be dangerous because that state can mutate between different Combiner invocations. Three options:

When does the case arise where there are so many Combiners in the same JVM that the instance SerDe cost is not negligible?

I don't understand point 3). From what I understand, the getLocalCache methods should always be used.

Looking at the documentation, the methods purgeCache() ( http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#purgeCache(org.apache.hadoop.conf.Configuration)) and releaseCache() ( http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html#releaseCache(java.net.URI, org.apache.hadoop.conf.Configuration)) seem relevant for this case. Probably we are not releasing the cached files properly after a job execution.
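
For reference, a sketch of how those two methods could be called once a job has finished. The signatures follow the javadoc linked above; whether calling them from client code actually cleans anything up is exactly what is in question in this thread:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

// Sketch only; instanceFileUri is a placeholder for one of Pangool's instance files.
public class ReleaseCacheSketch {

    public static void releaseAfterJob(Configuration conf, URI instanceFileUri) throws IOException {
        // Signal that this particular cached file is no longer referenced.
        DistributedCache.releaseCache(instanceFileUri, conf);
        // Or, more aggressively, clear everything that is no longer referenced.
        DistributedCache.purgeCache(conf);
    }
}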

ivanprado commented 11 years ago

After looking at the DCUtils class and the calls to it, I have come to the conclusion that the DistributedCache is not actually used anywhere in the code. The method DCUtils.serializeToDC() calls DistributedCache.addCacheFile()... but no part of the code ever reads the DC-cached version.

Instead, it seems that Pangool always loads instances directly from HDFS (see the DCUtils.locateFileInDC method).

If I remember correctly, we are not using the DC because the files are not always present at the right moment (InputFormat and OutputFormat instantiation?).
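
To make the mechanism concrete, here is an illustration of the general idea: serialize a configured instance to a file in HDFS at job-setup time and read it back inside the task. This is not the actual DCUtils code; it uses plain Java serialization purely for the sake of the example:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustration only of the instance-file mechanism, not Pangool code.
public class InstanceFileSketch {

    // Called on the client: write the configured instance to an HDFS file.
    public static void writeInstance(Serializable instance, Path file, Configuration conf)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        ObjectOutputStream out = new ObjectOutputStream(fs.create(file));
        try {
            out.writeObject(instance);
        } finally {
            out.close();
        }
    }

    // Called inside the task: read the instance back from HDFS.
    public static Object readInstance(Path file, Configuration conf)
            throws IOException, ClassNotFoundException {
        FileSystem fs = FileSystem.get(conf);
        ObjectInputStream in = new ObjectInputStream(fs.open(file));
        try {
            return in.readObject();
        } finally {
            in.close();
        }
    }
}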

Therefore, the problems are:

The difficult part to solve here is the cache cleanup. This cache lives in HDFS and keeps the instances needed for each job. Instances are identified by a generated UUID plus some fixed prefix. The difficulty is knowing when an instance is no longer needed:

A simple idea would be to have a configurable time-to-live for each instance, but that is far from perfect.

Another possibility would be to associate instances with JobIDs. That way, for every instance we could check whether the job it belongs to is still alive. But the JobID is not available until job.submit() is called.

Another possibility is to delete the files after the job has finished (successfully or not). At that point you know all the instance filenames, so you can remove them. Although this sounds good, it is an unsafe procedure: the process that launched the job could be killed or could fail, and in that case the files wouldn't be deleted. It is also not very elegant, as it would require the user to call a method, something like:

// Sketch from the discussion: MRBuilder and cleanJob() are hypothetical names.
Job job = MRBuilder.build();
try {
   job.submit();
} finally {
   // delete the instance files created for this job
   MRBuilder.cleanJob(job);
}

Any smart ideas?

(In conclusion, the three points raised by Pere remain relevant.)

ivanprado commented 11 years ago

I made some minor changes:

  1. Renamed the class DCUtils to InstancesDistributor, updating the comments.
  2. Removed the postfix check (point 2 of the initial proposal).

pereferrera commented 11 years ago

A possible solution would be to not give a Job instance to the user, but to wrap it in a Pangool class that handles the file cleanup underneath.

This new class should expose the same functionality that Job does, such as:
  • Running the Job in verbose mode or not,
  • Blocking and non-blocking runs,
  • isComplete(), isSuccessful().

This change would not be backwards compatible, but we can deprecate createJob() and remove it in future versions.
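
A rough sketch of what such a wrapper could look like. The class name and constructor are hypothetical, not actual Pangool API; only the blocking run is shown:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical wrapper sketched from this proposal.
public class PangoolJob {

    private final Job job;
    private final Runnable instanceFilesCleaner;

    public PangoolJob(Job job, Runnable instanceFilesCleaner) {
        this.job = job;
        this.instanceFilesCleaner = instanceFilesCleaner;
    }

    // Blocking run; cleans the instance files whether the job succeeds or fails.
    public boolean waitForCompletion(boolean verbose)
            throws IOException, InterruptedException, ClassNotFoundException {
        try {
            return job.waitForCompletion(verbose);
        } finally {
            instanceFilesCleaner.run();
        }
    }

    public boolean isComplete() throws IOException {
        return job.isComplete();
    }

    public boolean isSuccessful() throws IOException {
        return job.isSuccessful();
    }
}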

ivanprado commented 11 years ago

That seems the most reasonable solution, but it is still not safe. It seems that it is not possible to solve this problem properly without changing the Hadoop codebase.

epalace commented 11 years ago

Two things:

1 - I think we should keep createJob() in any case. There are multiple APIs that deal directly with Job instead of Configuration, and they would break if we encapsulate it. For example: AvroJob.setOutputKeySchema, AvroJob.setOutputValueSchema. Avro also includes its own multiple outputs, AvroMultipleOutputs, which deals directly with Job. Besides, any user-custom library that implements Job execution retries and accepts a Job would break.

2 - About the cleaning: why not just create a couple of static methods that take a Job and, after completion (or error), clean up the garbage? (Something along the lines of the sketch below.)

Eric.
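
A minimal sketch of such a static helper, under the assumption that the caller knows which instance files were written for the job; the class and method names are made up:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical static helper along the lines suggested above; not Pangool API.
public final class JobCleaner {

    // Runs the job and, whether it succeeds or fails, deletes the serialized
    // instance files that were created for it.
    public static boolean runAndClean(Job job, List<Path> instanceFiles)
            throws IOException, InterruptedException, ClassNotFoundException {
        try {
            return job.waitForCompletion(true);
        } finally {
            FileSystem fs = FileSystem.get(job.getConfiguration());
            for (Path instanceFile : instanceFiles) {
                fs.delete(instanceFile, false);
            }
        }
    }
}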

pereferrera commented 11 years ago

I see the point in Eric's concerns.

So I think for now the best solution is to implement the static method and make its usefulness clear by highlighting it in the documentation.

Everyone agree?

ivanprado commented 11 years ago

I agree with Eric. I would add some static methods for Job cleanup, and keep an eye on this issue in the future to try to find better solutions.

Iván

pereferrera commented 11 years ago

And the instance files should carry a JobID, since jobs could be executed in parallel and only the files belonging to a given JobID should be cleaned.

I can take care of this issue.

ivanprado commented 11 years ago

+1

ivanprado commented 11 years ago

Instance files cannot carry the JobID, as it is not available until the job is submitted. The UUID trick is there precisely to solve the issue with parallel jobs.
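
For illustration only (the prefix below is made up): a random UUID in the file name is what keeps instance files of jobs launched in parallel from colliding, since no JobID exists yet when the files are written:

import java.util.UUID;

// Illustrative sketch of UUID-based instance file naming; not Pangool code.
public class InstanceFileNaming {

    public static String newInstanceFileName(String fixedPrefix) {
        return fixedPrefix + "-" + UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // prints something like: pangool-instance-3f1c0a2e-...
        System.out.println(newInstanceFileName("pangool-instance"));
    }
}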

pereferrera commented 11 years ago

I have added a non-static method to TupleMRBuilder and MapOnlyJobBuilder that deletes the files generated by the builder. Each builder keeps track of all the files it generates for its Job, so there are no concurrency issues.
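
A usage sketch of the builder-level cleanup. The package path and the method name (cleanUpInstanceFiles()) are assumptions based on this thread; check the builder's javadoc for the exact API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.datasalt.pangool.tuplemr.TupleMRBuilder;

// Hedged usage sketch; method and package names may differ from the real API.
public class BuilderCleanupExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TupleMRBuilder builder = new TupleMRBuilder(conf, "cleanup-example");
        // ... configure inputs, intermediate schema, group-by fields, reducer and output here ...
        Job job = builder.createJob();
        try {
            job.waitForCompletion(true);
        } finally {
            // Deletes only the instance files created by this builder for this Job,
            // so jobs launched in parallel do not step on each other.
            builder.cleanUpInstanceFiles();
        }
    }
}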