The purpose of this issue is to maintain a list of the areas where data loss can occur in the Aleph2 platform as a result of: network failures, software exceptions, process failures, and node failures
Currently BeFileInputReader deletes/archives an input file as soon as it has read the last record - if the batch enrichment handler then throws an exception, the file has already been removed and those records are lost
(if this is a transient error then it could be worked around by deleting files in close() instead ... if it is a permanent error that requires a software change, then it is arguable whether the file should remain in the input directory or not - can you even tell from within the record reader whether the task has failed? Also you're going to get dups ... really you need to set a flag once all records for a given file have been written, and only delete the file then - but that seems difficult to do in Hadoop)
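A minimal sketch of the deferred-deletion idea, ignoring the Hadoop task/split plumbing that makes this hard in practice - the class and method names (DeferredDeleteReader, markBatchFailed) are hypothetical, not the actual BeFileInputReader API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical sketch only: defer deletion/archival of the input file
 * to close(), and skip it entirely if any enrichment batch failed.
 */
public class DeferredDeleteReader implements AutoCloseable {
    private final Path inputFile;
    private final AtomicBoolean batchFailed = new AtomicBoolean(false);

    public DeferredDeleteReader(Path inputFile) {
        this.inputFile = inputFile;
    }

    /** Called by the enrichment wrapper when a batch throws. */
    public void markBatchFailed() {
        batchFailed.set(true);
    }

    @Override
    public void close() throws IOException {
        if (batchFailed.get()) {
            // Leave the file in place so it can be retried (accepting
            // possible duplicate records on the rerun)
            return;
        }
        Files.delete(inputFile); // or move to an archive directory
    }
}
```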
In general, putting better protection around batch calls seems like a good idea: if a failure occurs, the whole batch could be written into an error list (retained for some time period and indexed by job) instead - and never write in the middle of a batch, ie always treat output as a final batch step that does nothing else
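One possible shape for that protection, as a hedged sketch - GuardedBatchRunner and ErrorStore are illustrative names, not part of Aleph2:

```java
import java.util.List;
import java.util.function.Consumer;

/**
 * Sketch of the "protect batch calls" idea: run the enrichment batch
 * inside a try/catch and divert the entire batch to a per-job error
 * store on failure, rather than losing it or leaving a partial write.
 */
public final class GuardedBatchRunner {

    /** Hypothetical store for failed batches, indexed by job. */
    public interface ErrorStore<T> {
        void storeFailedBatch(String jobId, List<T> batch, Throwable cause);
    }

    public static <T> void runBatch(String jobId,
                                    List<T> batch,
                                    Consumer<List<T>> enrichment,
                                    ErrorStore<T> errors) {
        try {
            enrichment.accept(batch); // all-or-nothing: no writes mid-batch
        } catch (Exception e) {
            errors.storeFailedBatch(jobId, batch, e);
        }
    }
}
```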
Currently all ES and HDFS bulk write requests (and all requests from a context) go into a black hole that has no acks.
Therefore if a) the client dies unexpectedly, or b) the write fails (overload/malformed request), then it is unknown whether the doc has been written or not
One option for b) is at least to write all such records into some reasonably safe channel (eg a file)
a) is more of a problem - maybe we need to return a batch id future that completes only once all records from that batch have been written; that future could then drive Storm/Kafka acking, file deletion, etc (see the sketch below)
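A sketch of what such a batch future could look like, assuming a hypothetical AckingBatchWriter where the sink (ES/HDFS) acknowledges each record individually:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch (hypothetical names, not the Aleph2 API): writeBatch returns a
 * CompletableFuture that completes once every record in the batch has
 * been acknowledged by the sink. The caller can then ack the Storm
 * tuple, commit the Kafka offset, or delete the input file.
 */
public abstract class AckingBatchWriter<T> {

    /** Send one record; invoke onAck exactly once when the sink confirms it. */
    protected abstract void sendRecord(T record, Runnable onAck);

    public CompletableFuture<Void> writeBatch(List<T> batch) {
        if (batch.isEmpty()) {
            return CompletableFuture.completedFuture(null);
        }
        CompletableFuture<Void> done = new CompletableFuture<>();
        AtomicInteger pending = new AtomicInteger(batch.size());
        for (T record : batch) {
            sendRecord(record, () -> {
                if (pending.decrementAndGet() == 0) {
                    done.complete(null); // last ack completes the batch future
                }
            });
        }
        return done;
    }
}
```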
I think that a failing Kafka write (eg due to network connectivity) will throw an exception, which is not ideal but better than nothing
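For reference, the standard kafka-clients producer can also surface send failures explicitly via a callback, which would let us divert the failed record somewhere safe instead of relying on an eventual exception - topic name and config values below are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch of checking Kafka write success per record via the send callback. */
public class CheckedKafkaWrite {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example_topic", "doc"),
                (metadata, exception) -> {
                    if (exception != null) {
                        // eg network connectivity failure: divert the record
                        // to a safe channel (file) instead of losing it
                        System.err.println("Kafka write failed: " + exception);
                    }
                });
        } // close() flushes outstanding sends
    }
}
```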