Alright, so there are a few cases I didn't know exactly how to deal with (most of them are also marked as TODOs in the code):
Firstly, what Janek pointed out: during actual storage revival (moving the offset), do we have to do anything more sophisticated than advancing the file iterator one record at a time? This might be a little slow, but none of these packages seem to have any seek-like method implemented, so that would be a bit complex (moving file iterators on our own by remembering bytes read, bleh)
CSV/Excel: if the above point is fine, are the readRecordFromFileWithInitialize() methods acceptable? To keep things simple, rather than just reading a raw line from the file, these functions also parse records; thanks to that we can also handle header rows and the like, so the code is much simpler and easier to understand (although, when moving by offset, we actually parse records and drop them afterwards)
JSON/CSV/Excel: when a line can't be read, an error is returned, but there are a few ways to handle this:
Drop the entire batch and instantly return the error -> currently implemented (but pretty silly imho)
Push the records that were read correctly (the first part of the batch), then return the error -> quite easy to do
Maybe even don't bother and continue reading as if the incorrect record never appeared...
...
... What should happen to the worker after a file read error occurs? Currently (copy-pasted from kafka) we just re-enter the outer loop, so we actually continue reading the rest of the file (example: say we have 3 rows where the 2nd one is somehow incorrect and batchSize = 2; we will drop the first full batch and then continue reading the file, so the output is the 3rd row). I believe we should distinguish file errors (which should instantly close the worker) from standard non-nil errors like queue push errors or badger errors (which should reinitialize reading from storage).
So, just to start with, regarding the intro you wrote:
Seeking one by one is ok imo
You definitely don't want to just skip on error. Kafka reinitializes the offset, so it will try to read the batch again. Failing the whole batch is ok; just retry it afterwards, as in kafka. Reset the offset, basically. And log something.