cbft gets stuck if you put a non-JSON doc into a bucket.
That is, cbft appears to loop forever trying to parse a document value as JSON when it isn't JSON.
The current theory is that when bleve Batch ingest hits an error on some document in the batch (during mapDocument()), bleve Batch() returns an error, without telling us which exact doc in the batch failed. That error propagates back to cbft and from there to cbdatasource, which shuts down its DCP connection and restarts the entire DCP stream.
But the DCP stream (correctly) restarts where it left off, re-streaming the same problematic docs, so the error repeats indefinitely.
cbft might need to use a lower-level bleve API instead of the easy-to-use high-level "porcelain" bleve Batch API, so that cbft can detect that there's a document mapping / character-analyzer error and just ignore that document in the batch. (cbft also probably needs to track and expose a "side area" of errors for things like document parsing/analysis errors, like "these are some of the docs which recently couldn't be indexed and the related error info".)
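As a rough illustration of the "skip the bad doc and remember why" idea, here is a minimal sketch that uses only the Go standard library. It does not use bleve or cbft APIs: Doc, DocError, ErrorLog and FilterBatch are hypothetical names invented for this example, and checking with encoding/json before batching is an assumed stand-in for wherever the real mapping/analysis error would be caught.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Doc is a hypothetical stand-in for a document received from the DCP stream.
type Doc struct {
	ID    string
	Value []byte
}

// DocError records why a document could not be indexed.
type DocError struct {
	ID  string
	Err string
}

// ErrorLog is a hypothetical bounded "side area" that keeps only the
// most recent Max errors so it cannot grow without limit.
type ErrorLog struct {
	Max     int
	Entries []DocError
}

// Add appends an error and drops the oldest entries beyond Max.
func (l *ErrorLog) Add(id string, err error) {
	l.Entries = append(l.Entries, DocError{ID: id, Err: err.Error()})
	if len(l.Entries) > l.Max {
		l.Entries = l.Entries[len(l.Entries)-l.Max:]
	}
}

// FilterBatch returns the docs whose values parse as JSON and records
// the others in errs, instead of failing the whole batch.
func FilterBatch(docs []Doc, errs *ErrorLog) []Doc {
	good := make([]Doc, 0, len(docs))
	for _, d := range docs {
		var v interface{}
		if err := json.Unmarshal(d.Value, &v); err != nil {
			errs.Add(d.ID, err)
			continue // skip this doc; the rest of the batch proceeds
		}
		good = append(good, d)
	}
	return good
}

func main() {
	errs := &ErrorLog{Max: 100}
	batch := []Doc{
		{ID: "a", Value: []byte(`{"name":"ok"}`)},
		{ID: "b", Value: []byte("hello")},
	}
	good := FilterBatch(batch, errs)
	fmt.Println(len(good), "indexable,", len(errs.Entries), "skipped")
	for _, e := range errs.Entries {
		fmt.Printf("skipped %s: %s\n", e.ID, e.Err)
	}
}
```

Because the bad doc is dropped rather than turned into a batch-level error, nothing propagates back to the feed and the stream has no reason to restart. The bounded log is what a stats or diagnostics endpoint could expose.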
In more detail, the actual error looks like this:
2015/02/24 21:39:47 feed_dcp: on error: intrapop-leveldb.1_4fc326d5c2fec085: error: HandleRecv, err: invalid character 'h' looking for beginning of value
The document was just the (non-JSON) bytes of "hello".
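The error text matches what Go's encoding/json produces for that input, which suggests (but does not prove) that the failure is a plain JSON decode of the document value. A small sketch to reproduce the message outside cbft; parseErr is a hypothetical helper written for this example:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseErr returns the error text produced when value is decoded as
// JSON, or "" if it decodes cleanly.
func parseErr(value []byte) string {
	var v interface{}
	if err := json.Unmarshal(value, &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// The raw bytes hello are not JSON.
	fmt.Println(parseErr([]byte("hello")))
	// The quoted string "hello" is valid JSON, so no error is reported.
	fmt.Println(parseErr([]byte(`"hello"`)) == "")
}
```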