couchbaselabs / cbft

*THIS PROJECT HAS MOVED* from couchbaselabs TO: https://github.com/couchbase/cbft -- no further development will be done here on couchbaselabs/cbft

cbft can loop infinitely on a non-JSON document #31

Closed steveyen closed 9 years ago

steveyen commented 9 years ago

cbft gets stuck if you put a non-JSON doc into a bucket.

That is, cbft appears to loop forever trying to parse a document value as JSON when it isn't JSON.

The current theory is that when bleve Batch ingest encounters an error on some document in the batch (during mapDocument()), bleve Batch() returns an error, but without telling us which doc in the batch caused it. That error propagates back to cbft and then to cbdatasource, which shuts down its DCP connection and restarts the entire DCP stream.

But the DCP stream (correctly) restarts where it left off, re-streaming the same problematic docs, so the error repeats.
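To make the suspected loop concrete, here is a small simulation of the theory. It is not cbft, bleve, or cbdatasource code: indexBatch and runFeed are hypothetical stand-ins, the "stream" is just a slice of byte values, and maxRestarts exists only so the demo terminates.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// indexBatch is a hypothetical stand-in for an all-or-nothing batch ingest:
// it fails the whole batch if any value is not valid JSON.
func indexBatch(batch [][]byte) error {
	for _, raw := range batch {
		var v interface{}
		if err := json.Unmarshal(raw, &v); err != nil {
			return err
		}
	}
	return nil
}

// runFeed simulates a feed that, after every failure, restarts the stream
// from the last successfully indexed position. maxRestarts only bounds the
// simulation; the behavior described in this issue has no such bound.
func runFeed(stream [][]byte, batchSize, maxRestarts int) (restarts, lastSeq int) {
	for lastSeq < len(stream) {
		end := lastSeq + batchSize
		if end > len(stream) {
			end = len(stream)
		}
		if err := indexBatch(stream[lastSeq:end]); err != nil {
			if restarts == maxRestarts {
				return restarts, lastSeq
			}
			restarts++
			continue // lastSeq is unchanged, so the same batch is replayed
		}
		lastSeq = end
	}
	return restarts, lastSeq
}

func main() {
	stream := [][]byte{
		[]byte(`{"a":1}`),
		[]byte("hello"), // the non-JSON document
		[]byte(`{"b":2}`),
	}
	restarts, lastSeq := runFeed(stream, 2, 5)
	fmt.Printf("restarts=%d lastSeq=%d\n", restarts, lastSeq)
}
```

In this sketch the position never advances past the bad batch, so every restart replays the same documents and fails the same way.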

cbft might need to use a lower-level bleve API instead of the easy-to-use, high-level "porcelain" bleve Batch API, so that cbft can detect a document mapping / character-analyzer error and skip just that document in the batch. (cbft probably also needs to track and expose a "side area" of errors for things like document parsing/analysis failures, along the lines of "these are some of the docs which recently couldn't be indexed, and the related error info".)
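A rough sketch of the "skip the bad doc and remember why" idea follows. It does not use any bleve API: it swaps in a plain encoding/json pre-check in place of the lower-level bleve calls, and the types doc, docError, and recentErrors, as well as filterBatch, are hypothetical names invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doc is a hypothetical key/value pair as it might arrive from a feed.
type doc struct {
	Key   string
	Value []byte
}

// docError is a hypothetical record in the "side area" of recent failures.
type docError struct {
	Key string
	Err string
}

// recentErrors keeps only the most recent max failures.
type recentErrors struct {
	max     int
	entries []docError
}

func (r *recentErrors) add(key string, err error) {
	r.entries = append(r.entries, docError{Key: key, Err: err.Error()})
	if len(r.entries) > r.max {
		r.entries = r.entries[len(r.entries)-r.max:]
	}
}

// filterBatch returns the docs whose values parse as JSON and records the
// others in errs, instead of failing the whole batch.
func filterBatch(batch []doc, errs *recentErrors) []doc {
	var kept []doc
	for _, d := range batch {
		var v interface{}
		if err := json.Unmarshal(d.Value, &v); err != nil {
			errs.add(d.Key, err)
			continue
		}
		kept = append(kept, d)
	}
	return kept
}

func main() {
	errs := &recentErrors{max: 10}
	batch := []doc{
		{Key: "a", Value: []byte(`{"x":1}`)},
		{Key: "b", Value: []byte("hello")},
		{Key: "c", Value: []byte(`[1,2]`)},
	}
	kept := filterBatch(batch, errs)
	fmt.Println(len(kept), len(errs.entries))
	for _, e := range errs.entries {
		fmt.Printf("%s: %s\n", e.Key, e.Err)
	}
}
```

Bounding the error list keeps memory use fixed while still letting an operator see which recent documents were skipped and why.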

In more detail, the actual error looks like this:

2015/02/24 21:39:47 feed_dcp: on error: intrapop-leveldb.1_4fc326d5c2fec085: error: HandleRecv, err: invalid character 'h' looking for beginning of value

The document was just the (non-JSON) bytes of "hello".
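The error text in that log line matches what Go's standard encoding/json package reports for such input, so it can be reproduced outside cbft. The snippet below is only a minimal reproduction; parseValue is a hypothetical helper, not a cbft or bleve function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseValue is a hypothetical helper that tries to decode a raw document
// value as JSON, the way a JSON-based mapping step would.
func parseValue(raw []byte) error {
	var v interface{}
	return json.Unmarshal(raw, &v)
}

func main() {
	// The raw bytes of hello, with no surrounding quotes, are not valid JSON.
	fmt.Println(parseValue([]byte("hello")))
	// prints: invalid character 'h' looking for beginning of value

	// The same text as a JSON string value parses fine.
	fmt.Println(parseValue([]byte(`"hello"`)))
	// prints: <nil>
}
```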

steveyen commented 9 years ago

Looks like updating to bleve's new Batch.Index API fixed this.