caporaso-lab / tax-credit

A repository for storing code and data related to a systematic comparison of short read taxonomy assignment tools
BSD 3-Clause "New" or "Revised" License

How should we handle missing observations in taxonomy assignments during eval? #128

Open nbokulich opened 7 years ago

nbokulich commented 7 years ago

If a representative sequence is present in a mock community (and hence in the expected taxonomy file) but is not classified by a given taxonomy classifier (ahem... current sklearn naive-bayes), the eval throws an error:

AssertionError in:
/Users/nbokulich/Desktop/projects/short-read-tax-assignment/data/precomputed-results/mock-community/mock-22/gg_13_8_otus_amplicon/naive-bayes/False:0.1
AssertionError: observed and expected read labels differ:
['fb6f70f872db41f24044c84b6f7a5844', 'f347ed507374d3ad78e59ce519190b59', 'a0748d07684e299b7bb3fdb265e5de4a', '1d77d81bf8e4fa24b525987257dd5396', '6b57b00d3724d584038b912d5462b1a3', 'bcc8ad1c4518db8cc3549258f5834d4a', 'a67bb5740043b6e08858fa4edbfc8ed3', '2e1911ec10887839e6a2aeaa06669e54', '574f20a3d7c71281fa40d64f27e6e22b', 'f134cd502119dcd22f8337ba5e89667f', '3f265921cb45d83b28ee286b154fc0d0', 'b53806d2798866e54bc88919793f0143', '5f24d62e8caaa259b59cb08f7c7756df', '83da98c6bc8521e9c791aeced185cb15', '333a40af36286f867e8e903e420016d4', '0c01877ee48a17ced09f7382bfdece9f', '071ccbf1a99312897fd4338f3aa15ee0', '47d01e6146062d4183e8b7db6a700475', 'f461aa828efb039d730b0e518c4f0aaa', '67e67721df6df05920b34c512038ccc0', 'fbd5df3b3d4c3a5432cb6f0cbd649ca1', 'e95f715d634bf05de7ea84afad3f9746', '1820e6e409a195ddef6a8d62d7b5242b', 'b7dacef20fe1f1c854ea226f7e157205', '94abfcc74e15148e81c3ecca47a75bed', 'b7172bfd271404e546936636e86cba72', 'c00fd53d00e8e2e34b114980b98e280a', '2d4978cf246f2ee60cd5ddba9df73e8e', '1bba9bc44ca57e41a7fa5b96fcf4a792', 'c551e3a305d51c1a3c56057d4b57d42a', 'b5bc607a490a5a0e97594a97e5e3820b', 'aa920529d931f31b07482715d415ac65', '27f00823dd0cf2a0b7fe15cd55c2b817', '3510d5121df9f091d80df9d53c460170', 'bbc6a536ea3427b5eeaa008fd43ff261', 'f87fe98e3a70e14b5fa41cbe36d56d48', 'a7ac04fb6dae5540438d7f5f9b626ca6', 'c9bf5c74d65804a3c12ee5f33a35ce53', '3295b7c2f9a047b89d14a6aff28d25da', '1b736c9b2516e440141859f406d9a754', '429c87d38cccaced346f5669cd76dd9c', 'bd0e04bba322ccdb0aeddbdc43ce1798', 'd03173933344f7959edc8d5fa2178b4b', '2c5352f7ad9c99e93424ced5793b9f83', '761f6e7e683272bb63343f32497e5140', '220c846f12fd2f209cec1f6869147a6d', '2fe0d49dde617cc794a5eac8643a224c', '867a6a47b7304418cb43fb4a2e30d3d0', '42370bee07a419d0bf08a77c71eb71b2', '077aa1b236fc782e8ec5d52d38d2217a', '8dbf4b0cb48736c8a406f28dfa64341f', '88c84411a36fcd85de778f674566a98f', '13193b19dd227e21a2b4404a93ef11a5', 'b8cc728644d1ee15082bc05ed841134a', 'c44d66bffeacdbdfaf774e80a63a314c', 
'c4dc03896e8c65d0bbe4f472fbfda5f2', '675901e598409fba2ca88494be3463d3', '75d67bcc37ff6bc3b89fce1ee8f1c372', '2077be704a8a81f1588f95a3b09504c3', 'ca916522eb03925008ed1a573fccea4d', 'eac49ef763ca5d9009fa065899351d50', '78274e9a3a4f55de2e92f6e5d89a4865', '00c46ec40d6e5ff13c931ab695422809', '01fdabbd95d79a1883887f5c64967e3b', 'd1af4f8446d8a41975234d96ac7c51e8', '4669f785716df43aa86c9a990e619eb1', '1c3e865d1fc5e0a8a0f78c8104960cca', '383d67c7f618436f9731556d5148b4bb', 'd6aa78db03f112595ccf3de6e632a776', 'b85a7e5c0f73a4d651fa3e6f082ec9dd', 'b98513444f750317f0dfff492b9d2cdc', 'bbb2c82af14dff4b79b854e3e7a2d86a', '0da786d0c2fa20926b44660a9ab2e8f0', '0493483cc66975024c927c2356f762d9', '9ca45f68f3c2a26d26b68740a86ad5c7', '947509aeb83689ba95645303eba19e9c']
['f347ed507374d3ad78e59ce519190b59', '598c0d78a56f4f34186a60a2096c8cbb', '1d77d81bf8e4fa24b525987257dd5396', '6b57b00d3724d584038b912d5462b1a3', '071ccbf1a99312897fd4338f3aa15ee0', 'afb70b875bdbfe8eb05036c6761f2949', 'a67bb5740043b6e08858fa4edbfc8ed3', '2e1911ec10887839e6a2aeaa06669e54', '574f20a3d7c71281fa40d64f27e6e22b', '75d67bcc37ff6bc3b89fce1ee8f1c372', '83da98c6bc8521e9c791aeced185cb15', '3f265921cb45d83b28ee286b154fc0d0', 'f87fe98e3a70e14b5fa41cbe36d56d48', 'b53806d2798866e54bc88919793f0143', 'f461aa828efb039d730b0e518c4f0aaa', 'a7ac04fb6dae5540438d7f5f9b626ca6', '5f24d62e8caaa259b59cb08f7c7756df', '9ca45f68f3c2a26d26b68740a86ad5c7', '333a40af36286f867e8e903e420016d4', 'a0748d07684e299b7bb3fdb265e5de4a', '220c846f12fd2f209cec1f6869147a6d', 'f134cd502119dcd22f8337ba5e89667f', '0fa12d8ac45f660abaf5ad66ba640700', 'b85a7e5c0f73a4d651fa3e6f082ec9dd', '0493483cc66975024c927c2356f762d9', 'e17749b10816dd2810a5c28640cea363', '67e67721df6df05920b34c512038ccc0', 'fbd5df3b3d4c3a5432cb6f0cbd649ca1', 'e95f715d634bf05de7ea84afad3f9746', '1820e6e409a195ddef6a8d62d7b5242b', 'b7dacef20fe1f1c854ea226f7e157205', 'c551e3a305d51c1a3c56057d4b57d42a', 'b7172bfd271404e546936636e86cba72', 'c00fd53d00e8e2e34b114980b98e280a', '2d4978cf246f2ee60cd5ddba9df73e8e', '47d01e6146062d4183e8b7db6a700475', 'fb6f70f872db41f24044c84b6f7a5844', '42250b23bbcf1e5ed9dd2406606994c3', 'b5bc607a490a5a0e97594a97e5e3820b', '27f00823dd0cf2a0b7fe15cd55c2b817', '3510d5121df9f091d80df9d53c460170', 'bbc6a536ea3427b5eeaa008fd43ff261', 'd6aa78db03f112595ccf3de6e632a776', 'c9bf5c74d65804a3c12ee5f33a35ce53', '3295b7c2f9a047b89d14a6aff28d25da', '1b736c9b2516e440141859f406d9a754', '08c616332ec73f76695bda6d1e8c1883', '0c01877ee48a17ced09f7382bfdece9f', '429c87d38cccaced346f5669cd76dd9c', 'bd0e04bba322ccdb0aeddbdc43ce1798', 'd03173933344f7959edc8d5fa2178b4b', '2c5352f7ad9c99e93424ced5793b9f83', '761f6e7e683272bb63343f32497e5140', '4aa1826b4bcccd4d1d8eae11fa66d439', '2fe0d49dde617cc794a5eac8643a224c', 
'867a6a47b7304418cb43fb4a2e30d3d0', '3091d80f9a7330328732e0631ea8c41d', '42370bee07a419d0bf08a77c71eb71b2', '1bba9bc44ca57e41a7fa5b96fcf4a792', '077aa1b236fc782e8ec5d52d38d2217a', '8dbf4b0cb48736c8a406f28dfa64341f', '88c84411a36fcd85de778f674566a98f', '13193b19dd227e21a2b4404a93ef11a5', 'b8cc728644d1ee15082bc05ed841134a', 'c44d66bffeacdbdfaf774e80a63a314c', 'c4dc03896e8c65d0bbe4f472fbfda5f2', '675901e598409fba2ca88494be3463d3', '2077be704a8a81f1588f95a3b09504c3', 'ca916522eb03925008ed1a573fccea4d', 'eac49ef763ca5d9009fa065899351d50', '4669f785716df43aa86c9a990e619eb1', '00c46ec40d6e5ff13c931ab695422809', '01fdabbd95d79a1883887f5c64967e3b', 'd1af4f8446d8a41975234d96ac7c51e8', '94abfcc74e15148e81c3ecca47a75bed', '1c3e865d1fc5e0a8a0f78c8104960cca', '383d67c7f618436f9731556d5148b4bb', '78274e9a3a4f55de2e92f6e5d89a4865', 'b98513444f750317f0dfff492b9d2cdc', 'bbb2c82af14dff4b79b854e3e7a2d86a', '6404c9d54337d9a3f5b3fb365af3671f', '0da786d0c2fa20926b44660a9ab2e8f0', 'bcc8ad1c4518db8cc3549258f5834d4a', 'aa920529d931f31b07482715d415ac65', '947509aeb83689ba95645303eba19e9c']

We want this behavior because we want to catch scenarios where, e.g., classification failed and empty files were output, or other silly things like that, which do happen.

The ideal solution is to enforce that classifiers report unassigned observations explicitly, so that there is no ambiguity about how an observation was classified, or about whether (and at which step) it went missing.

However, this could be a barrier for some, e.g., if someone who is not a developer of classifier X wants to include it in an evaluation.

Any ideas on how to resolve? E.g., allow override of this AssertionError?
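To make the override idea concrete, here is a minimal sketch of what an opt-in relaxation could look like. It is not code from this repository: the function name `reconcile_labels`, the `strict` flag, and the `"Unassigned"` fill label are all hypothetical. It assumes the observed assignments are available as a dict mapping read ID to taxonomy string. The empty-output check stays on even when the override is enabled, since that is the failure we most want to keep catching.

```python
def reconcile_labels(expected_ids, observed, strict=True, fill_value="Unassigned"):
    """Compare expected read IDs with a classifier's observed assignments.

    All names here are hypothetical, for illustration only.

    expected_ids: iterable of read IDs from the expected taxonomy file.
    observed: dict mapping read ID -> taxonomy string from one classifier.
    strict: if True, raise when the two ID sets differ (current behavior).
    fill_value: label given to expected reads the classifier did not report.

    Returns (filled, missing, unexpected), where `filled` is a copy of
    `observed` with missing reads added, and the other two are sorted lists.
    """
    # An empty result almost certainly means classification failed,
    # so this check is never overridden.
    if not observed:
        raise ValueError("no observed assignments; classification may have failed")

    expected = set(expected_ids)
    seen = set(observed)
    missing = sorted(expected - seen)      # expected, but not reported
    unexpected = sorted(seen - expected)   # reported, but not expected

    if strict and (missing or unexpected):
        raise AssertionError(
            "observed and expected read labels differ: "
            f"{len(missing)} missing, {len(unexpected)} unexpected"
        )

    # Override path: record missing reads explicitly instead of failing.
    filled = dict(observed)
    for read_id in missing:
        filled[read_id] = fill_value
    return filled, missing, unexpected
```

With `strict=False`, the caller gets back the missing and unexpected IDs alongside the filled assignments, so the discrepancy can still be logged per classifier rather than silently dropped. Whether unexpected IDs should ever be tolerated is a separate question that this sketch leaves to the caller.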

nbokulich commented 7 years ago

@BenKaehler this issue is solved, correct? pls confirm and close if it is.