SciLifeLab / facs

Fast and Accurate Classification of Sequences using Bloom filters

FACS has problems on Mac OS X #122

Open guillermo-carrasco opened 10 years ago

guillermo-carrasco commented 10 years ago

Something must be wrong with either the bloom filter construction or the query method on Mac OS X, as every query returns 0 and nan:

{"begin_timestamp": "2014-04-02T10:48:14.914+0200",
"end_timestamp": "2014-04-02T10:48:14.914+0200",
"sample": "/Users/guillem/repos/facs/tests/data/synthetic_fastq/simngs_phiX_1000.fastq",
"bloom_filter": "/Users/guillem/repos/facs/tests/data/bloom/phiX.bloom",
"total_read_count": 0,
"contaminated_reads": 0,
"total_hits": 0,
"contamination_rate": nan,
"p_value": nan,
"threads": 0}

brainstorm commented 10 years ago

We should add an assertion that total_reads != 0. No dataset contains 0 reads at the moment.
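
A minimal sketch of such a check, assuming the test suite can get the query result as a dict with the same keys as the JSON report quoted above. The helper name `check_query_report` is hypothetical and not part of FACS.

```python
import math


def check_query_report(report):
    """Fail loudly when a query report looks like the broken OS X output.

    `report` is assumed to be a dict with the keys of the JSON report
    quoted in this issue; this helper is hypothetical, not FACS code.
    """
    # No test dataset is empty, so zero reads means the query did not run.
    assert report["total_read_count"] != 0, "query processed 0 reads"
    # With a nonzero read count the rate should be a real number, not nan.
    assert not math.isnan(report["contamination_rate"]), "rate is nan"
    return report
```

Called on the report from the first comment, this would stop at the first assertion instead of letting the nan values propagate.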

brainstorm commented 10 years ago

I went through query.c but could not figure out what's wrong, until I compared the actual bloom files on each system...

Sizes of bloom filters on mac and x86 respectively:

-rw-r--r--  1 roman  staff  217529788 28 Okt  2013 /Users/roman/dev/facs/tests/data/bloom/dm3.bloom
-rw-r----- 1 roman roman 217529788 May 18  2013 tests/data/bloom/dm3.bloom

Contents vary though:

MD5 (/Users/roman/dev/facs/tests/data/bloom/dm3.bloom) = 2a3d92277a675516c5d9efc470b84862
1d50e7f9e1170b6bff2d99f96a585b40  tests/data/bloom/dm3.bloom

So now I would bet that something is going wrong while building the bloom filter on OS X...
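
The same size-then-checksum comparison can be scripted so it runs identically on both systems. This is only a sketch: the function names are made up for illustration, and the paths passed in would be the two copies of dm3.bloom.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty read that marks EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True when both files hash to the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Reading in chunks keeps memory use flat for filters of a couple of hundred megabytes, such as the one listed above.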

brainstorm commented 10 years ago

A quick hexdump inspection reveals the problem. On OS X:

0000000   45056   06099   00001   00000   40860   03447   00001   00000
0000010   65194   01875   00000   00000   60061   26553   00000   00000
0000020   00007   00000   00001   00000   13862   02626   00000   00000

On x86:

0000000   53264   17105   10952   00000   12384   14347   10952   00000
0000010   65194   01875   00000   00000   60061   26553   00000   00000
0000020   00007   00000   00000   00000   13862   02626   00000   00000

Therefore, header construction for .bloom files on OSX is wrong. Let's "hexamine" build.c then... :)

brainstorm commented 10 years ago

Sorry for the strange decimal output; it makes the dumps hard to compare in hex. Here's the canonical output from hexdump (hexdump -n100 -C dm3.bloom), first on OS X:

00000000  00 b0 d3 17 01 00 00 00  9c 9f 77 0d 01 00 00 00  |..........w.....|
00000010  aa fe 53 07 00 00 00 00  9d ea b9 67 00 00 00 00  |..S........g....|
00000020  07 00 00 00 01 00 00 00  26 36 42 0a 00 00 00 00  |........&6B.....|
00000030  7b 14 ae 47 e1 7a 74 3f  13 00 00 00 ab 00 00 00  |{..G.zt?........|
00000040  00 00 00 00 00 00 00 00  94 b0 a0 28 6a 3a 0b 80  |...........(j:..|
00000050  c4 05 f4 00 c9 10 a8 0d  96 52 74 4d 52 60 e1 4c  |.........RtMR`.L|
00000060  42 aa 61 02                                       |B.a.|

On x86:

00000000  10 d0 d1 42 c8 2a 00 00  60 30 0b 38 c8 2a 00 00  |...B.*..`0.8.*..|
00000010  aa fe 53 07 00 00 00 00  9d ea b9 67 00 00 00 00  |..S........g....|
00000020  07 00 00 00 00 00 00 00  26 36 42 0a 00 00 00 00  |........&6B.....|
00000030  7b 14 ae 47 e1 7a 74 3f  13 00 00 00 ab 00 00 00  |{..G.zt?........|
00000040  50 00 00 00 00 00 00 00  94 b0 a0 28 6a 3a 0b 80  |P..........(j:..|
00000050  c4 05 f4 00 c9 10 a8 0d  96 52 74 4d 52 60 e1 4c  |.........RtMR`.L|
00000060  42 aa 61 02                                       |B.a.|
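
To avoid comparing the two canonical dumps by eye, the differing positions can be computed. The sketch below hard-codes the first 0x50 bytes of each dump exactly as printed above; the constant and function names are illustrative only.

```python
# First 0x50 bytes of dm3.bloom, copied from the two hexdumps above.
OSX_HEADER = bytes.fromhex(
    "00 b0 d3 17 01 00 00 00 9c 9f 77 0d 01 00 00 00 "
    "aa fe 53 07 00 00 00 00 9d ea b9 67 00 00 00 00 "
    "07 00 00 00 01 00 00 00 26 36 42 0a 00 00 00 00 "
    "7b 14 ae 47 e1 7a 74 3f 13 00 00 00 ab 00 00 00 "
    "00 00 00 00 00 00 00 00 94 b0 a0 28 6a 3a 0b 80"
)
X86_HEADER = bytes.fromhex(
    "10 d0 d1 42 c8 2a 00 00 60 30 0b 38 c8 2a 00 00 "
    "aa fe 53 07 00 00 00 00 9d ea b9 67 00 00 00 00 "
    "07 00 00 00 00 00 00 00 26 36 42 0a 00 00 00 00 "
    "7b 14 ae 47 e1 7a 74 3f 13 00 00 00 ab 00 00 00 "
    "50 00 00 00 00 00 00 00 94 b0 a0 28 6a 3a 0b 80"
)


def differing_offsets(a, b):
    """Byte offsets at which the two buffers disagree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def differing_words(a, b, width=8):
    """Start offsets of the width-byte words containing a difference."""
    return sorted({i - i % width for i in differing_offsets(a, b)})


if __name__ == "__main__":
    print([hex(off) for off in differing_words(OSX_HEADER, X86_HEADER)])
```

For these two headers the differing 8-byte words start at 0x00, 0x08, 0x20 and 0x40. Every other byte in the first 0x50 is identical.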

Since both bloom filters were generated by the same test suite, the parameters are the same, so there must be a bug in how OS X compiles or interprets numbers when building the filter... or a bad pointer when it reports the results...
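
One way to check the "same parameters" claim from the dumps is to decode a row that is byte-identical on both platforms. The sketch below reads row 0x30 as a little-endian double followed by two 32-bit integers. That layout is a guess from the byte pattern, not something taken from the FACS sources, and no field names are implied.

```python
import struct

# Row 0x30 of dm3.bloom, identical in the OS X and x86 dumps above.
ROW_0X30 = bytes.fromhex("7b 14 ae 47 e1 7a 74 3f 13 00 00 00 ab 00 00 00")


def decode_row(row):
    """Decode 16 bytes under an ASSUMED layout: double, int32, int32."""
    # "<" selects little-endian with no padding, so "dii" is 8 + 4 + 4 bytes.
    return struct.unpack("<dii", row)


if __name__ == "__main__":
    print(decode_row(ROW_0X30))
```

Under that assumed layout the row decodes to 0.005, 19 and 171 on both systems, which fits the idea that the build parameters match and only a few header words differ.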

brainstorm commented 10 years ago

I bindiffed both bloom files with radiff2 from @radare, and running the test against the "correct" bloom filter still returns the same nans. Therefore, there could be something wrong with the reporting function...