ScrollPrize / vesuvius

Python library for accessing Vesuvius Challenge data
MIT License

The Scroll routine appears to return incorrect data #4

Open righthalfplane opened 1 month ago

righthalfplane commented 1 month ago

I ran a simple example:

```python
import vesuvius

scroll = vesuvius.Volume("Scroll1")
img = scroll[1000, 5000:5256, 5000:5256]
binary_file = open("file1p.raw", "wb")
binary_file.write(img)
binary_file.close()
```

and I did a histogram of "file1p.raw" (16 bins per row, 256 bins total):

```
4704 58 41 46 56 37 57 53 0 0 0 0 0 0 0 0
57 45 47 58 53 52 60 63 0 0 0 0 0 0 0 0
66 72 45 73 58 75 68 75 0 0 0 0 0 0 0 0
95 79 84 102 83 112 103 106 0 0 0 0 0 0 0 0
137 141 148 162 181 176 216 202 0 0 0 0 0 0 0 0
379 328 341 418 405 435 478 526 0 0 0 0 0 0 0 0
891 908 964 1006 1077 1134 1190 1247 0 0 0 0 0 0 0 0
1783 1716 1741 1722 1774 1857 1917 1919 0 0 0 0 0 0 0 0
1852 1982 1911 1806 1816 1811 1709 1614 0 0 0 0 0 0 0 0
1336 1248 1219 1072 1131 1067 987 969 0 0 0 0 0 0 0 0
608 577 530 532 490 510 459 440 0 0 0 0 0 0 0 0
272 246 250 230 225 207 192 170 0 0 0 0 0 0 0 0
113 134 96 109 97 99 96 81 0 0 0 0 0 0 0 0
64 50 47 43 41 33 47 38 0 0 0 0 0 0 0 0
26 27 29 31 28 22 28 28 0 0 0 0 0 0 0 0
12 14 26 18 14 19 13 243 0 0 0 0 0 0 0 0
```

Note that only 128 non-zero values appear, with gaps of 8 zeros in the results.
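For reference, here is a minimal sketch of how such a histogram can be reproduced from the raw dump, assuming NumPy is available and the slice is stored as unsigned 8-bit values (which matches the 256 bins and 65,536 total counts):

```python
import numpy as np

# Read the raw dump produced by the snippet above; the quantized
# volume is 8-bit, so the 256x256 slice is 65,536 single-byte values.
img = np.fromfile("file1p.raw", dtype=np.uint8)
assert img.size == 256 * 256

# One histogram bin per possible 8-bit value.
counts = np.bincount(img, minlength=256)

# Print 16 bins per row, matching the layout of the dump above.
for row in range(16):
    print(" ".join(str(c) for c in counts[row * 16:(row + 1) * 16]))
```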

The vesuvius-c routines show the same problem.

bostelk commented 1 week ago

The Zarr volume was normalized and quantized (16 bits down to 8 bits) to reduce its size on disk. The zeros could be empty air that was clipped because it fell outside the value range representing denser materials like papyrus. To my knowledge the precision loss hasn't affected ink detection or other downstream applications, although retraining on the new dataset was required. (I'm paraphrasing from the Discord channel here.)

So it's not an issue in how the API accesses the data. There are zeros in the original dataset too, but they are less frequent.
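For illustration only, here is a minimal sketch of what a normalize-and-quantize step like the one described above might look like; the function name, clip limits, and rounding below are assumptions, not the actual conversion pipeline:

```python
import numpy as np

def quantize_16_to_8(volume_u16, lo, hi):
    """Sketch of a normalize-and-quantize step: clip to [lo, hi],
    rescale to [0, 255], and truncate to unsigned 8-bit. lo/hi are
    hypothetical clip limits (e.g. chosen so that air falls below
    lo and maps to 0, as described above)."""
    clipped = np.clip(volume_u16.astype(np.float64), lo, hi)
    scaled = (clipped - lo) / (hi - lo) * 255.0
    return scaled.astype(np.uint8)

# Example: values below the lower clip limit all collapse to 0,
# which would explain zeros where there was only empty air.
demo = np.array([500, 20000, 40000, 65535], dtype=np.uint16)
print(quantize_16_to_8(demo, lo=18000, hi=60000))  # -> [  0  12 133 255]
```

Note that a plain clip-and-scale like this would not by itself leave periodic gaps in the histogram, which is part of what makes the 8-wide gaps above surprising.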

Comparison

Here's a quick comparison between the two data sources in the region you highlighted.

[Image: comparison of the two data sources]

righthalfplane commented 5 days ago

I do not know what your chart is showing. The returned data set has 65,536 (256x256) pixels in it, not 306,912 points, and it looks like the included plot.

[Image: Screen Shot 2024-10-26 at 7.28.48 AM]

righthalfplane commented 3 days ago

@bostelk - What version of Python are you using? On Ubuntu 24.10, python2 would not completely install. I was able to do the install with python3, but only after putting in some fake links to the python2 stuff. The data returned had the same holes as I found in the macOS 12.7.3 version. Your histogram of the returned data looks very much like mine, except that the holes are filled in and it has too many points.

bostelk commented 3 days ago

I'm sorry, my earlier comparison was a bad illustration. I had copied the image into an editor and it was resized, hence the wrong number of points and the interpolated graph. I'm using Python 3.

I opened the volume in a different viewer (https://dl.ash2txt.org/view/Scroll1) and the issue (if it is one) is apparent there too. So I don't think it's caused by the reader, but rather by how the volume was created/converted from a higher-precision dataset. I don't have a concrete explanation, only speculation.

New Comparison

[Image: Figure_1, new histogram comparison]

jrudolph commented 2 days ago

Good catch! It seems that for some reason the quantization process ended up setting (only?) bit 3 to zero, so that every resulting value looks like xxxx0xxx in binary. If the intention was to quantize to 4 bits, the data set ended up with more precision than intended. But it also somewhat misrepresents the original data: zeroing a middle bit introduces an unintended bias, shifting the blanked values down by 8 into the next lower bucket.
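A quick way to check this reading against the raw dump from the first comment (a sketch assuming NumPy and the file1p.raw dump above; not part of either library):

```python
import numpy as np

# Load the 8-bit slice dumped in the first comment.
img = np.fromfile("file1p.raw", dtype=np.uint8)
values = np.unique(img)

# If the quantization cleared bit 3, no stored value should have it
# set: v & 0b1000 must be 0 for every value that actually occurs.
assert np.all(values & 0b1000 == 0), "some values have bit 3 set"

# Clearing bit 3 maps v to v - 8 whenever the bit was set, which is
# exactly the shift into the next lower bucket described above, and
# it leaves at most 128 of the 256 possible values in use.
print("distinct values present:", values.size)
```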

righthalfplane commented 2 days ago

I have been complaining about this problem for a month, and I see in the general discussion that people are going to fix it and redo some work. Will the fix get into the C and Python routines?

stephenrparsons commented 2 days ago

Yes, and thanks for bringing attention to it. Those libraries pull from the same data source, so as soon as the volumes are updated on the server, both libraries will have the revised data. You may wish to clear the local cache, if you have one, to make sure you get the new versions.
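In case it helps, clearing a local cache might look something like the sketch below; the path is hypothetical (check where your install actually caches data), not the library's documented location:

```python
import shutil
from pathlib import Path

# Hypothetical cache location -- verify where your vesuvius install
# actually stores downloaded chunks before deleting anything.
cache_dir = Path.home() / ".vesuvius" / "cache"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
    print(f"cleared {cache_dir}")
else:
    print(f"no cache found at {cache_dir}")
```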