colour-science / colour

Colour Science for Python
https://www.colour-science.org
BSD 3-Clause "New" or "Revised" License

[BUG]: Extremely slow or deadlock in initialization of MultiSpectralDistributions #1292

Closed peroveh closed 6 days ago

peroveh commented 1 week ago

Description

I am working with large sets of multispectral data (Zhang, 41 million reflectances), which come as zip archives of more than 10 million spectra at 31 wavelengths. I thought I could use MultiSpectralDistributions to hold the full set of spectra from a light and all reflectances, and convert this to XYZ, but it already fails at initialization of the object when I have more than roughly 1000 vectors (3 second timeout).

The original spectral data are huge files (link below):

data = scipy.io.loadmat('Ref5_10_unique_expand.mat')  # loads from a MATLAB file
# https://www2.cs.sfu.ca/~colour/data/Zhang_41MillionReflectances/

Below I test with a dummy data array of size 31 by 100 and subsequently try 1000 (I need millions).

With size (31, 1000), it takes a lot of CPU and time, and gives warnings. Shouldn't this datatype support large datasets, permitting fast vectorized conversion to XYZ, for example?

Code for Reproduction

import colour
import matplotlib.pyplot as plt
import numpy as np

wavelengths = np.arange(400, 710, 10)  # 31 samples, 400 to 700 nm
# Dummy data of shape (31, 100): one row per wavelength, one column per spectrum
data = np.matmul(np.arange(0, 31, 1).reshape(31, 1), np.arange(0, 100, 1).reshape(1, 100))
# Pass the domain as float32 to limit array size
spect = colour.MultiSpectralDistributions(data, domain=np.float32(wavelengths))
plt.plot(spect.wavelengths, spect.values)  # plots one hundred spectra
# Works, but slowly, and I cannot wait for hours. I also do not understand what type of computation it is doing.

Exception Message

Evaluating: spect = colour.MultiSpectralDistributions(data,domain=np.float32(wavelengths)) did not finish after 3.00 seconds.
This may mean a number of things:
- This evaluation is really slow and this is expected.
    In this case it's possible to silence this error by raising the timeout, setting the
    PYDEVD_WARN_EVALUATION_TIMEOUT environment variable to a bigger value.

- The evaluation may need other threads running while it's running:
    In this case, it's possible to set the PYDEVD_UNBLOCK_THREADS_TIMEOUT
    environment variable so that if after a given timeout an evaluation doesn't finish,
    other threads are unblocked or you can manually resume all threads.

    Alternatively, it's also possible to skip breaking on a particular thread by setting a
    `pydev_do_not_trace = True` attribute in the related threading.Thread instance
    (if some thread should always be running and no breakpoints are expected to be hit in it).

- The evaluation is deadlocked:
    In this case you may set the PYDEVD_THREAD_DUMP_ON_WARN_EVALUATION_TIMEOUT
    environment variable to true so that a thread dump is shown along with this message and
    optionally, set the PYDEVD_INTERRUPT_THREAD_TIMEOUT to some value so that the debugger
    tries to interrupt the evaluation (if possible) when this happens.

Environment Information

*       networkx : 2.8.8                                                      *
*       numpy : 1.21.5                                                        *
*       pandas : 1.5.1                                                        *
*       scipy : 1.8.0                                                         *
*       sklearn : 1.3.0                                                       *
*       tqdm : 4.64.1                                                         *
*                                                                             *
===============================================================================
defaultdict(<class 'dict'>, {'Interpreter': {'python': '3.8.13 (default, Mar 28 2022, 11:38:47) \n[GCC 7.5.0]'}, 'colour-science.org': {'colour': '0.4.1'}, 'Runtime': {'imageio': '2.22.4', 'matplotlib': '3.5.2', 'networkx': '2.8.8', 'numpy': '1.21.5', 'pandas': '1.5.1', 'scipy': '1.8.0', 'sklearn': '1.3.0', 'tqdm': '4.64.1'}})
KelSolaar commented 1 week ago

Hello,

This is not a defect, although we probably ought to update the documentation in that regard: in this case you should not be using colour.MultiSpectralDistributions but a numpy.ndarray instead.

Under the hood, colour.MultiSpectralDistributions uses instances of colour.SpectralDistribution, which in turn derives from the colour.continuous.Signal class. That class is an interpolator, so one can sample the spectrum at any arbitrary point. This offers a lot of flexibility for many things, but given the above it is going to be heavy with thousands or more spectra because of initialisation time and memory cost.

In your situation, an array is required, and you will find that the colour.msds_to_XYZ definition actually does work with arrays! We use it to perform hyperspectral image integration. You certainly lose quite a lot of flexibility in terms of manipulating the spectra, but it allows working with much larger datasets.
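
To illustrate why a plain array scales where per-spectrum objects do not, here is a minimal NumPy-only sketch of the kind of summation that spectral integration amounts to. It is not the implementation of colour.msds_to_XYZ: the function name spectra_to_XYZ is hypothetical, and the illuminant and colour matching function values are random placeholders standing in for real datasets sampled at the same 31 wavelengths.

```python
import numpy as np


def spectra_to_XYZ(reflectances, illuminant, cmfs, interval):
    """Hypothetical helper: integrate reflectance spectra to XYZ.

    reflectances: (n_spectra, n_wavelengths), wavelengths on the last axis
    illuminant:   (n_wavelengths,) illuminant power at each wavelength
    cmfs:         (n_wavelengths, 3) colour matching functions
    interval:     wavelength step in nm
    """
    reflectances = np.asarray(reflectances)
    # Per-wavelength weights for X, Y and Z: illuminant * cmfs * step.
    weights = illuminant[:, np.newaxis] * cmfs * interval
    # Normalise so that a perfect reflector integrates to Y = 100.
    k = 100.0 / weights[:, 1].sum()
    # A single matrix product handles every spectrum at once.
    return reflectances @ (k * weights)


wavelengths = np.arange(400, 710, 10)  # 31 samples
rng = np.random.default_rng(0)

# Placeholder values only, NOT real colorimetric data.
illuminant = np.ones(wavelengths.size)
cmfs = rng.random((wavelengths.size, 3))
reflectances = rng.random((100_000, wavelengths.size), dtype=np.float32)

XYZ = spectra_to_XYZ(reflectances, illuminant, cmfs, interval=10)
white = spectra_to_XYZ(np.ones((1, wavelengths.size)), illuminant, cmfs, interval=10)
print(XYZ.shape, white[0, 1])
```

The whole conversion reduces to one (n_spectra, 31) by (31, 3) matrix product, with no interpolator built per spectrum. In practice colour.msds_to_XYZ would replace the hypothetical helper; its documentation describes the arguments it expects for array input.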

Cheers,

Thomas