NeuralEnsemble / elephant

Elephant is the Electrophysiology Analysis Toolkit
http://www.python-elephant.org
BSD 3-Clause "New" or "Revised" License

[Bug] Ineffective alpha-level for SPADE #620

Open Idavr opened 6 months ago

Idavr commented 6 months ago

Hello there! I am trying to identify patterns in my neural recordings and am now testing SPADE, which initially seems very promising for what we want to do. However, I discovered that many of the patterns flagged as "non-significant" had p-values below the alpha I specified in the parameters (alpha = 0.05). Is there something I am missing in how the statistical significance is computed and corrected for?

Best,

Idavr


Moritz-Alexander-Kern commented 6 months ago

Hello @Idavr , thanks for reaching out to us.

At first glance I agree with your assessment. However, the SPADE module has a number of functions and parameters.

Did you use the elephant.spade.spade function? According to the docs, the filtering step depends on at least two parameters: alpha and approx_stab_pars.

See also: https://elephant.readthedocs.io/en/latest/reference/_toctree/spade/elephant.spade.spade.html#elephant.spade.spade

In a final step, concepts are filtered according to a stability threshold and a significance level alpha.

Did you use the elephant.spade.test_signature_significance function? https://elephant.readthedocs.io/en/latest/reference/_toctree/spade/elephant.spade.test_signature_significance.html#elephant.spade.test_signature_significance Which statistical correction did you use?

Bonferroni or FDR statistical corrections can be applied.

It could be very helpful, if you could provide a minimal code example or give more details on how you ran SPADE.

Hope this helps, unfortunately I'm not an expert on the method itself.

Moritz-Alexander-Kern commented 6 months ago

This might also be related to https://github.com/NeuralEnsemble/elephant/issues/439, which points out that the pattern filtering functions, and SPADE in general, require more detailed documentation.

Idavr commented 6 months ago

Hi @Moritz-Alexander-Kern

Thanks for responding, apologies for the delay in getting back to you.

Using PyCharm, I ran this specific line of code, where bins = 10, from inside my environment for elephant:

patterns = spade(spiketrains, bin_size=bins*pq.ms, winlen=1, dither=5*pq.ms, spectrum='3d#', min_spikes=6, min_occ=3, min_neu=2, alpha=0.05, n_surr=1000, psr_param=[0, 0, 0])

The statistical correction used is fdr_bh. Reading through the code, calling spade as I have done above should also run the test_signature_significance function, which is the one I have been reading through, trying to understand what might be throwing things off.

However, I do not know the approx_stab_pars parameter and cannot seem to find good documentation on how it might be used. Any further insight into how to better use this method is highly appreciated; from the small things I have gotten to work, it seems to be exactly what I have been looking for.

Idavr commented 6 months ago

As a workaround, I am trying to see how the same information can be stored for the non-significant patterns as well, since they are currently only stored with their signature and p-value. Also storing the neuron numbers and time windows for the non-significant patterns would at least give me data to work with, regardless of the final statistical value.

Moritz-Alexander-Kern commented 6 months ago

Hey @Idavr ,

I created the following minimal code example to investigate this: (adapted from https://elephant.readthedocs.io/en/latest/tutorials/spade.html)

import quantities as pq
import elephant
import numpy as np

np.random.seed(4542)

# Generate spike trains with synchronous spikes
spiketrains = elephant.spike_train_generation.compound_poisson_process(
    rate=5 * pq.Hz,
    amplitude_distribution=[0] + [0.974] + [0] * 7 + [0.006] + [0.02],
    t_stop=10 * pq.s,
)
# Generate random spike trains
spiketrains.extend(
    elephant.spike_train_generation.StationaryPoissonProcess(
        rate=5 * pq.Hz, t_stop=10 * pq.s
    ).generate_n_spiketrains(90)
)
# Set alpha
alpha = 0.05
# Mining patterns with spade
patterns = elephant.spade.spade(
    spiketrains,
    bin_size=10 * pq.ms,
    winlen=1,
    dither=5 * pq.ms,
    spectrum="3d#",
    min_spikes=6,
    min_occ=3,
    min_neu=2,
    alpha=alpha,
    n_surr=1000,
    psr_param=[0, 0, 0],
)
# Print result
print("Patterns dict:")
print(patterns)
# Print p-value spectrum
print("pvalue_spectrum: [size, number of occurrences, duration, p-value]")
print(patterns["pvalue_spectrum"])
# Print non-significant signatures
print("Non-significant signatures of ‘pvalue_spectrum’: [size, number of occurrences, duration]")
print(patterns["non_sgnf_sgnt"])
# Find p-values for non-significant signatures and print:
for non_sgnf_sgnt in patterns["non_sgnf_sgnt"]:
    for pvalue_spectrum in patterns["pvalue_spectrum"]:
        if pvalue_spectrum[:3] == [int(i) for i in non_sgnf_sgnt]:
            print(
                f"Non-significant signature:{non_sgnf_sgnt}, p-value:{pvalue_spectrum[3]}"
            )

which yields:

Time for data mining: 0.08771944046020508
Time for pvalue spectrum computation: 85.89613366127014
Patterns dict:
{'pvalue_spectrum': [[6, 3, 0, 0.992], [6, 4, 0, 0.776], [6, 5, 0, 0.255], [6, 6, 0, 0.044], [6, 7, 0, 0.007], [7, 3, 0, 0.891], [7, 4, 0, 0.336], [7, 5, 0, 0.048], [7, 6, 0, 0.004], [8, 3, 0, 0.504], [8, 4, 0, 0.056], [8, 5, 0, 0.001], [9, 3, 0, 0.075], [9, 4, 0, 0.002], [10, 3, 0, 0.001]], 'non_sgnf_sgnt': [(6.0, 3.0, 0.0), (6.0, 4.0, 0.0), (6.0, 5.0, 0.0), (6.0, 6.0, 0.0), (6.0, 7.0, 0.0), (7.0, 3.0, 0.0), (7.0, 4.0, 0.0), (7.0, 5.0, 0.0), (7.0, 6.0, 0.0), (8.0, 3.0, 0.0), (8.0, 4.0, 0.0), (8.0, 5.0, 0.0), (9.0, 3.0, 0.0), (9.0, 4.0, 0.0), (10.0, 3.0, 0.0)], 'patterns': [{'itemset': (4, 1, 9, 0, 6, 5, 7, 8, 3), 'windows_ids': (6, 94, 177, 351, 494, 723, 942, 981), 'neurons': [4, 1, 9, 0, 6, 5, 7, 8, 3], 'lags': array([0., 0., 0., 0., 0., 0., 0., 0.]) * ms, 'times': array([  60.,  940., 1770., 3510., 4940., 7230., 9420., 9810.]) * ms, 'signature': (9, 8, 0), 'pvalue': 0.0}]}
pvalue_spectrum: [size, number of occurrences, duration, p-value]
[[6, 3, 0, 0.992], [6, 4, 0, 0.776], [6, 5, 0, 0.255], [6, 6, 0, 0.044], [6, 7, 0, 0.007], [7, 3, 0, 0.891], [7, 4, 0, 0.336], [7, 5, 0, 0.048], [7, 6, 0, 0.004], [8, 3, 0, 0.504], [8, 4, 0, 0.056], [8, 5, 0, 0.001], [9, 3, 0, 0.075], [9, 4, 0, 0.002], [10, 3, 0, 0.001]]
Non-significant signatures of ‘pvalue_spectrum’: [size, number of occurrences, duration]
[(6.0, 3.0, 0.0), (6.0, 4.0, 0.0), (6.0, 5.0, 0.0), (6.0, 6.0, 0.0), (6.0, 7.0, 0.0), (7.0, 3.0, 0.0), (7.0, 4.0, 0.0), (7.0, 5.0, 0.0), (7.0, 6.0, 0.0), (8.0, 3.0, 0.0), (8.0, 4.0, 0.0), (8.0, 5.0, 0.0), (9.0, 3.0, 0.0), (9.0, 4.0, 0.0), (10.0, 3.0, 0.0)]
Non-significant signature:(6.0, 3.0, 0.0), p-value:0.992
Non-significant signature:(6.0, 4.0, 0.0), p-value:0.776
Non-significant signature:(6.0, 5.0, 0.0), p-value:0.255
Non-significant signature:(6.0, 6.0, 0.0), p-value:0.044
Non-significant signature:(6.0, 7.0, 0.0), p-value:0.007
Non-significant signature:(7.0, 3.0, 0.0), p-value:0.891
Non-significant signature:(7.0, 4.0, 0.0), p-value:0.336
Non-significant signature:(7.0, 5.0, 0.0), p-value:0.048
Non-significant signature:(7.0, 6.0, 0.0), p-value:0.004
Non-significant signature:(8.0, 3.0, 0.0), p-value:0.504
Non-significant signature:(8.0, 4.0, 0.0), p-value:0.056
Non-significant signature:(8.0, 5.0, 0.0), p-value:0.001
Non-significant signature:(9.0, 3.0, 0.0), p-value:0.075
Non-significant signature:(9.0, 4.0, 0.0), p-value:0.002
Non-significant signature:(10.0, 3.0, 0.0), p-value:0.001

Do I understand correctly that your initial inquiry concerns why certain patterns are labeled as non-significant even though their p-values are lower than the specified alpha threshold?

Upon examining the implementation, it appears that the p-values generated by the spade function, referred to as the "p-value spectrum," are uncorrected. Depending on parameter choices, these uncorrected p-values may not be used for the filtering.
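To illustrate how a multiple-testing correction can turn signatures with raw p < alpha into non-significant ones, here is a small sketch applied to the p-values from the spectrum above. Note this is a plain-Python re-implementation of Bonferroni and Benjamini-Hochberg for illustration, not elephant's actual filtering code:

```python
# Raw p-values from the example p-value spectrum (one per tested signature)
alpha = 0.05
pvals = [0.992, 0.776, 0.255, 0.044, 0.007, 0.891, 0.336, 0.048,
         0.004, 0.504, 0.056, 0.001, 0.075, 0.002, 0.001]
m = len(pvals)

# Bonferroni: each p-value is compared against alpha / m
bonferroni_signif = [p < alpha / m for p in pvals]

# Benjamini-Hochberg (FDR): find the largest k with p_(k) <= (k / m) * alpha,
# then accept the k smallest p-values
order = sorted(range(m), key=lambda i: pvals[i])
k_max = 0
for k, i in enumerate(order, start=1):
    if pvals[i] <= k / m * alpha:
        k_max = k
bh_signif = [False] * m
for k, i in enumerate(order, start=1):
    if k <= k_max:
        bh_signif[i] = True

print(sum(p < alpha for p in pvals), "signatures below alpha uncorrected")
print(sum(bonferroni_signif), "significant after Bonferroni")
print(sum(bh_signif), "significant after Benjamini-Hochberg (FDR)")
```

With these numbers, signatures such as (6, 6, 0) with p = 0.044 and (7, 5, 0) with p = 0.048 are below the raw alpha but do not survive either correction, which could explain the behavior you observed.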

If you prefer to manage the filtering manually using alpha as a threshold, you can set alpha to None and stat_corr to 'no'. This configuration will yield unfiltered results, allowing you to implement your own filtering criteria.

patterns = elephant.spade.spade(
    spiketrains,
    bin_size=10 * pq.ms,
    winlen=1,
    dither=5 * pq.ms,
    spectrum="3d#",
    min_spikes=6,
    min_occ=3,
    min_neu=2,
    alpha=None,
    n_surr=1000,
    psr_param=[0, 0, 0],
    stat_corr='no'
)
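As a sketch of what the manual filtering could then look like, assuming each entry in the returned 'patterns' list carries a raw 'pvalue' key as in the example output above (the small dict below is an illustrative stand-in, not real SPADE output):

```python
# Illustrative stand-in for the dict returned by elephant.spade.spade
# (keys mirror the example output above; the values are made up)
patterns = {
    "patterns": [
        {"signature": (9, 8, 0), "pvalue": 0.0},
        {"signature": (6, 6, 0), "pvalue": 0.044},
        {"signature": (6, 3, 0), "pvalue": 0.992},
    ]
}

alpha = 0.05
# Apply a plain, uncorrected threshold by hand
significant = [p for p in patterns["patterns"] if p["pvalue"] < alpha]
non_significant = [p for p in patterns["patterns"] if p["pvalue"] >= alpha]

print(len(significant), "significant,", len(non_significant), "non-significant")
```

This keeps the full pattern entries (neurons, times, lags) for both groups, so nothing is discarded by the internal filtering.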

For approx_stab_pars, see https://doi.org/10.1007/978-3-642-29892-9_7 for more information.

Idavr commented 6 months ago

Hi @Moritz-Alexander-Kern

Thanks for looking into this. I have previously run a test with stat_corr = 'no' and alpha = 0.05, but that resulted in 0 patterns flagged as significant, and the non-significant patterns still had p-values below 0.05. Another test with alpha = 1 resulted in only significant patterns, but the total number was 15870 patterns in a 7-second snippet of the recording! The total number of significant + non-significant patterns for the same time interval has previously hovered around 120 at most.

Does this make sense to you? Based on your explanation, and what I thought myself, the p-values for the non-significant patterns should indeed be the uncorrected ones. As I was trying to get at the uncorrected significant patterns, I thought the stat_corr = 'no' parameter by itself would help. But according to you, combining alpha = None AND stat_corr = 'no' would work more in line with my thinking, is that so? Or would it still explode the number of significant patterns like with alpha = 1?

Moritz-Alexander-Kern commented 6 months ago

Hi @Idavr ,

While I don't fully grasp the reasoning behind it, it seems plausible that using stat_corr='no' and alpha=None could be the closest approach to achieving the intended outcome.

Does this make sense to you?

Please, allow me some time to look into the code.

It seems that varying those parameters can lead to behavior which is not obvious, i.e. there are different code execution paths for different combinations of alpha and stat_corr. This is why the documentation needs clarification on those parameters, see #439. Consider the docs for alpha: "The significance level of the hypothesis tests performed." If you look into the source code, you will find if-else statements that bifurcate the control flow depending on the parameter choice and combination. In this context, reasoning inductively from the documented behavior of each parameter in isolation might not lead to clarification or further insight.

Idavr commented 5 months ago

Hi @Moritz-Alexander-Kern,

Thank you for looking into this, I am grasping more how this works now. I am currently running a larger analysis with those parameters and will let you know how it goes.

Appreciate your time and fingers crossed this solves my conundrum.

Idavr commented 5 months ago

Hi @Moritz-Alexander-Kern

Just wanted to update on the progress on this. I have let the analysis of 31 seconds of data run for over two weeks with the parameters specified above, but it looks like it got locked in some kind of loop after computing the p-value spectrum, as most of these two weeks of runtime came after that processing step. This is far longer than any other parameter test I have done (on a relatively powerful computer), and without any other reference points or outputs I will have to abort it soon if the parameters have actually pushed it into a loop. I cannot immediately see from the code why this should happen. As time dragged on, I assumed it was simply taking this long because it was going over and writing out all the identified patterns, but my belief that this is the case is diminishing.

If you have any input or knowledge of what possibly is going on then any thoughts are appreciated.

Moritz-Alexander-Kern commented 5 months ago

Hi @Idavr ,

To date, I haven't come across a scenario where SPADE would become stuck in an endless loop. If that's the case here it would be great to find and fix this issue.

My understanding is that it's conceivable that suboptimal parameter choices could result in excessively lengthy runtimes. However, without insight into the code and the specifics of the machines you're using for your analysis, offering meaningful advice is challenging. If you could provide a minimal code snippet that replicates the problem, I'd be able to investigate further. As it stands, my assessments would be somewhat speculative.

Generally speaking, analysis workflows are often IO-bound, so I agree with your assessment that reading and writing to disk could be a possible bottleneck here.

Idavr commented 5 months ago

Hi @Moritz-Alexander-Kern,

I understand. I am looking into being able to share the data, but until then I am running additional tests. One of these is to confirm that MPI is actually being utilized in this process, which I am now a bit uncertain about. In the spade.py script, I put these lines at the very beginning with the rest of the imports:


from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank() 
print('Using MPI for parallelization!')
print('Number of MPI processes : {}'.format(size))

However, this is not being printed when I run the program. Neither are any of the other print statements I inserted deeper into the script, at the point where


if HAVE_MPI:  # pragma: no cover
        comm = MPI.COMM_WORLD  # create MPI communicator
        rank = comm.Get_rank()  # get rank of current MPI task
        size = comm.Get_size()  # Get the total number of MPI processes
    else:
        rank = 0

was written, making it say:

if HAVE_MPI:  # pragma: no cover
        comm = MPI.COMM_WORLD  # create MPI communicator
        rank = comm.Get_rank()  # get rank of current MPI task
        size = comm.Get_size()  # Get the total number of MPI processes
        print('Using MPI for parallelization!')
        print('Number of MPI processes : {}'.format(size))
    else:
        rank = 0
        print('No MPI.')

Still no print messages mentioning MPI being printed at all.

This worries me as I am unable to know whether MPI is activated or not. Have you encountered this before?

Moritz-Alexander-Kern commented 5 months ago

Hi @Idavr ,

It sounds like you're facing some challenges with ensuring MPI is correctly set up and utilized in your script. It's essential to confirm MPI's activation to ensure proper parallelization. This might also be the reason for your analysis getting stuck in a deadlock (a race condition could also occur; this might also be relevant for #627?).

From your description, it seems the print statements related to MPI aren't being executed, which indeed should not be the case. One reason might be the wrong indentation level of the else clause in your example; it should be:

if HAVE_MPI:  # pragma: no cover
    comm = MPI.COMM_WORLD  # create MPI communicator
    rank = comm.Get_rank()  # get rank of current MPI task
    size = comm.Get_size()  # get the total number of MPI processes
    print('Using MPI for parallelization!')
    print('Number of MPI processes : {}'.format(size))
else:
    rank = 0
    print('No MPI.')

Here are a few additional points to consider:

MPI installation: Ensure that MPI is properly installed and configured on your system. Sometimes issues arise from mismatched versions or incomplete installations.

Did you follow the setup process described in the docs? See the Elephant docs on MPI support: https://elephant.readthedocs.io/en/latest/install.html#mpi-support

Script Execution: When running your Python script with MPI, make sure you're executing it in a manner that MPI recognizes. Depending on your MPI implementation (e.g., Open MPI, MPICH), the command to run MPI-enabled Python scripts might differ. Typically, you'd use mpiexec or mpirun followed by your Python script, e.g.

$ mpiexec -n 4 python spade.py

See also: https://mpi4py.readthedocs.io/en/stable/tutorial.html#running-python-scripts-with-mpi
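A quick way to check this in isolation is a tiny standalone script (a sketch mirroring the HAVE_MPI flag pattern from spade.py, which is presumably set via a try/except around the import; the filename check_mpi.py is just a suggestion):

```python
# Minimal check: is mpi4py importable, and how many ranks do we get?
try:
    from mpi4py import MPI
    HAVE_MPI = True
except ImportError:
    HAVE_MPI = False

if HAVE_MPI:
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
else:
    # Fall back to a single serial "rank", as spade.py does
    rank, size = 0, 1

print(f"mpi4py importable: {HAVE_MPI}, rank {rank} of {size}")
```

Run it once as `python check_mpi.py` and once as `mpiexec -n 4 python check_mpi.py`; if the second invocation does not print four lines with ranks 0 through 3, MPI is not actually being picked up by your environment.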

mpi4py Documentation: If you've verified the above and still encounter issues, reaching out to the mpi4py maintainers or consulting their documentation might provide valuable insights. They may offer specific guidance or troubleshooting steps tailored to mpi4py usage.

https://mpi4py.readthedocs.io/en/stable/index.html

Hope this helps, feel free to reach out again.