broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License
285 stars 52 forks source link

Fix posterior and estimator integer overflow bugs on Windows #259

Closed sjfleming closed 11 months ago

sjfleming commented 1 year ago

The data structure used to store the full posterior is a [m, c] matrix, where m is n_barcodes times n_genes. So m can be absolutely massive.

When m > maximum possible int32:

Including explicit casts to uint64 seems to solve the problem on Windows (pending input on #252 ).

I have included a test which I believe might fail on Windows without this fix, though I do not have a Windows setup to test on yet.

ps2504 commented 1 year ago

Hello,

I made the two code changes in posterior.py, however, I still get the exact same error. I'm not sure how to run the "test" using the files within the test folder...

Thanks, Paul

cellbender:remove-background: Computing target noise counts per gene for MCKP estimator Traceback (most recent call last): File "C:\Users\SuP1.conda\envs\cellbender\Scripts\cellbender-script.py", line 33, in sys.exit(load_entry_point('cellbender', 'console_scripts', 'cellbender')()) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\base_cli.py", line 123, in main cli_dict[args.tool].run(args) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\cli.py", line 185, in run return main(args) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\cli.py", line 230, in main posterior = run_remove_background(args) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\run.py", line 133, in run_remove_background file_name=file_name, File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\run.py", line 237, in compute_output_denoised_counts_reports_metrics per_gene=True, File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\posterior.py", line 1579, in compute_mean_target_removal_as_function device=device, File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\estimation.py", line 150, in estimate_noise dtype=np.float32) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\estimation.py", line 77, in _estimation_array_to_csr dtype=dtype, File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\estimation.py", line 238, in _estimation_array_to_csr row, col = index_converter.get_ng_indices(m_inds=m) File "c:\users\sup1\downloads\cellbender-0.3.0\cellbender-0.3.0\cellbender\remove_background\posterior.py", line 1537, in get_ng_indices raise ValueError(f'Requested m_inds out of range: ' ValueError: Requested m_inds out of range: [-2147483648 -2147483648 -2147483648 ... -2147483648 -2147483648 -2147483648]

sjfleming commented 1 year ago

@ps2504 it looks like you have cloned the github repository to your local machine, so all you have to do to run the relevant tests would be (in the folder where the cellbender code is located)

git checkout sf_posterior_int_overflow
pip install pytest
pytest -k test_save_and_load cellbender/remove_background/tests/test_posterior.py

if tests pass, you'd see an output like this

=================================================== test session starts ===================================================
platform darwin -- Python 3.7.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /Users/sfleming/Documents/Github/CellBender
collected 76 items / 72 deselected / 4 selected                                                                           

cellbender/remove_background/tests/test_posterior.py ....                                                           [100%]

============================================ 4 passed, 72 deselected in 0.19s =============================================

But based on what you said about getting the same error, I'd be expecting the test to fail. I will try to add a Windows configuration to the automatic tests that run in github.

ps2504 commented 1 year ago

Hello again,

Okay I was able to run the test per your instructions: INTERESTINGLY, the test looks like there's no issues. but I went back and reran the code again but got the same error

(cellbender) C:\Users\SuP1\Downloads\CellBender>pytest -k test_save_and_load cellbender/remove_background/tests/test_posterior.py =========================== test session starts ============================ platform win32 -- Python 3.7.16, pytest-7.4.0, pluggy-1.2.0 rootdir: C:\Users\SuP1\Downloads\CellBender plugins: anyio-3.5.0 collected 76 items / 72 deselected / 4 selected

cellbender\remove_background\tests\test_posterior.py .... [100%]

===================== 4 passed, 72 deselected in 0.18s =====================

ps2504 commented 1 year ago

The last call back error was in posterior.py, but the one prior is in estimation.py. I have no idea but could that have something to do with it?

Thank you so much for looking into it.

ps2504 commented 1 year ago

I pasted the error into chatgpt and this is what it returned... I hope this might be useful?

The error message you provided indicates that there is an issue with the input data, specifically with the indices provided in the m_inds array. It seems like the values in the m_inds array are set to -2147483648, which is the minimum value for a 32-bit signed integer (int32). This value is often used as a sentinel value or a placeholder in various programming contexts.

Here are a few things you can check and consider to diagnose and address this issue:

Index Range: The error message suggests that the values in m_inds are out of range. Check if the indices you're using are within the valid range for the data structure or function you're using. Make sure that the indices are non-negative and within the appropriate range for the specific operation you're performing.

Data Type: Ensure that the data type of the m_inds array is correct. If the data type is incorrectly specified or if there's a type mismatch, it can lead to unexpected behavior or errors. For example, using a data type that is too small to represent the indices can result in overflow or wrapping of values.

Initialization: If the m_inds array is being initialized with default values, such as -2147483648, it might indicate an issue with the way the array is being created or populated. Make sure that the array is being initialized correctly and that the values are being assigned as intended.

Data Integrity: Check the source of the data you're using to populate the m_inds array. It's possible that there's a bug or a data preprocessing issue that's causing incorrect values to be assigned to the array.

Library or Framework Specifics: If you're using a specific library or framework for numerical computations, consult its documentation to understand the expected input formats and data ranges. Different libraries might have different requirements for input data.

Data Exploration: Before passing the data to the function that's causing the error, print out the contents of the m_inds array to see if the values are indeed what you expect them to be. This can help you identify any inconsistencies or unexpected values.

Without more context about the specific code you're working with and the function that's causing the error, it's a bit challenging to provide a more precise solution. However, I hope these general suggestions help you identify and resolve the issue you're facing.

ps2504 commented 1 year ago

Hi. Just wanted to update that the most recent changes you made (Make dtype of m explicit in apply_function_dense_chunks) seems to have done the trick.

Thanks a bunch!

sjfleming commented 1 year ago

@ps2504 wonderful! Great news!