broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

cannot convert float NaN to integer #62

Closed DanieleMuraro closed 4 years ago

DanieleMuraro commented 4 years ago

To whom it may concern,

I ran cellbender on a scRNA-Seq+CRISPR dataset derived from iPSCs. I used the cellranger output in the folder filtered_feature_bc_matrix as the input for cellbender. The cellranger output includes both Gene Expression and CRISPR Guide Capture, so the features.tsv file looks as follows:

...
ENSG00000275063 AC233755.1  Gene Expression
ENSG00000271254 AC240274.1  Gene Expression
ENSG00000277475 AC213203.1  Gene Expression
ENSG00000268674 FAM231C Gene Expression
TREM2-1 TREM2-1 CRISPR Guide Capture
TREM2-2 TREM2-2 CRISPR Guide Capture
TREM2-3 TREM2-3 CRISPR Guide Capture
NEG_CTRL-1  NEG_CTRL-1  CRISPR Guide Capture
NEG_CTRL-2  NEG_CTRL-2  CRISPR Guide Capture
NEG_CTRL-3  NEG_CTRL-3  CRISPR Guide Capture

I renamed features.tsv to genes.tsv, to match the format reported in the documentation:

cellbender doc

I then ran the command:

cellbender remove-background \
     --input ./filtered_feature_bc_matrix \
     --output ./erica_ipcs.h5

This led to the output:

(CellBender) MacBook-Pro-4:Miseq_10x_iPSC_082019 daniele$ cat out.run_cellbender 
cellbender:remove-background: Command:
cellbender remove-background --input ./filtered_feature_bc_matrix --output ./erica_ipcs.h5
cellbender:remove-background: 2020-06-25 12:31:58
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from directory ./filtered_feature_bc_matrix
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
/Users/daniele/anaconda3/envs/CellBender/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/daniele/anaconda3/envs/CellBender/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/Users/daniele/anaconda3/envs/CellBender/bin/cellbender", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 92, in run
    main(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 185, in main
    run_remove_background(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 90, in __init__
    gene_blacklist=gene_blacklist)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 205, in _trim_dataset_for_analysis
    get_d_priors_from_dataset(self)  # After gene trimming
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 1064, in get_d_priors_from_dataset
    cell_counts = int(np.expm1(cell_log_counts).item())
ValueError: cannot convert float NaN to integer

Could you please help me understand what the problem is?

Thank you for your attention.

With best wishes,

Daniele Muraro
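Reading the traceback above: the "Mean of empty slice" warning followed by "cannot convert float NaN to integer" is consistent with a mean being taken over an empty selection and the resulting NaN then being passed to int(). The following is a minimal sketch of that mechanism only, not CellBender's actual code. The empty array and the name log_counts are made up for illustration; only the last expression mirrors the final line of the traceback.

```python
import warnings

import numpy as np

# Hypothetical stand-in: no barcodes were selected, so the array is empty.
log_counts = np.array([], dtype=float)

with warnings.catch_warnings():
    # Silence the "Mean of empty slice" RuntimeWarnings for this demo.
    warnings.simplefilter("ignore", category=RuntimeWarning)
    cell_log_counts = np.mean(log_counts)  # mean of nothing is NaN

message = None
try:
    # Same shape of expression as the last frame in the traceback.
    cell_counts = int(np.expm1(cell_log_counts).item())
except ValueError as err:
    message = str(err)

print(message)
```

If this reading is right, the thing to investigate is why the selection was empty, rather than the int() call itself.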

mbabadi commented 4 years ago

Dear @DanieleMuraro,

Thank you for using CellBender!

It looks like an input data inconsistency. Can you check that the shape of the input count matrix is consistent with the barcodes and genes tsv files? Also, can you confirm that the count matrix does not include any bad values (i.e. that all entries are non-negative integers)?
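One way to run both checks is sketched below, using only the Python standard library. It assumes the usual 10x layout (rows of matrix.mtx are features, columns are barcodes) and a Matrix Market coordinate file. The function names are hypothetical helpers written for this thread and are not part of CellBender.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_lines(path):
    """Count the lines in a text file."""
    with open_text(path) as handle:
        return sum(1 for _ in handle)


def check_consistency(matrix_path, features_path, barcodes_path):
    """Return a list of problems found (empty if everything looks consistent)."""
    problems = []
    size = None
    n_seen = 0
    with open_text(matrix_path) as handle:
        for line in handle:
            if line.startswith("%") or not line.strip():
                continue  # banner, comment, or blank line
            fields = line.split()
            if size is None:
                # First data line holds: rows, columns, number of entries.
                size = tuple(int(x) for x in fields)
                continue
            n_seen += 1
            try:
                valid = int(fields[2]) >= 0  # must be a non-negative integer
            except (ValueError, IndexError):
                valid = False
            if not valid:
                problems.append(f"bad entry: {line.strip()!r}")
    if size is None:
        return ["matrix file has no size line"]
    n_rows, n_cols, n_entries = size
    n_features = count_lines(features_path)
    n_barcodes = count_lines(barcodes_path)
    if n_features != n_rows:
        problems.append(f"{n_features} features but {n_rows} matrix rows")
    if n_barcodes != n_cols:
        problems.append(f"{n_barcodes} barcodes but {n_cols} matrix columns")
    if n_seen != n_entries:
        problems.append(f"{n_seen} entries found but header declares {n_entries}")
    return problems
```

Calling check_consistency("matrix.mtx", "genes.tsv", "barcodes.tsv") inside the matrix folder returns an empty list when the three files agree.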

DanieleMuraro commented 4 years ago

Dear @mbabadi,

Thank you very much for your kind reply. The count matrix does not include negative or NA values. As regards the dimensions, they seem to be consistent:

(base) MacBook-Pro-4:filtered_feature_bc_matrix daniele$ wc -l barcodes.tsv
    1784 barcodes.tsv
(base) MacBook-Pro-4:filtered_feature_bc_matrix daniele$ wc -l genes.tsv 
   33544 genes.tsv
(base) MacBook-Pro-4:filtered_feature_bc_matrix daniele$ head -n 5 matrix.mtx
%%MatrixMarket matrix coordinate integer general
%metadata_json: {"format_version": 2, "software_version": "3.1.0"}
33544 1784 1987712
33543 1 10
33509 1 13

Does the package work with CRISPR guides? Thanks again for your help,

Daniele

sjfleming commented 4 years ago

Hi @DanieleMuraro, I think I see what the issue is. Two things:

  1. The tool makes use of the information in the empty droplets, so you will need to use the raw_feature_bc_matrix folder as the input rather than the filtered_feature_bc_matrix. But, even better, try using the raw_feature_bc_matrix.h5 file.

  2. The documentation that you referenced did not keep up with the changes to the output filenames for CellRanger v3... while that link says it expects genes.tsv, that is only for CellRanger v2 outputs. If you have CellRanger v3 outputs, then the tool will accept the features.tsv.gz file without needing to rename it.

But, if you have access to the raw_feature_bc_matrix.h5 file, it might be easier to use that as the input. Let me know if that works!

(What I think is causing the error: since the tool finds the file called genes.tsv, it assumes it is dealing with CellRanger v2 outputs. But since the input is really CellRanger v3, it ends up looking for data in the wrong place.)
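The file-name ambiguity described above can be caught before running the tool. The sketch below is a hypothetical pre-flight check written for illustration. It is not how CellBender detects the format, and the function name and return values are assumptions.

```python
from pathlib import Path


def guess_cellranger_layout(directory):
    """Guess the CellRanger matrix layout from the file names in a folder."""
    names = {p.name for p in Path(directory).iterdir()}
    has_v2 = "genes.tsv" in names
    has_v3 = bool({"features.tsv", "features.tsv.gz"} & names)
    if has_v2 and has_v3:
        return "ambiguous"  # both naming schemes present
    if has_v3:
        return "v3"
    if has_v2:
        return "v2"
    return "unknown"
```

A folder of v3 output whose features file was renamed to genes.tsv would be reported as "v2" here, which is the kind of mix-up suspected above.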

sjfleming commented 4 years ago

As for the CRISPR guides, that's a great question! In the current version 0.1 of CellBender remove-background, the tool only looks at the features that are denoted as "Gene Expression". However, the mathematical model is equally good for other types of data, including "Antibody Capture". Until you mentioned it, I did not realize that 10x had made "CRISPR Guide Capture" an option, but that is a really cool idea.

So I'll venture a few guesses: if the CRISPR guides are subject to the same types of noise (ambient and swapping) that we mention in our paper (https://www.biorxiv.org/content/10.1101/791699v1), then I would expect the model / tool to perform well on the CRISPR guide counts. Can the CRISPR guides become cell-free ambient? Is that a large source of background counts? I haven't had the chance to explore a dataset with CRISPR guides yet.

In the branch called sf_removebkg_v2.1, which is a semi-stable development version that we are working on, all of the "features" are kept, not just Gene Expression. So if you try out that branch of the code, it will run on your CRISPR guide data as well. We expect to develop this branch into the next official release, accompanied by a publication.

If you do try to run remove-background on your CRISPR guide data, I would love to see how it looks!
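To see which feature types a dataset contains, and which rows a "Gene Expression"-only analysis would keep, the features file can be tallied directly. This is a small sketch that assumes three tab-separated columns (ID, name, feature type), with sample rows taken from the listing earlier in this thread. The helper name is hypothetical.

```python
from collections import Counter

# A few rows in the three-column, tab-separated features.tsv layout.
FEATURES_TSV = "\n".join([
    "ENSG00000268674\tFAM231C\tGene Expression",
    "TREM2-1\tTREM2-1\tCRISPR Guide Capture",
    "NEG_CTRL-1\tNEG_CTRL-1\tCRISPR Guide Capture",
])


def parse_feature_types(text):
    """Return the feature type (third column) of each non-empty row."""
    return [line.split("\t")[2] for line in text.splitlines() if line.strip()]


feature_types = parse_feature_types(FEATURES_TSV)
type_counts = Counter(feature_types)

# Row indices that a "Gene Expression"-only analysis would keep.
gene_expression_rows = [
    i for i, kind in enumerate(feature_types) if kind == "Gene Expression"
]

print(dict(type_counts))
print(gene_expression_rows)
```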

DanieleMuraro commented 4 years ago

Hi @sjfleming,

Thank you very much for taking the time to get back to me. I managed to run cellbender using raw_feature_bc_matrix.h5; thank you so much! :-)

I am sharing the output plots from running cellbender on the same data as in my previous posts, once with the master version (which only looks at the features denoted as "Gene Expression") and once with the branch called sf_removebkg_v2.1 (where all of the "features" are kept, not just Gene Expression).

With the master version, the cell probability along the UMI curve looks like a step function. With sf_removebkg_v2.1, it shows a few peaks in cell probability in the region where most barcodes are associated with background. I am not sure why this happens. Thanks again for your help!

erica_ipcs_cellbender_sf_removebkg_v2.1.pdf erica_ipcs_cellbender_master.pdf

sjfleming commented 4 years ago

Hi @DanieleMuraro,

Glad you got the code to run! And good to hear that all the features are included when using the v2.1 branch of the code.

You are right... there are a few droplets way out there (which are obviously empty) where for some reason v2.1 seems to think they have some probability of having a cell. Maybe 5 of those droplets look like they would pass the > 0.5 cell probability threshold that is used to generate the "_filtered.h5" output file.

  1. I would definitely always suggest a cell QC step after CellBender anyway... I usually use a filter of something like at least 100 genes should be expressed, and some cutoffs on % mitochondrial reads (depending on how the dataset is generated) and a few other metrics. This will definitely eliminate those droplets in your case.
  2. I will be sure to look into this, and we'll try to make some tweaks that will eliminate those outliers...
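The QC step suggested in point 1 could look roughly like the sketch below. The 100-gene minimum comes from the comment above. The mitochondrial cutoff of 20% and the "MT-" gene-name prefix (the human naming convention) are placeholder assumptions to be tuned per dataset, and the function is a hypothetical helper rather than part of CellBender.

```python
def passes_qc(gene_counts, min_genes=100, max_mito_fraction=0.2, mito_prefix="MT-"):
    """Decide whether one droplet passes a basic post-CellBender QC filter.

    gene_counts maps gene name to its count in this droplet.
    """
    total = sum(gene_counts.values())
    if total == 0:
        return False  # nothing detected at all
    n_genes = sum(1 for count in gene_counts.values() if count > 0)
    mito = sum(c for gene, c in gene_counts.items() if gene.startswith(mito_prefix))
    return n_genes >= min_genes and mito / total <= max_mito_fraction


# Toy droplets: one healthy cell, one near-empty droplet, one with high mito content.
healthy = {f"GENE{i}": 2 for i in range(150)}
healthy["MT-CO1"] = 10

near_empty = {f"GENE{i}": 1 for i in range(5)}

high_mito = {f"GENE{i}": 1 for i in range(120)}
high_mito["MT-CO1"] = 200

results = [passes_qc(d) for d in (healthy, near_empty, high_mito)]
print(results)
```

A droplet that is obviously empty would have far fewer than 100 detected genes, so it would be dropped by the first condition even if its cell probability exceeded 0.5.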