10XGenomics / vartrix

Single-Cell Genotyping Tool
MIT License

Cell number differs between input and output #54

Closed MagpiePKU closed 3 years ago

MagpiePKU commented 3 years ago

Hi,

We ran vartrix as follows: $vartrix_dir/vartrix_linux --cell-barcodes $input_barcode --fasta /gpfs/genomedb/cellranger/refdata-cellranger-atac-GRCh38-1.2.0/fasta/genome.fa --bam $input_bam --vcf $output_name.candidate.snp.hg38.vcf --threads nproc --scoring-method coverage --out-matrix $output_name.alt.mtx --ref-matrix $output_name.ref.mtx

The input is a possorted_bam.bam produced by cellranger-atac, and the file outs/filtered_peak_bc_matrix/barcodes.tsv is provided as the barcode file.

We found that the output matrix always contains one cell fewer than the barcode file, which is very confusing. This seems to be an issue specific to scATAC output.

Thanks a lot in advance.

pmarks commented 3 years ago

Thanks for the report -- one thing that would be helpful in debugging: is it always the last barcode in the barcode.tsv file that's missing, or is it a random barcode?

pmarks commented 3 years ago

There's a log message emitted at the start of the run that says "Loaded X barcodes" - does that contain the right number?

If not, there are probably duplicate barcodes in the tsv file. This could be due to a bug in cellranger-atac, or some issue in post-processing the barcode list.
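One way to test the duplicate-barcode hypothesis is to count how often each barcode occurs in the list. The following is a minimal sketch, not part of vartrix: the function name and the example barcodes are hypothetical, and stripping whitespace before comparing is an assumption that may not match how vartrix itself parses the file.

```python
from collections import Counter
import io


def find_duplicate_barcodes(lines):
    """Return {barcode: count} for every barcode that appears more than once."""
    # Strip whitespace and skip blank lines so a trailing newline is not counted.
    barcodes = [line.strip() for line in lines if line.strip()]
    counts = Counter(barcodes)
    return {barcode: n for barcode, n in counts.items() if n > 1}


# Hypothetical barcode list; in practice pass an open file handle,
# e.g. open("barcodes.tsv").
example = io.StringIO("AAAC-1\nAAAG-1\nAAAC-1\nAAAT-1\n")
duplicates = find_duplicate_barcodes(example)
print(duplicates)
```

If the returned dictionary is non-empty, the number of unique barcodes is smaller than the number of lines in the file, which would be consistent with a lower "Loaded X barcodes" count.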

MagpiePKU commented 3 years ago

Thanks for the reply. We did not see any log output during the execution (which is another thing that adds to the problem).

We tested various BAM files and found that the number of cells in the output varied but was always lower than the number of input barcodes. When the input VCF contained not only the somatic variants but also germline variants (which should be common across cells), the cell numbers appeared to be correct. It seems as if cells are being actively removed so that the last column of the matrix does not sum to 0.
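To separate what the output file declares from what a downstream reader reports, the size line of the Matrix Market file can be read directly. This is a minimal sketch: the function name and the example matrix are hypothetical, and it assumes the output is a Matrix Market coordinate file with variants as rows and cells as columns.

```python
import io


def matrix_market_shape(lines):
    """Return (rows, cols, nonzeros) from the size line of a Matrix Market file."""
    for line in lines:
        # Lines starting with '%' are the banner or comments; skip them and blanks.
        if line.startswith("%") or not line.strip():
            continue
        rows, cols, nnz = (int(field) for field in line.split()[:3])
        return rows, cols, nnz
    raise ValueError("no size line found")


# Hypothetical matrix: 3 variants x 4 cells with 2 stored entries.
example_mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "% hypothetical example\n"
    "3 4 2\n"
    "1 1 2\n"
    "3 4 1\n"
)
n_variants, n_cells, n_entries = matrix_market_shape(example_mtx)
print(n_variants, n_cells, n_entries)
```

The size line states the dimensions regardless of whether the last column holds any non-zero entry, so comparing its column count with the number of lines in the barcode file shows whether the file itself or the tool reading it is one cell short.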

omansn commented 3 years ago

Hi, I just wanted to chime in and say that this is also an issue for me. I've run into this problem a few times, but cannot figure out what is happening. If I set the logging to debug mode, it seems like vartrix isn't reading all of the barcodes in the barcode.tsv file. For example, I have a list of 29065 unique cell barcodes and vartrix debug reports

02:18:07 [DEBUG] vartrix: Loaded 28752 barcodes

It would be hard to know which of these barcodes are being used without some serious regex of the debug output. I think it would be very helpful if there were an option to output barcodes, just as vartrix outputs variants with --out-variants. I always feel a little iffy about assuming the barcodes are in the order of my input file.

pmarks commented 3 years ago

@MagpiePKU my apologies, if you run with --log-level debug you'll get a message right at the start about how many barcodes were loaded.

@omansn or @MagpiePKU I'm perplexed how the number of barcodes loaded can be less than the number of unique barcodes in the file. The code that loads the barcodes is very simple. Would either of you be able to share the barcode files you've been using?

omansn commented 3 years ago

Hi Patrick, Thanks for the fast reply. Here is one barcode file that causes the issue. All barcodes end in -1 and all are unique. Vartrix loads ~100 fewer than what is in the file (sorry, I don't have the exact number on hand). I changed the .tsv extension to .txt because that is what GitHub wanted.

barcodes.txt

pmarks commented 3 years ago

@omansn - I'm getting the full number of barcodes in the initial log message and in the output matrix using your barcode list. I'm perplexed. Can you tell me what platform you're on, and what you get from vartrix --version?

omansn commented 3 years ago

I was running 1.1.14 but I just tried again with 1.1.16 and the problem still occurs.

Just to give more information and narrow your debugging space: This is a problem on two datasets out of ~10. One is a merged dataset (two 10x chips) where there were some barcode collisions. I tried vartrix with both the -1 and -2 designations on the barcodes between runs, but I wasn't sure how vartrix handles these, so I also ran it after subsetting the BAM and barcode file to contain only unique barcodes, all ending in -1. Neither fixed the issue. The second dataset is a single 10x run, so all barcodes end in -1 and they are all unique. I'm not sure if this is a coincidence, but both datasets have >20k barcodes.

Thanks again for your time! Vartrix has been invaluable to my research.

P.S. Your Cargo files in 1.1.16 still say "version 1.1.14", so vartrix --version displays the incorrect version.

pmarks commented 3 years ago

Ok I added a --out-barcodes flag that makes vartrix write out a separate file with the set of barcodes it actually used. @omansn @MagpiePKU can you run with this flag and post a diff between the input barcodes file and the new output barcodes file? That should help us figure out what's going on.
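The requested comparison could be sketched roughly as below. The function name and the example barcodes are hypothetical, and the sketch assumes that both the input list and the file written by --out-barcodes contain one barcode per line.

```python
def compare_barcode_lists(input_barcodes, output_barcodes):
    """Summarise how an input barcode list differs from the list actually used."""
    input_clean = [b.strip() for b in input_barcodes if b.strip()]
    output_clean = [b.strip() for b in output_barcodes if b.strip()]
    input_set, output_set = set(input_clean), set(output_clean)
    return {
        "n_input": len(input_clean),
        "n_output": len(output_clean),
        # Barcodes present in the input but absent from the output, in input order.
        "missing": [b for b in dict.fromkeys(input_clean) if b not in output_set],
        # Barcodes in the output that never appeared in the input.
        "unexpected": [b for b in output_clean if b not in input_set],
        # True only if both lists have identical content in identical order.
        "same_order": input_clean == output_clean,
    }


# Hypothetical lists; in practice read them from the two files.
report = compare_barcode_lists(
    ["AAAC-1", "AAAG-1", "AAAC-1", "AAAT-1"],
    ["AAAC-1", "AAAG-1", "AAAT-1"],
)
print(report)
```

In this example no barcode is missing or unexpected, yet the counts differ and the order check fails, which is the pattern a duplicated entry in the input would produce.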

omansn commented 3 years ago

Hi @pmarks,

I figured out the issue and it is totally my fault. There were in fact duplicated barcode entries (both ending in -1). I'm not sure how I missed this the first time I checked. Sorry for taking up your time on this since it was a user error. Regardless, the --out-barcodes flag will be very useful for sanity checks in the future.

Thank you! Nathan

pmarks commented 3 years ago

@omansn glad we got to the bottom of it! @MagpiePKU please reopen if you're still having trouble.