GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License
22 stars 4 forks source link

The error occurred: bam shouldn't be empty #12

Closed chuiqin closed 1 year ago

chuiqin commented 1 year ago

Thank you for developing such a useful tool. But in the process of running, I encountered such an error, I hope to get your help. Thanks! When I execute scsnv pileup with the Cell Ranger BAM file The code:

/home/data/cq/scsnv/scsnv/build/src/scsnv pileup  -c -l V3 -i index_prefix  \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
-o sample/pileup -p /home/data/cq/ALL/04.barcode/JYC.txt.gz \
-t 4 -x 4 /home/data/cq/opt/JYC/outs/possorted_genome_bam.bam

Then, the error occurred:

(scsnv) /home/data/cq/scsnv/scsnv/build/src/scsnv pileup  -c -l V3 -i index_prefix  \
> -r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
> -o sample/pileup -p /home/data/cq/ALL/04.barcode/JYC.txt.gz \
> -t 4 -x 4 /home/data/cq/opt/JYC/outs/possorted_genome_bam.bam
[00:00:00] WARNING: Processing Cell Ranger without the duplicate flag enabled will produce a massive false negative rate for SNV calling!!!![00:00:00] Loading the genome
[00:00:09] Read 6991 passed barcodes
[00:00:09] Loading the transcriptome index
Splice site index built
bam shouldn't be empty
chuiqin commented 1 year ago

In addition, I also found a new problem. After I execute scsnv index, the folder /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/ only has the genome.fa file, but not the other five files mentioned in the issue (https://github.com/GWW/scsnv/issues/5).

genome.fa.amb
genome.fa.ann
genome.fa.bwt
genome.fa.pac
genome.fa.sa

Here is my code and result:

/home/data/cq/scsnv/scsnv/build/src/scsnv index \
-g /home/data/cq/opt/refdata-gex-GRCh38-2020-A/genes/genes.gtf \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
scsnv_index
(scsnv) /home/data/cq/scsnv/scsnv/build/src/scsnv index \
> -g /home/data/cq/opt/refdata-gex-GRCh38-2020-A/genes/genes.gtf \
> -r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
> scsnv_index
[00:00:00] Loading the GTF file
[00:00:06] Building Transcript Index
[00:00:10] Processed chr1 bases = 31,275,161, kept genes = 3,409 / 3,410 Kept transcripts = 17,411, skipped retained = 0, skipped NMD = 0, too short = 3 out of 17,414
[00:00:11] Processed chr10 bases = 12,800,184, kept genes = 1,394 / 1,394 Kept transcripts = 6,707, skipped retained = 0, skipped NMD = 0, too short = 1 out of 6,708
[00:00:13] Processed chr11 bases = 18,954,045, kept genes = 2,064 / 2,065 Kept transcripts = 11,960, skipped retained = 0, skipped NMD = 0, too short = 4 out of 11,964
[00:00:15] Processed chr12 bases = 18,352,787, kept genes = 1,928 / 1,928 Kept transcripts = 11,367, skipped retained = 0, skipped NMD = 0, too short = 2 out of 11,369
[00:00:16] Processed chr13 bases = 6,273,160, kept genes = 788 / 788 Kept transcripts = 3,484, skipped retained = 0, skipped NMD = 0, too short = 0 out of 3,484
[00:00:18] Processed chr14 bases = 11,524,175, kept genes = 1,365 / 1,474 Kept transcripts = 7,306, skipped retained = 0, skipped NMD = 0, too short = 110 out of 7,416
[00:00:19] Processed chr15 bases = 12,743,794, kept genes = 1,256 / 1,267 Kept transcripts = 7,187, skipped retained = 0, skipped NMD = 0, too short = 12 out of 7,199
[00:00:21] Processed chr16 bases = 14,770,254, kept genes = 1,649 / 1,649 Kept transcripts = 9,717, skipped retained = 0, skipped NMD = 0, too short = 2 out of 9,719
[00:00:23] Processed chr17 bases = 18,200,803, kept genes = 1,991 / 1,992 Kept transcripts = 12,104, skipped retained = 0, skipped NMD = 0, too short = 3 out of 12,107
[00:00:24] Processed chr18 bases = 6,631,816, kept genes = 781 / 781 Kept transcripts = 3,931, skipped retained = 0, skipped NMD = 0, too short = 2 out of 3,933
[00:00:25] Processed chr19 bases = 18,060,486, kept genes = 2,026 / 2,027 Kept transcripts = 12,402, skipped retained = 0, skipped NMD = 0, too short = 2 out of 12,404
[00:00:29] Processed chr2 bases = 25,841,656, kept genes = 2,535 / 2,541 Kept transcripts = 14,735, skipped retained = 0, skipped NMD = 0, too short = 10 out of 14,745
[00:00:30] Processed chr20 bases = 8,198,008, kept genes = 964 / 964 Kept transcripts = 4,841, skipped retained = 0, skipped NMD = 0, too short = 1 out of 4,842
[00:00:30] Processed chr21 bases = 4,350,176, kept genes = 555 / 555 Kept transcripts = 2,575, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2,575
[00:00:31] Processed chr22 bases = 7,288,564, kept genes = 897 / 901 Kept transcripts = 4,266, skipped retained = 0, skipped NMD = 0, too short = 4 out of 4,270
[00:00:33] Processed chr3 bases = 21,494,662, kept genes = 1,893 / 1,893 Kept transcripts = 12,568, skipped retained = 0, skipped NMD = 0, too short = 6 out of 12,574
[00:00:35] Processed chr4 bases = 14,579,565, kept genes = 1,533 / 1,533 Kept transcripts = 8,142, skipped retained = 0, skipped NMD = 0, too short = 0 out of 8,142
[00:00:37] Processed chr5 bases = 16,005,370, kept genes = 1,810 / 1,811 Kept transcripts = 9,508, skipped retained = 0, skipped NMD = 0, too short = 2 out of 9,510
[00:00:39] Processed chr6 bases = 16,639,350, kept genes = 1,825 / 1,827 Kept transcripts = 8,971, skipped retained = 0, skipped NMD = 0, too short = 4 out of 8,975
[00:00:41] Processed chr7 bases = 15,639,564, kept genes = 1,666 / 1,686 Kept transcripts = 9,198, skipped retained = 0, skipped NMD = 0, too short = 27 out of 9,225
[00:00:42] Processed chr8 bases = 12,817,841, kept genes = 1,495 / 1,495 Kept transcripts = 8,223, skipped retained = 0, skipped NMD = 0, too short = 2 out of 8,225
[00:00:44] Processed chr9 bases = 11,919,051, kept genes = 1,318 / 1,319 Kept transcripts = 6,337, skipped retained = 0, skipped NMD = 0, too short = 6 out of 6,343
[00:00:44] Processed chrM bases = 11,395, kept genes = 13 / 13 Kept transcripts = 13, skipped retained = 0, skipped NMD = 0, too short = 0 out of 13
[00:00:45] Processed chrX bases = 11,102,590, kept genes = 1,148 / 1,148 Kept transcripts = 5,596, skipped retained = 0, skipped NMD = 0, too short = 4 out of 5,600
[00:00:45] Processed chrY bases = 704,501, kept genes = 111 / 111 Kept transcripts = 345, skipped retained = 0, skipped NMD = 0, too short = 0 out of 345
[00:00:45] Processed KI270728.1 bases = 11,057, kept genes = 6 / 6 Kept transcripts = 8, skipped retained = 0, skipped NMD = 0, too short = 0 out of 8
[00:00:45] Processed KI270727.1 bases = 7,147, kept genes = 4 / 4 Kept transcripts = 5, skipped retained = 0, skipped NMD = 0, too short = 0 out of 5
[00:00:45] Processed GL000009.2 bases = 2,237, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed GL000194.1 bases = 3,778, kept genes = 2 / 2 Kept transcripts = 2, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2
[00:00:45] Processed GL000205.2 bases = 1,468, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed GL000195.1 bases = 3,663, kept genes = 2 / 2 Kept transcripts = 2, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2
[00:00:45] Processed GL000219.1 bases = 372, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed KI270734.1 bases = 8,012, kept genes = 3 / 3 Kept transcripts = 4, skipped retained = 0, skipped NMD = 0, too short = 0 out of 4
[00:00:45] Processed GL000213.1 bases = 4,131, kept genes = 1 / 1 Kept transcripts = 2, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2
[00:00:45] Processed GL000218.1 bases = 3,027, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed KI270731.1 bases = 2,404, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed KI270721.1 bases = 740, kept genes = 1 / 1 Kept transcripts = 1, skipped retained = 0, skipped NMD = 0, too short = 0 out of 1
[00:00:45] Processed KI270726.1 bases = 645, kept genes = 2 / 2 Kept transcripts = 2, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2
[00:00:45] Processed KI270711.1 bases = 10,594, kept genes = 1 / 1 Kept transcripts = 4, skipped retained = 0, skipped NMD = 0, too short = 0 out of 4
[00:00:45] Processed KI270713.1 bases = 1,341, kept genes = 2 / 2 Kept transcripts = 2, skipped retained = 0, skipped NMD = 0, too short = 0 out of 2
[00:00:45] Done bases = 336,239,574, kept genes = 36,443 out of 36,601 transcripts = 198,931, skipped retained = 0, skipped NMD = 0, too short = 207 out of 199,138
[00:00:45] Building the BWA-mem index
[bwa_index] Pack FASTA... 4.54 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=672479148, availableWord=854259186
[bwt_gen] Finished constructing BWT in 5 iterations.
[bwa_index] 669.61 seconds elapse.
[bwa_index] Update BWT... 2.64 sec
[bwa_index] Pack forward-only FASTA... 3.60 sec
[bwa_index] Construct SA from BWT and Occ... 81.79 sec
[00:13:31] Done!

Looking forward to your reply

GWW commented 1 year ago

Hi,

1) The bam file empty error is likely occurring because the barcode file you provided does not have a barcode that matches one in the Cell Ranger CB tag. Please check the bam file to make sure the CB tag format is the same as that in your barcode text file. For example, sometimes barcodes from Cell Ranger have -1 suffixes. Can you post a few lines of your barcode file and a few example bam records from your possorted bam file?

2) For your second issue the genome.fa.amb files need to be generated using bwa index. I didn't have scSNV automatically generate these as most groups have bwa indicies already for the full genome.

Gavin

chuiqin commented 1 year ago

Thanks for your answer. I double checked my data and finally found the source of the error. This is a sloppy mistake. In my data, my barcoe file is like this:

(scsnv) zless /home/data/cq/ALL/04.barcode/JYC.txt.gz|head
"barcode"
"AAACCCAAGCTCCCTT"
"AAACCCAAGGACTTCT"
"AAACCCACAAACCACT"
"AAACCCACAGATACTC"
"AAACCCAGTTCGAACT"
"AAACCCATCAGCGTCG"
"AAACCCATCGTTAGAC"
"AAACCCATCTGATTCT"

But in fact, the barcode file required by the scsnv package should be like this:

(scsnv) zless /home/data/cq/ALL/04.barcode/JYC.txt.gz|head
barcode
AAACCCAAGCTCCCTT
AAACCCAAGGACTTCT
AAACCCACAAACCACT
AAACCCACAGATACTC
AAACCCAGTTCGAACT
AAACCCATCAGCGTCG
AAACCCATCGTTAGAC
AAACCCATCTGATTCT
AAACGAAAGCCATGCC

Obviously, there are more quotation marks on the barcode string.

GWW commented 1 year ago

No problem. It's easy to mess up input files.

chuiqin commented 1 year ago

I'm sorry to bother you again. After I executed the scsnvmisc annotate command, 4 files (pileup_annotated.h5, pileup_barcodes.txt.gz, pileup_filtering.txt, pileup_passed_snvs.txt.gz) were generated in the sample folder. However, the sample folder in the tutorial has so many files:

sample/pileup_annotated.h5
sample/pileup_barcodes.txt.gz
sample/pileup.txt.gz
sample/pileup_barcode_matrices.h5
sample/cells.h5
sample/summary.h5
sample/snv_map.txt.gz
sample/snv_base_counts.txt.gz
sample/snv_reads.txt.gz
sample/barcode_bases.txt.gz

The code and result of executing the scsnvmisc annotate command are as follows:

(scsnv) scsnvmisc annotate -r /home/data/cq/scsnv/repeatrmsk.txt.gz \
> -d /home/data/cq/scsnv/1000GENOMES/1000GENOMES.txt.gz \
> -e /home/data/cq/scsnv/REDIportal_hg38.txt.gz \
> sample/pileup
Processing sample/pileup
Kept 397778 out of 415823
[ True True True ... True True True ] (415823,) (415823, 6991) (415823, 6991)
Parsing 1000 Genomes file
Annotating the SNVs
Found in 1000 Genomes: 0
Annotating repeats
Building NCLS for each reference
Querying each SNV
SNVs with a repeat masker overlap: 301853
Annotating edits
REDIportal 143148
(scsnv)
(scsnv) tree sample/
sample/
├── pileup_annotated.h5
├── pileup_barcode_matrices.h5
├── pileup_barcodes.txt.gz
├── pileup_filtering.txt
├── pileup_passed_snvs.txt.gz
└── pileup.txt.gz

0 directories, 6 files
(scsnv) ls -lh sample/
total 1.5G
-rw-rw-r-- 1 cq cq 110M Oct 20 16:19 pileup_annotated.h5
-rw-rw-r-- 1 cq cq 1.4G Oct 20 12:19 pileup_barcode_matrices.h5
-rw-rw-r-- 1 cq cq 109K Oct 20 12:19 pileup_barcodes.txt.gz
-rw-rw-r-- 1 cq cq 252 Oct 20 15:44 pileup_filtering.txt
-rw-rw-r-- 1 cq cq 7.7M Oct 20 15:48 pileup_passed_snvs.txt.gz
-rw-rw-r-- 1 cq cq 7.1M Oct 20 12:19 pileup.txt.gz

The complete execution code is as follows:

# Build the index
/home/data/cq/scsnv/scsnv/build/src/scsnv index \
-g /home/data/cq/opt/refdata-gex-GRCh38-2020-A/genes/genes.gtf \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
index_prefix/index_prefix
# scsnv pileup
/home/data/cq/scsnv/scsnv/build/src/scsnv pileup -c -l V3 -i index_prefix/index_prefix \
-r /home/data/cq/opt/refdata-gex-GRCh38-2020-A/fasta/genome.fa \
-o sample/pileup -p /home/data/cq/ALL/04.barcode/JYC.txt.gz \
-t 4 -x 4 /home/data/cq/opt/JYC/outs/possorted_genome_bam.bam
# annotate
scsnvmisc annotate -r /home/data/cq/scsnv/repeatrmsk.txt.gz \
-d /home/data/cq/scsnv/1000GENOMES/1000GENOMES.txt.gz \
-e /home/data/cq/scsnv/REDIportal_hg38.txt.gz \
sample/pileup

Am I missing any steps? Also, after checking 4 files (pileup_annotated.h5, pileup_barcodes.txt.gz, pileup_filtering.txt, pileup_passed_snvs.txt.gz), I can't make good use of the output of scsnv package. If I want to get a matrix (rows are cells and columns are the SNVs of this cell) which output file should I utilize? Can your help me again?

Supplement: This single-cell transcriptome-sequencing sample was an acute B-cell leukemia sample that was found to be 80% naive on flow cytometry.