Closed mlbfalchetti closed 1 year ago
Hi Marcelo, Thanks for using our tool. Here there are a couple of reasons to get this message:
There are not enough reads in the input bam file to reach the minimum of 5x required to get base counts in a genomic site (--min_dp 5 by default, which can be changed). To check that, could you please re-check the size of the input bam file, and maybe show a screenshot of the first few lines of the aligned reads (run:samtools view /data/SComatic/project/Step1_BamCellTypes/Sample.control.bam | head
)
The provided reference file is not the same as that used for the mapping. One typical problem with Hg38 references is the presence (or not) of the chromosome prefix (chr1 vs 1). Could you please check that this is not the case?
In that regard, and to take advantage of our human data analysis (described in the SComatic manuscript), we strongly recommend using the UCSC reference genome for both alignment and SComatic analysis (you can download it from here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz ). With that, you will be able to use the Panel Of Normals (PoN) and RNA-editing files that we generated for the last steps of the variant calling (also provided in github).
Finally, we have provided a toy example to test SComatic in your cluster/computer. You will find the instructions to run it here: (https://github.com/cortes-ciriano-lab/SComatic/blob/main/docs/SComaticExample.md) . Could you test it and tell me if you have any error messages?
I see that you have some warning messages and that you did use python 3.8, could you test as well with the conda environment that we suggested: conda create -n SComatic -c bioconda python=3.7 r-base=3.6.1 samtools datamash bedtools
Thanks, Fran
Hello, thanks for the replies. I noticed that the .bam files created by the SplitBamCellTypes.py function are empty... So I wonder if there is something wrong with my original .bam files. Follow their
samtools view /data/SComatic/project/G8_Aligned.sortedByCoord.out.bam | head
YWDT148286785 16 1 12668 0 60M1S * 0 0 AAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF, NH:i:5 HI:i:1 AS:i:59 nM:i:0
YWDT416821046 16 1 12670 0 61M * 0 0 GACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGA :F:F::FF,:FFFFF:F:FFFFFFF,FFFFFFF:FFF,FFFFFF:F,,FFF,FFFFFFF:, NH:i:5 HI:i:1 AS:i:60 nM:i:0
YWDT600879601 16 1 12709 0 61M * 0 0 CAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCT FFFFFFFFFFFFFFFFFFFFFFFF,,:::F:::FFFFFFFFFFFFFFFFFFFFFFFFFFF, NH:i:6 HI:i:1 AS:i:60 nM:i:0
YWDT614485717 16 1 12790 1 61M * 0 0 GTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF: NH:i:4 HI:i:1 AS:i:60 nM:i:0
RRTT651436642 16 1 12829 1 58M1S * 0 0 GATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCA FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:3 HI:i:1 AS:i:57 nM:i:0
Is there something wrong with this output?
Hi Marcelo, This bam does not seem to be suitable for SComatic. I cannot see the cell barcode information (CB tag with the cell barcode sequence). This CB tag is mandatory to know how to link the reads with their corresponding cells. How did you generate this bam? What protocol did you use?
If you check our example bam file with this command samtools view example_data/Example.scrnaseq.bam | head
, you will see how the CB tag should look like. This is the most common bam file format that you get when working with single-cell RNA-seq 10X data and CellRanger.
K00335:135:HNCC5BBXX:8:1225:27133:19038 16 chr10 29315234 3 58M533512N40M * 0 0 GCGCGAAGACTCACGCCTGTAATCCCAGCACTTTGGTAGGCCAAGGTGGGCGGATCACCTGAGGTCATCAGTTCGAGACCAGCCTGACCAACATGGAG ---<7--A7--7A-A-AFF-7-A--FJAFFFJA7FA<JJ7JFF77-7--<FAAFA----FAA-FF77--FJAA<-JJAAA---FJF-F7--77-<<-A CB:Z:CATCGAAGTTGTACAC-1 UB:Z:TCTCTTTCGA RE:A:I RG:Z:1 NH:i:2 HI:i:1 nM:i:5 CR:Z:CATCGAAGTTGTACAC UR:Z:TCTCTTTCGA AS:i:83 CY:Z:AAFAFJJJJJJJJFJF UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2110:26819:10721 0 chr10 29400586 255 34M221400N64M * 0 0 CATGCCTGTAATCATAACACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCT <<<AA-<7<FJJ<JJJJJJJF77<A<JJA-AJFA7<AA<AFFFJFFJFJJJFJJAFJJJJJJFJJJJJFJJ7FAFFJJJJJFJJJJFFJJFF<AJJJF CB:Z:GACAGAGAGAAACCAT-1 UB:Z:CTTGTTTAAG RE:A:I RG:Z:1 NH:i:1 HI:i:1 nM:i:1 CR:Z:GACAGAGAGAAACCAT UR:Z:CTTGTTTAAG AS:i:92 CY:Z:AAFFFJJJJJJJJJJJ UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2228:29477:5200 16 chr10 29458236 255 10M146108N88M * 0 0 CGTCTCCCCCTGTTTTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTATTTTTAGTAAGAGACAGGGTTTTGCCACGTTGGCCAGGCTGGTCTC --7--A<---7--7-----FJFJJJJJJJAAJJAJJJJJFFJJJFJJFJJJFJJJJJJJJJJJFJJJFJJJJJJJFJFFJJJJJJFF7A<JJ<AF<<< CB:Z:CCCAATCGTAATCACC-1 UB:Z:AAGTGGGCTG RE:A:N RG:Z:1 NH:i:1 HI:i:1 nM:i:8 CR:Z:CCCAATCGTAATCACC UR:Z:AAGTGGGCTG AS:i:78 CY:Z:AAFFFJJJJJJJJJJJ UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2211:23805:42987 0 chr10 29481132 255 28S34M233580N36M * 0 0 GCAGTGGTATCAACGCAGAGTACATGGGGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGCTCAGGAGTTCAAGACCAGCCTTGGGAACATGGTG <<FF-FF<JFFJJJFFJJFJJJFJJJJJJJJ<AAF-FAF7F7FAF7F<JFAAJJJFFFJJJJJF-AAJJAF--7FJ77--AF-A--AJJ-777-<A7F CB:Z:GGACATTCACTTCTGC-1 UB:Z:CCAACCAACC RE:A:N RG:Z:1 NH:i:1 HI:i:1 nM:i:0 CR:Z:GGACATTCACTTCTGC UR:Z:CCAACCAACC AS:i:58 CY:Z:AAAFFJJJJJJJJJJJ UY:Z:JJJJJJJJJJ xf:i:0 ts:i:28
K00335:135:HNCC5BBXX:8:1218:19421:43937 16 chr10 29481134 3 1S52M248265N45M * 0 0 TGTTTTTGTGTGTGTGTGTGTGTGTTTGTGTGTGTTTGTTTGTGTGTGTGTTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG A7-----FFA-7--7-77FA7---7-777FA<<----A7-7---7-7-7---7-----AF7FJFJAJAFFFJJJFAJ<JJFFJJJJJJJFJJJFFF<A UB:Z:CCGCAATAAT RE:A:N RG:Z:1 NH:i:2 HI:i:1 nM:i:5 CR:Z:ACTACTGTGAGGGAGA UR:Z:CCGCAATAAT AS:i:83 CY:Z:AAFFFJJJJJJJJJJJ UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:1217:25905:47858 0 chr10 29498130 3 14S51M123833N33M * 0 0 TGAAAATAAACGTCTATGGGGGCTGGGCACAGTGGATCAAGACTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATC A---<A<FF--A-AA-A<AF<A7-77-AAAAA--7-7AJ7--<-7A<J<<--7F--7-7A-F-77FJ7AJJF7A-<-7A--AFJ7-<J-7A<7AJF-- CB:Z:GCTTCCAGTACGCACC-1 UB:Z:GCTTAGCTGG RE:A:N RG:Z:1 NH:i:2 HI:i:1 nM:i:3 CR:Z:GCTTCCAGTACGCACC UR:Z:GCTTAGCTGG AS:i:74 CY:Z:-AFFFJJJJAJJJJJJ UY:Z:JJ<FAJFJJ< xf:i:0
K00335:135:HNCC5BBXX:8:2226:28706:27742 16 chr10 29498149 255 1S51M126120N46M * 0 0 GGGCTCACGCCTATAATCCCCGTACTTTGGGAAGTCGAAGCGGGTGTATCACCTGAGGTCAGGAGTACAAGACCAGCCTGGCCAACATGGCGAAACCC -77FFA<A-<7-77-7---7--JJJF7A7---7<-A---AFF7777-FF<F-AF-<A-7-<--7---JAFJJFJ<JJJAJ<<7F<<<A-A7-A7---< CB:Z:GAACCTAAGTAGATGT-1 UB:Z:CACAGCTAGA RE:A:N RG:Z:1 NH:i:1 HI:i:1 nM:i:7 CR:Z:GAACCTAAGTAGATGT UR:Z:CACAGCTAGA AS:i:79 CY:Z:AAFFFJFJJJFFAJJ< UY:Z:FJFFJJJFFJ xf:i:0
K00335:135:HNCC5BBXX:8:2112:8826:48474 16 chr10 29502676 255 31M226744N40M27S * 0 0 TGTGTGTGTGTGGGTGGGTGTGTGGGTGTGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCATGTACTCTGCGTTGATACCACTG A<-FF7A-A7-7-F<7-777-A<A-FF7-A7A-FA<FAA-AF-JA<-F---JFJFJFJFJFJFF7A<AAFFJJJF-J-FA7F-F7F<-J<FFJAF<<A UB:Z:AGTACATGGT RE:A:N RG:Z:1 NH:i:1 HI:i:1 nM:i:3 CR:Z:AGTGGTATCAACGCAG UR:Z:AGTACATGGT AS:i:61 CY:Z:AAAFA-AJJ<JJAJJA UY:Z:A7F<F7F<7- xf:i:0 ts:i:27
K00335:135:HNCC5BBXX:8:2219:31690:4462 0 chr10 29509615 3 16M209261N82M * 0 0 GAGGCAGGTGGATCACCTGAGGTCAAGAGTTTAAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTCCTAAAAGTACAAAAATTAGCTGGGCGTG -<-7FJ-<A<-FAF7AJ-7FF--A--<<----<-<F7F-F<AFJ7A---77-7F7-A<-<--77-7-FF-A---7<7-7-FA<FJ-A<FAAFAFJ<77 CB:Z:GAGGTGAAGAATCTCC-1 UB:Z:GGCACTATTA RE:A:N RG:Z:1 NH:i:2 HI:i:1 nM:i:6 CR:Z:GAGGTGAAGAATCTCC UR:Z:GGCACTATTA AS:i:82 CY:Z:A<<FFFF-FAJJFFAF UY:Z:7JFF<JFJJJ xf:i:0
K00335:135:HNCC5BBXX:8:1214:15950:33000 0 chr10 29528444 3 78M132428N20M * 0 0 AACAACAAAACTAGGCTGGGTACAGTGGCTCACGCCTGTAGCTCCAGCACTTTGGGAGGCTGAGGTGGGAGGATCGCTTGAGGTCAAGAGATCAAGAC A<FFFJJJJJJJJF7FAA7AJJJAFAJFJFJJJJJJJFJFJJFJJFAJJJAFJJJFJJJJJJJJJFFJJFJJ<FJFJFJFJJJJJJJJJJJJJJJJFF CB:Z:CACCAGGAGGCGACAT-1 UB:Z:TTAGTAAGGT RE:A:N RG:Z:1 NH:i:2 HI:i:1 nM:i:0 CR:Z:CACCAGGAGGCGACAT UR:Z:TTAGTAAGGT AS:i:86 CY:Z:AAFFFJJJJJJJJJJJ UY:Z:JJJJJJJJJJ xf:i:0
Cheers, Fran
Thank you again!
Our data was sequenced in Drop-Seq, so we couldn't use CellRanger to generate the files... Do you know of a way to use Drop-Seq files so that the .bam files are suitable for SComatic?
Cheers, Marcelo
The only extra thing SComatic requires is to have the CB tag in the bam. Then, you would need (you could use for instance the python library pysam) to add the CB tag with the cell barcode sequence to each read in the bam file. Cheers, Fran
I have the same issue with bam files are empty and I have a CB tag... An example read in my file:
A00269:184:HHLFVDMXX:2:1228:23791:21652 256 chr1 11291 0 91M * 0 0 TGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF,FFF:FFF,FFFFF:,:FFFFFFFFFFFF CR:Z:TGATGGTGTATAGGAT CY:Z:FFFFFFFFFFFFFFFF GX:Z:- GN:Z:- UR:Z:CGAAAAATGAAA UY:Z:FFFFFFFFFFFF CB:Z:TGATGGTGTATAGGAT UB:Z:CGAAAAATGAAA
Dear user, What aligner did you use in your pipeline? And what single-cell protocol did you use?
The SplitBamCellTypes.py script assumes that the aligner provides the CB (cell barcode), nM (number of mismatches) and NH (number of hits). The first one is used to get the cell barcode, and the last two tags are used to remove low-quality reads (reads with too many mismatches, or mapping to more than one genomic location).
It seems that your alignment pipeline did not include the nM and NH tags. In the current version, nM and NH are required to evaluate each read and remove low-qual reads (when these tags are not present, the read is labelled automatically as a low-quality read). However, I will update this part of the script (in the coming days) to allow the user to run the tool without these tags (nM and NH). Unfortunately, the CB tab is and will be mandatory.
Cheers, Fran
I used STARSolo, but I didn't turn on tags for nM and NH, will rerun with those tags.
Thanks, Chang
Hi people,
We have updated the scripts/SplitBam/SplitBamCellTypes.py script to permit SComatic to work without nM and NH tags. However, we strongly suggest using the _--maxnM 5 and _--maxNH 1 parameters whenever possible to remove low-quality reads and achieve similar performances as described in the SComatic manuscript. In addition, the new script version provides a more detailed report file (*.report.txt) with the number of PASS and filtered reads, as well as the reason why the reads were filtered out.
Thanks for all your feedback, Fran
Hello, I am encountering this problem when I run "BaseCellCounter.py". Can you guys help me?
-- Code
python /data/SComatic/scripts/BaseCellCounter.py --bam /data/SComatic/project/Step1_BamCellTypes/Sample.control.bam \ --ref /data/SComatic/project/GRCh38_latest_genomic.fna \ --chrom all \ --out_folder /data/SComatic/project/Step2_BaseCellCounts \ --min_bq 30 \ --tmp_dir /data/SComatic/project/Step2_BaseCellCounts/temp_control \ --nprocs 30
-- Output
Outfile: /data/SComatic/project/Step2_BaseCellCounts/Sample.control.tsv
Directory /data/SComatic/project/Step2_BaseCellCounts/temp_control already exists
/home/marcelo/.local/share/r-miniconda/envs/r-reticulate/lib/python3.8/subprocess.py:853: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used self.stderr = io.open(errread, 'rb', bufsize) No temporary files found Computation time: 9 seconds