cortes-ciriano-lab / SComatic

A tool for detecting somatic variants in single cell data
Other
151 stars 24 forks source link

No temporary files found in BaseCellCounter.py #2

Closed mlbfalchetti closed 1 year ago

mlbfalchetti commented 1 year ago

Hello, I am encountering this problem when I run "BaseCellCounter.py". Can you guys help me?

-- Code

python /data/SComatic/scripts/BaseCellCounter.py --bam /data/SComatic/project/Step1_BamCellTypes/Sample.control.bam \ --ref /data/SComatic/project/GRCh38_latest_genomic.fna \ --chrom all \ --out_folder /data/SComatic/project/Step2_BaseCellCounts \ --min_bq 30 \ --tmp_dir /data/SComatic/project/Step2_BaseCellCounts/temp_control \ --nprocs 30

-- Output

Outfile: /data/SComatic/project/Step2_BaseCellCounts/Sample.control.tsv

Directory /data/SComatic/project/Step2_BaseCellCounts/temp_control already exists

/home/marcelo/.local/share/r-miniconda/envs/r-reticulate/lib/python3.8/subprocess.py:853: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used self.stderr = io.open(errread, 'rb', bufsize) No temporary files found Computation time: 9 seconds

Francesc-Muyas commented 1 year ago

Hi Marcelo, Thanks for using our tool. Here there are a couple of reasons to get this message:

  1. There are not enough reads in the input bam file to reach the minimum of 5x required to get base counts in a genomic site (--min_dp 5 by default, which can be changed). To check that, could you please re-check the size of the input bam file, and maybe show a screenshot of the first few lines of the aligned reads (run:samtools view /data/SComatic/project/Step1_BamCellTypes/Sample.control.bam | head )

  2. The provided reference file is not the same as that used for the mapping. One typical problem with Hg38 references is the presence (or not) of the chromosome prefix (chr1 vs 1). Could you please check that this is not the case?

In that regard, and to take advantage of our human data analysis (described in the SComatic manuscript), we strongly recommend using the UCSC reference genome for both alignment and SComatic analysis (you can download it from here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz ). With that, you will be able to use the Panel Of Normals (PoN) and RNA-editing files that we generated for the last steps of the variant calling (also provided in github).

  1. Finally, we have provided a toy example to test SComatic in your cluster/computer. You will find the instructions to run it here: (https://github.com/cortes-ciriano-lab/SComatic/blob/main/docs/SComaticExample.md) . Could you test it and tell me if you have any error messages?

  2. I see that you have some warning messages and that you did use python 3.8, could you test as well with the conda environment that we suggested: conda create -n SComatic -c bioconda python=3.7 r-base=3.6.1 samtools datamash bedtools

Thanks, Fran

mlbfalchetti commented 1 year ago

Hello, thanks for the replies. I noticed that the .bam files created by the SplitBamCellTypes.py function are empty... So I wonder if there is something wrong with my original .bam files. Follow their

samtools view /data/SComatic/project/G8_Aligned.sortedByCoord.out.bam | head

YWDT148286785   16      1       12668   0       60M1S   *       0       0       AAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAT     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF,   NH:i:5  HI:i:1  AS:i:59 nM:i:0
YWDT416821046   16      1       12670   0       61M     *       0       0       GACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGA     :F:F::FF,:FFFFF:F:FFFFFFF,FFFFFFF:FFF,FFFFFF:F,,FFF,FFFFFFF:,   NH:i:5  HI:i:1  AS:i:60 nM:i:0
YWDT600879601   16      1       12709   0       61M     *       0       0       CAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCT     FFFFFFFFFFFFFFFFFFFFFFFF,,:::F:::FFFFFFFFFFFFFFFFFFFFFFFFFFF,   NH:i:6  HI:i:1  AS:i:60 nM:i:0
YWDT614485717   16      1       12790   1       61M     *       0       0       GTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCT     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:   NH:i:4  HI:i:1  AS:i:60 nM:i:0
RRTT651436642   16      1       12829   1       58M1S   *       0       0       GATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCCTCCA       FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     NH:i:3  HI:i:1  AS:i:57 nM:i:0

Is there something wrong with this output?

Francesc-Muyas commented 1 year ago

Hi Marcelo, This bam does not seem to be suitable for SComatic. I cannot see the cell barcode information (CB tag with the cell barcode sequence). This CB tag is mandatory to know how to link the reads with their corresponding cells. How did you generate this bam? What protocol did you use?

If you check our example bam file with this command samtools view example_data/Example.scrnaseq.bam | head, you will see how the CB tag should look like. This is the most common bam file format that you get when working with single-cell RNA-seq 10X data and CellRanger.

K00335:135:HNCC5BBXX:8:1225:27133:19038 16  chr10   29315234    3   58M533512N40M   *   0   0   GCGCGAAGACTCACGCCTGTAATCCCAGCACTTTGGTAGGCCAAGGTGGGCGGATCACCTGAGGTCATCAGTTCGAGACCAGCCTGACCAACATGGAG  ---<7--A7--7A-A-AFF-7-A--FJAFFFJA7FA<JJ7JFF77-7--<FAAFA----FAA-FF77--FJAA<-JJAAA---FJF-F7--77-<<-A  CB:Z:CATCGAAGTTGTACAC-1 UB:Z:TCTCTTTCGA RE:A:I  RG:Z:1  NH:i:2  HI:i:1  nM:i:5  CR:Z:CATCGAAGTTGTACAC   UR:Z:TCTCTTTCGA AS:i:83 CY:Z:AAFAFJJJJJJJJFJF   UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2110:26819:10721 0   chr10   29400586    255 34M221400N64M   *   0   0   CATGCCTGTAATCATAACACTTTGGGAGGCCGAGGTGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTCTCT  <<<AA-<7<FJJ<JJJJJJJF77<A<JJA-AJFA7<AA<AFFFJFFJFJJJFJJAFJJJJJJFJJJJJFJJ7FAFFJJJJJFJJJJFFJJFF<AJJJF  CB:Z:GACAGAGAGAAACCAT-1 UB:Z:CTTGTTTAAG RE:A:I  RG:Z:1  NH:i:1  HI:i:1  nM:i:1  CR:Z:GACAGAGAGAAACCAT   UR:Z:CTTGTTTAAG AS:i:92 CY:Z:AAFFFJJJJJJJJJJJ   UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2228:29477:5200  16  chr10   29458236    255 10M146108N88M   *   0   0   CGTCTCCCCCTGTTTTGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGTATTTTTAGTAAGAGACAGGGTTTTGCCACGTTGGCCAGGCTGGTCTC  --7--A<---7--7-----FJFJJJJJJJAAJJAJJJJJFFJJJFJJFJJJFJJJJJJJJJJJFJJJFJJJJJJJFJFFJJJJJJFF7A<JJ<AF<<<  CB:Z:CCCAATCGTAATCACC-1 UB:Z:AAGTGGGCTG RE:A:N  RG:Z:1  NH:i:1  HI:i:1  nM:i:8  CR:Z:CCCAATCGTAATCACC   UR:Z:AAGTGGGCTG AS:i:78 CY:Z:AAFFFJJJJJJJJJJJ   UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:2211:23805:42987 0   chr10   29481132    255 28S34M233580N36M    *   0   0   GCAGTGGTATCAACGCAGAGTACATGGGGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGCTCAGGAGTTCAAGACCAGCCTTGGGAACATGGTG  <<FF-FF<JFFJJJFFJJFJJJFJJJJJJJJ<AAF-FAF7F7FAF7F<JFAAJJJFFFJJJJJF-AAJJAF--7FJ77--AF-A--AJJ-777-<A7F  CB:Z:GGACATTCACTTCTGC-1 UB:Z:CCAACCAACC RE:A:N  RG:Z:1  NH:i:1  HI:i:1  nM:i:0  CR:Z:GGACATTCACTTCTGC   UR:Z:CCAACCAACC AS:i:58 CY:Z:AAAFFJJJJJJJJJJJ   UY:Z:JJJJJJJJJJ xf:i:0  ts:i:28
K00335:135:HNCC5BBXX:8:1218:19421:43937 16  chr10   29481134    3   1S52M248265N45M *   0   0   TGTTTTTGTGTGTGTGTGTGTGTGTTTGTGTGTGTTTGTTTGTGTGTGTGTTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG  A7-----FFA-7--7-77FA7---7-777FA<<----A7-7---7-7-7---7-----AF7FJFJAJAFFFJJJFAJ<JJFFJJJJJJJFJJJFFF<A  UB:Z:CCGCAATAAT RE:A:N  RG:Z:1  NH:i:2  HI:i:1  nM:i:5  CR:Z:ACTACTGTGAGGGAGA   UR:Z:CCGCAATAAT AS:i:83 CY:Z:AAFFFJJJJJJJJJJJ   UY:Z:JJJJJJJJJJ xf:i:0
K00335:135:HNCC5BBXX:8:1217:25905:47858 0   chr10   29498130    3   14S51M123833N33M    *   0   0   TGAAAATAAACGTCTATGGGGGCTGGGCACAGTGGATCAAGACTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATC  A---<A<FF--A-AA-A<AF<A7-77-AAAAA--7-7AJ7--<-7A<J<<--7F--7-7A-F-77FJ7AJJF7A-<-7A--AFJ7-<J-7A<7AJF--  CB:Z:GCTTCCAGTACGCACC-1 UB:Z:GCTTAGCTGG RE:A:N  RG:Z:1  NH:i:2  HI:i:1  nM:i:3  CR:Z:GCTTCCAGTACGCACC   UR:Z:GCTTAGCTGG AS:i:74 CY:Z:-AFFFJJJJAJJJJJJ   UY:Z:JJ<FAJFJJ< xf:i:0
K00335:135:HNCC5BBXX:8:2226:28706:27742 16  chr10   29498149    255 1S51M126120N46M *   0   0   GGGCTCACGCCTATAATCCCCGTACTTTGGGAAGTCGAAGCGGGTGTATCACCTGAGGTCAGGAGTACAAGACCAGCCTGGCCAACATGGCGAAACCC  -77FFA<A-<7-77-7---7--JJJF7A7---7<-A---AFF7777-FF<F-AF-<A-7-<--7---JAFJJFJ<JJJAJ<<7F<<<A-A7-A7---<  CB:Z:GAACCTAAGTAGATGT-1 UB:Z:CACAGCTAGA RE:A:N  RG:Z:1  NH:i:1  HI:i:1  nM:i:7  CR:Z:GAACCTAAGTAGATGT   UR:Z:CACAGCTAGA AS:i:79 CY:Z:AAFFFJFJJJFFAJJ<   UY:Z:FJFFJJJFFJ xf:i:0
K00335:135:HNCC5BBXX:8:2112:8826:48474  16  chr10   29502676    255 31M226744N40M27S    *   0   0   TGTGTGTGTGTGGGTGGGTGTGTGGGTGTGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTCCCATGTACTCTGCGTTGATACCACTG  A<-FF7A-A7-7-F<7-777-A<A-FF7-A7A-FA<FAA-AF-JA<-F---JFJFJFJFJFJFF7A<AAFFJJJF-J-FA7F-F7F<-J<FFJAF<<A  UB:Z:AGTACATGGT RE:A:N  RG:Z:1  NH:i:1  HI:i:1  nM:i:3  CR:Z:AGTGGTATCAACGCAG   UR:Z:AGTACATGGT AS:i:61 CY:Z:AAAFA-AJJ<JJAJJA   UY:Z:A7F<F7F<7- xf:i:0  ts:i:27
K00335:135:HNCC5BBXX:8:2219:31690:4462  0   chr10   29509615    3   16M209261N82M   *   0   0   GAGGCAGGTGGATCACCTGAGGTCAAGAGTTTAAGACCAGCCTGGCCAACATGGTGAAACCCTGTCTCTCCTAAAAGTACAAAAATTAGCTGGGCGTG  -<-7FJ-<A<-FAF7AJ-7FF--A--<<----<-<F7F-F<AFJ7A---77-7F7-A<-<--77-7-FF-A---7<7-7-FA<FJ-A<FAAFAFJ<77  CB:Z:GAGGTGAAGAATCTCC-1 UB:Z:GGCACTATTA RE:A:N  RG:Z:1  NH:i:2  HI:i:1  nM:i:6  CR:Z:GAGGTGAAGAATCTCC   UR:Z:GGCACTATTA AS:i:82 CY:Z:A<<FFFF-FAJJFFAF   UY:Z:7JFF<JFJJJ xf:i:0
K00335:135:HNCC5BBXX:8:1214:15950:33000 0   chr10   29528444    3   78M132428N20M   *   0   0   AACAACAAAACTAGGCTGGGTACAGTGGCTCACGCCTGTAGCTCCAGCACTTTGGGAGGCTGAGGTGGGAGGATCGCTTGAGGTCAAGAGATCAAGAC  A<FFFJJJJJJJJF7FAA7AJJJAFAJFJFJJJJJJJFJFJJFJJFAJJJAFJJJFJJJJJJJJJFFJJFJJ<FJFJFJFJJJJJJJJJJJJJJJJFF  CB:Z:CACCAGGAGGCGACAT-1 UB:Z:TTAGTAAGGT RE:A:N  RG:Z:1  NH:i:2  HI:i:1  nM:i:0  CR:Z:CACCAGGAGGCGACAT   UR:Z:TTAGTAAGGT AS:i:86 CY:Z:AAFFFJJJJJJJJJJJ   UY:Z:JJJJJJJJJJ xf:i:0

Cheers, Fran

mlbfalchetti commented 1 year ago

Thank you again!

Our data was sequenced in Drop-Seq, so we couldn't use CellRanger to generate the files... Do you know of a way to use Drop-Seq files so that the .bam files are suitable for SComatic?

Cheers, Marcelo

Francesc-Muyas commented 1 year ago

The only extra thing SComatic requires is to have the CB tag in the bam. Then, you would need (you could use for instance the python library pysam) to add the CB tag with the cell barcode sequence to each read in the bam file. Cheers, Fran

cnk113 commented 1 year ago

I have the same issue with bam files are empty and I have a CB tag... An example read in my file:

A00269:184:HHLFVDMXX:2:1228:23791:21652 256     chr1    11291   0       91M     *       0       0       TGCTGGCGCCGGGGCACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCCGCCTGCTGGCAGCTAGGGACATTGCAGGGTCCTCT     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,FFFFFFFFFFFFFF,FFF:FFF,FFFFF:,:FFFFFFFFFFFF        CR:Z:TGATGGTGTATAGGAT   CY:Z:FFFFFFFFFFFFFFFF   GX:Z:-  GN:Z:-  UR:Z:CGAAAAATGAAA       UY:Z:FFFFFFFFFFFF CB:Z:TGATGGTGTATAGGAT    UB:Z:CGAAAAATGAAA
Francesc-Muyas commented 1 year ago

Dear user, What aligner did you use in your pipeline? And what single-cell protocol did you use?

The SplitBamCellTypes.py script assumes that the aligner provides the CB (cell barcode), nM (number of mismatches) and NH (number of hits). The first one is used to get the cell barcode, and the last two tags are used to remove low-quality reads (reads with too many mismatches, or mapping to more than one genomic location).

It seems that your alignment pipeline did not include the nM and NH tags. In the current version, nM and NH are required to evaluate each read and remove low-qual reads (when these tags are not present, the read is labelled automatically as a low-quality read). However, I will update this part of the script (in the coming days) to allow the user to run the tool without these tags (nM and NH). Unfortunately, the CB tab is and will be mandatory.

Cheers, Fran

cnk113 commented 1 year ago

I used STARSolo, but I didn't turn on tags for nM and NH, will rerun with those tags.

Thanks, Chang

Francesc-Muyas commented 1 year ago

Hi people,

We have updated the scripts/SplitBam/SplitBamCellTypes.py script to permit SComatic to work without nM and NH tags. However, we strongly suggest using the _--maxnM 5 and _--maxNH 1 parameters whenever possible to remove low-quality reads and achieve similar performances as described in the SComatic manuscript. In addition, the new script version provides a more detailed report file (*.report.txt) with the number of PASS and filtered reads, as well as the reason why the reads were filtered out.

Thanks for all your feedback, Fran