CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

Dedup outputting empty BAM #562

Closed richardsalisbury closed 1 year ago

richardsalisbury commented 1 year ago

Dear Ian, I'm using umi-tools on some libraries I have created using smartseq 3. These have been created using low cell number input rather than single cell so I have demultiplexed to create 25 fastq files. Smartseq3 inserts a 11bp tag, followed by an 8bp UMI using the TSO primer. The cDNA is pre-amplified, cut up using Tn5 and then amplified using nextera index primers (dual index). Libraries were pooled and sequenced 75bp paired end with additional index reads so cell barcodes are therefore not in the processed fastq files. I ran trim_galore to filter adapters/low quality bp, then umi-tools extract to find reads containing the tag and extract UMIs: umi_tools extract --extract-method=regex --bc-pattern="^(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8}).*" \ -I /Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_1.fq.gz \ --read2-in=/Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_2.fq.gz \ --stdout=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_extract.fastq.gz \ --read2-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_extract.fastq.gz \ --filtered-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_filtered.fastq.gz \ --filtered-out2=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_filtered.fastq.gz \ -L UMI_extract_fastq/extract.log I then aligned the reads using STAR v.2.7.3a: STAR --runThreadN 4 --genomeDir /t1-data/project/meadlab/rsalisbu/DataBank/GRCh38_Gencode_STAR_2.7.3a/ \ --readFilesIn /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R1_extract.fastq.gz /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R2_extract.fastq.gz \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --outFilterMismatchNmax 999 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outFileNamePrefix /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/BAM/ \ --outSAMtype BAM SortedByCoordinate Indexed with samtools: samtools index -b ./A3M_Aligned.sorted.out.bam Then ran umi-tools dedup:

UMI-tools version: 1.1.2
output generated by dedup -I ./A3M_Aligned.sortedByCoord.out.bam --paired -S A3M_dedup.bam
job started at Mon Oct 24 17:10:51 2022 on wimmadmins-MacBook-Pro.local -- 41557ad0-8e17-492d-9485-7b8249a806b7
pid: 19498, system: Darwin 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan  5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000 x86_64
 assigned_tag                            : None
 cell_tag                                : None
 cell_tag_delim                          : None
 cell_tag_split                          : -
 chimeric_pairs                          : use
 chrom                                   : None
 compresslevel                           : 6
 detection_method                        : None
 filter_umi                              : None
 gene_tag                                : None
 gene_transcript_map                     : None
 get_umi_method                          : read_id
 ignore_tlen                             : False
 ignore_umi                              : False
 in_sam                                  : False
 log2stderr                              : False
 loglevel                                : 1
 mapping_quality                         : 0
 method                                  : directional
 no_sort_output                          : False
 out_sam                                 : False
 output_unmapped                         : False
 paired                                  : True
 per_cell                                : False
 per_contig                              : False
 per_gene                                : False
 random_seed                             : None
 read_length                             : False
 short_help                              : None
 skip_regex                              : ^(__|Unassigned)
 soft_clip_threshold                     : 4
 spliced                                 : False
 stats                                   : False
 stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
 stdin                                   : <_io.TextIOWrapper name='./A3M_Aligned.sortedByCoord.out.bam' mode='r' encoding='UTF-8'>
 stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
 stdout                                  : <_io.TextIOWrapper name='A3M_dedup.bam' mode='w' encoding='UTF-8'>
 subset                                  : None
 threshold                               : 1
 timeit_file                             : None
 timeit_header                           : None
 timeit_name                             : all
 tmpdir                                  : None
 umi_sep                                 : _
 umi_tag                                 : RX
 umi_tag_delim                           : None
 umi_tag_split                           : None
 umi_whitelist                           : None
 umi_whitelist_paired                    : None
 unmapped_reads                          : discard
 unpaired_reads                          : use
 whole_contig                            : False
 2022-10-24 17:10:51,978 WARNING Chimeric read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --chimeric-pairs==discard to discard these reads or --chimeric-pairs==output (group command only) to output them without grouping
 2022-10-24 17:10:51,978 WARNING Unpaired read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --unpared-reads==discard to discard these reads or --unpared-reads==output (group command only) to output them without grouping
 2022-10-24 17:10:51,978 INFO command: dedup -I ./A3M_Aligned.sortedByCoord.out.bam --paired -S A3M_dedup.bam
 Warning: The index file is older than the data file: ./A3M_Aligned.sortedByCoord.out.bam.bai
 Warning: The index file is older than the data file: ./A3M_Aligned.sortedByCoord.out.bam.bai
 2022-10-24 17:11:02,390 INFO Written out 100000 reads
 2022-10-24 17:11:04,648 INFO Written out 200000 reads
 2022-10-24 17:11:09,192 INFO Written out 300000 reads
 2022-10-24 17:11:12,577 INFO Written out 400000 reads
 2022-10-24 17:11:16,372 INFO Written out 500000 reads
 2022-10-24 17:11:19,610 INFO Written out 600000 reads
 2022-10-24 17:11:21,814 INFO Written out 700000 reads
 2022-10-24 17:11:26,313 INFO Written out 800000 reads
 2022-10-24 17:11:29,170 INFO Parsed 1000000 input reads
 2022-10-24 17:11:29,187 INFO Written out 900000 reads
 2022-10-24 17:11:32,976 INFO Written out 1000000 reads
 2022-10-24 17:11:36,336 INFO Written out 1100000 reads
 2022-10-24 17:11:38,612 INFO Written out 1200000 reads
 2022-10-24 17:11:42,794 INFO Written out 1300000 reads
 2022-10-24 17:11:45,534 INFO Written out 1400000 reads
 2022-10-24 17:11:49,139 INFO Written out 1500000 reads
 2022-10-24 17:11:52,712 INFO Written out 1600000 reads
 2022-10-24 17:11:55,055 INFO Written out 1700000 reads
 2022-10-24 17:11:59,239 INFO Written out 1800000 reads
 2022-10-24 17:11:59,292 INFO Parsed 2000000 input reads
 2022-10-24 17:12:01,632 INFO Written out 1900000 reads
 2022-10-24 17:12:05,510 INFO Written out 2000000 reads
 2022-10-24 17:12:08,810 INFO Written out 2100000 reads
 2022-10-24 17:12:12,570 INFO Written out 2200000 reads
 2022-10-24 17:12:14,222 INFO Searching for mates for 0 unmatched alignments
 2022-10-24 17:12:14,222 INFO 0 mates never found
 python: unrecognized option `--no-PG'
 Traceback (most recent call last):
   File "/Users/richard/opt/anaconda3/bin/umi_tools", line 33, in <module>
     sys.exit(load_entry_point('umi-tools==1.1.2', 'console_scripts', 'umi_tools')())
   File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/umi_tools-1.1.2-py3.9-macosx-10.9-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
     module.main(sys.argv)
   File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/umi_tools-1.1.2-py3.9-macosx-10.9-86_64.egg/umi_tools/dedup.py", line 378, in main
     pysam.sort("-o", sorted_out_name, "-O", sort_format, "--no-PG", out_name)
   File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/pysam/utils.py", line 69, in __call__
     raise SamtoolsError(
 pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=Usage: samtools sort [options...] [in.bam]\nOptions:\n  -l INT     Set compression level, from 0 (uncompressed) to 9 (best)\n  -m INT     Set maximum memory per thread; suffix K/M/G recognized [768M]\n  -n         Sort by read name\n  -o FILE    Write final output to FILE rather than standard output\n  -T PREFIX  Write temporary files to PREFIX.nnnn.bam\n  -@, --threads INT\n             Set number of sorting and compression threads [1]\n      --input-fmt-option OPT[=VAL]\n               Specify a single input file format option in the form\n               of OPTION or OPTION=VALUE\n  -O, --output-fmt FORMAT[,OPT[=VAL]]...\n               Specify output format (SAM, BAM, CRAM)\n      --output-fmt-option OPT[=VAL]\n               Specify a single output file format option in the form\n               of OPTION or OPTION=VALUE\n      --reference FILE\n               Reference sequence FASTA FILE [null]\n'

I always seem to get the index file is older than data file warning but this is definitely not true as I run samtools separately after STAR has completed. I've also tried running STAR to produce an unsorted BAM and then sorting/indexing with samtools but get the same error. A3M_dedup.bam gets created but does not contain anything. I've also tried runnin dedup with --ignore-umi but again get the same problem. I'm not sure if the dedup process is finishing completely. I'm running it locally on a macbook with 16GB RAM. Could it be crashing? Here's the head of A3M_Aligned.sortedByCoord.out.bam in case it is useful: head.txt Many thanks in advance, Richard

TomSmithCGAT commented 1 year ago

See https://github.com/CGATOxford/UMI-tools/issues/463. I believe you're hitting the same issue.