Dear Ian,
I'm using umi-tools on some libraries I have created using smartseq 3. These have been created using low cell number input rather than single cell so I have demultiplexed to create 25 fastq files. Smartseq3 inserts a 11bp tag, followed by an 8bp UMI using the TSO primer. The cDNA is pre-amplified, cut up using Tn5 and then amplified using nextera index primers (dual index). Libraries were pooled and sequenced 75bp paired end with additional index reads so cell barcodes are therefore not in the processed fastq files.
I ran trim_galore to filter adapters/low quality bp, then umi-tools extract to find reads containing the tag and extract UMIs:
umi_tools extract --extract-method=regex --bc-pattern="^(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8}).*" \ -I /Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_1.fq.gz \ --read2-in=/Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_2.fq.gz \ --stdout=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_extract.fastq.gz \ --read2-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_extract.fastq.gz \ --filtered-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_filtered.fastq.gz \ --filtered-out2=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_filtered.fastq.gz \ -L UMI_extract_fastq/extract.log
I then aligned the reads using STAR v.2.7.3a:
STAR --runThreadN 4 --genomeDir /t1-data/project/meadlab/rsalisbu/DataBank/GRCh38_Gencode_STAR_2.7.3a/ \ --readFilesIn /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R1_extract.fastq.gz /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R2_extract.fastq.gz \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --outFilterMismatchNmax 999 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outFileNamePrefix /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/BAM/ \ --outSAMtype BAM SortedByCoordinate
Indexed with samtools:
samtools index -b ./A3M_Aligned.sorted.out.bam
Then ran umi-tools dedup:
UMI-tools version: 1.1.2
output generated by dedup -I ./A3M_Aligned.sortedByCoord.out.bam --paired -S A3M_dedup.bam
job started at Mon Oct 24 17:10:51 2022 on wimmadmins-MacBook-Pro.local -- 41557ad0-8e17-492d-9485-7b8249a806b7
pid: 19498, system: Darwin 21.3.0 Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000 x86_64
assigned_tag : None
cell_tag : None
cell_tag_delim : None
cell_tag_split : -
chimeric_pairs : use
chrom : None
compresslevel : 6
detection_method : None
filter_umi : None
gene_tag : None
gene_transcript_map : None
get_umi_method : read_id
ignore_tlen : False
ignore_umi : False
in_sam : False
log2stderr : False
loglevel : 1
mapping_quality : 0
method : directional
no_sort_output : False
out_sam : False
output_unmapped : False
paired : True
per_cell : False
per_contig : False
per_gene : False
random_seed : None
read_length : False
short_help : None
skip_regex : ^(__|Unassigned)
soft_clip_threshold : 4
spliced : False
stats : False
stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin : <_io.TextIOWrapper name='./A3M_Aligned.sortedByCoord.out.bam' mode='r' encoding='UTF-8'>
stdlog : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout : <_io.TextIOWrapper name='A3M_dedup.bam' mode='w' encoding='UTF-8'>
subset : None
threshold : 1
timeit_file : None
timeit_header : None
timeit_name : all
tmpdir : None
umi_sep : _
umi_tag : RX
umi_tag_delim : None
umi_tag_split : None
umi_whitelist : None
umi_whitelist_paired : None
unmapped_reads : discard
unpaired_reads : use
whole_contig : False
2022-10-24 17:10:51,978 WARNING Chimeric read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --chimeric-pairs==discard to discard these reads or --chimeric-pairs==output (group command only) to output them without grouping
2022-10-24 17:10:51,978 WARNING Unpaired read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --unpared-reads==discard to discard these reads or --unpared-reads==output (group command only) to output them without grouping
2022-10-24 17:10:51,978 INFO command: dedup -I ./A3M_Aligned.sortedByCoord.out.bam --paired -S A3M_dedup.bam
Warning: The index file is older than the data file: ./A3M_Aligned.sortedByCoord.out.bam.bai
Warning: The index file is older than the data file: ./A3M_Aligned.sortedByCoord.out.bam.bai
2022-10-24 17:11:02,390 INFO Written out 100000 reads
2022-10-24 17:11:04,648 INFO Written out 200000 reads
2022-10-24 17:11:09,192 INFO Written out 300000 reads
2022-10-24 17:11:12,577 INFO Written out 400000 reads
2022-10-24 17:11:16,372 INFO Written out 500000 reads
2022-10-24 17:11:19,610 INFO Written out 600000 reads
2022-10-24 17:11:21,814 INFO Written out 700000 reads
2022-10-24 17:11:26,313 INFO Written out 800000 reads
2022-10-24 17:11:29,170 INFO Parsed 1000000 input reads
2022-10-24 17:11:29,187 INFO Written out 900000 reads
2022-10-24 17:11:32,976 INFO Written out 1000000 reads
2022-10-24 17:11:36,336 INFO Written out 1100000 reads
2022-10-24 17:11:38,612 INFO Written out 1200000 reads
2022-10-24 17:11:42,794 INFO Written out 1300000 reads
2022-10-24 17:11:45,534 INFO Written out 1400000 reads
2022-10-24 17:11:49,139 INFO Written out 1500000 reads
2022-10-24 17:11:52,712 INFO Written out 1600000 reads
2022-10-24 17:11:55,055 INFO Written out 1700000 reads
2022-10-24 17:11:59,239 INFO Written out 1800000 reads
2022-10-24 17:11:59,292 INFO Parsed 2000000 input reads
2022-10-24 17:12:01,632 INFO Written out 1900000 reads
2022-10-24 17:12:05,510 INFO Written out 2000000 reads
2022-10-24 17:12:08,810 INFO Written out 2100000 reads
2022-10-24 17:12:12,570 INFO Written out 2200000 reads
2022-10-24 17:12:14,222 INFO Searching for mates for 0 unmatched alignments
2022-10-24 17:12:14,222 INFO 0 mates never found
python: unrecognized option `--no-PG'
Traceback (most recent call last):
File "/Users/richard/opt/anaconda3/bin/umi_tools", line 33, in <module>
sys.exit(load_entry_point('umi-tools==1.1.2', 'console_scripts', 'umi_tools')())
File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/umi_tools-1.1.2-py3.9-macosx-10.9-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
module.main(sys.argv)
File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/umi_tools-1.1.2-py3.9-macosx-10.9-86_64.egg/umi_tools/dedup.py", line 378, in main
pysam.sort("-o", sorted_out_name, "-O", sort_format, "--no-PG", out_name)
File "/Users/richard/opt/anaconda3/lib/python3.9/site-packages/pysam/utils.py", line 69, in __call__
raise SamtoolsError(
pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=Usage: samtools sort [options...] [in.bam]\nOptions:\n -l INT Set compression level, from 0 (uncompressed) to 9 (best)\n -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]\n -n Sort by read name\n -o FILE Write final output to FILE rather than standard output\n -T PREFIX Write temporary files to PREFIX.nnnn.bam\n -@, --threads INT\n Set number of sorting and compression threads [1]\n --input-fmt-option OPT[=VAL]\n Specify a single input file format option in the form\n of OPTION or OPTION=VALUE\n -O, --output-fmt FORMAT[,OPT[=VAL]]...\n Specify output format (SAM, BAM, CRAM)\n --output-fmt-option OPT[=VAL]\n Specify a single output file format option in the form\n of OPTION or OPTION=VALUE\n --reference FILE\n Reference sequence FASTA FILE [null]\n'
I always seem to get the index file is older than data file warning but this is definitely not true as I run samtools separately after STAR has completed. I've also tried running STAR to produce an unsorted BAM and then sorting/indexing with samtools but get the same error. A3M_dedup.bam gets created but does not contain anything. I've also tried runnin dedup with --ignore-umi but again get the same problem.
I'm not sure if the dedup process is finishing completely. I'm running it locally on a macbook with 16GB RAM. Could it be crashing?
Here's the head of A3M_Aligned.sortedByCoord.out.bam in case it is useful:
head.txt
Many thanks in advance,
Richard
Dear Ian, I'm using umi-tools on some libraries I have created using smartseq 3. These have been created using low cell number input rather than single cell so I have demultiplexed to create 25 fastq files. Smartseq3 inserts a 11bp tag, followed by an 8bp UMI using the TSO primer. The cDNA is pre-amplified, cut up using Tn5 and then amplified using nextera index primers (dual index). Libraries were pooled and sequenced 75bp paired end with additional index reads so cell barcodes are therefore not in the processed fastq files. I ran trim_galore to filter adapters/low quality bp, then umi-tools extract to find reads containing the tag and extract UMIs:
umi_tools extract --extract-method=regex --bc-pattern="^(?P<discard_1>ATTGCGCAATG)(?P<umi_1>.{8}).*" \ -I /Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_1.fq.gz \ --read2-in=/Users/richard/SS3_analysis/Mono/fastq_trimmed/A3M_val_2.fq.gz \ --stdout=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_extract.fastq.gz \ --read2-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_extract.fastq.gz \ --filtered-out=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R1_filtered.fastq.gz \ --filtered-out2=/Users/richard/SS3_analysis/Mono/UMI_extract_fastq/A3M_R2_filtered.fastq.gz \ -L UMI_extract_fastq/extract.log
I then aligned the reads using STAR v.2.7.3a:STAR --runThreadN 4 --genomeDir /t1-data/project/meadlab/rsalisbu/DataBank/GRCh38_Gencode_STAR_2.7.3a/ \ --readFilesIn /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R1_extract.fastq.gz /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/umi_extract_fastq/A3M_R2_extract.fastq.gz \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --outFilterMismatchNmax 999 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outFileNamePrefix /t1-data/project/meadlab/rsalisbu/minibulk_multi/SS3_analysis/mono/BAM/ \ --outSAMtype BAM SortedByCoordinate
Indexed with samtools:samtools index -b ./A3M_Aligned.sorted.out.bam
Then ran umi-tools dedup:I always seem to get the index file is older than data file warning but this is definitely not true as I run samtools separately after STAR has completed. I've also tried running STAR to produce an unsorted BAM and then sorting/indexing with samtools but get the same error. A3M_dedup.bam gets created but does not contain anything. I've also tried runnin dedup with --ignore-umi but again get the same problem. I'm not sure if the dedup process is finishing completely. I'm running it locally on a macbook with 16GB RAM. Could it be crashing? Here's the head of A3M_Aligned.sortedByCoord.out.bam in case it is useful: head.txt Many thanks in advance, Richard