CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

Custom UMI Deduplication #605

Closed jamesboot closed 1 year ago

jamesboot commented 1 year ago

Hi there,

Thanks for developing the package. We have some sequencing data, single end reads whereby the first 'n' bases are miRNA sequences, we then have a specified adapter sequence, and then after the adapter we have a UMI, after the UMI we have junk bases to the end of the read. To try and process this I have first run cutadapt using the code below - the idea being to take the adapter and UMI sequence and add the matched sequence to the end of the read name using a '' delimiter. I did this as the UMItools documentation says it is expecting a '' delimiter?

cutadapt -a AACTGTAGGCACCATCAATN{12} \
-o ${OUTNAME}_trim_tag_v2.fastq \
--rename '{id}_{match_sequence}' \
${INFILE}

A quick peak inside the output fastq files and it looks as though the matched sequence has been added correctly. Therefore I've next run STAR to algin reads to our reference genome, options below:

STAR \
--runThreadN ${NSLOTS} \
--genomeDir ${GENOME} \
--sjdbGTFfile ${GTF} \
--alignEndsType EndToEnd \
--outFilterMismatchNmax 1 \
--outFilterMultimapScoreRange 0 \
--quantMode TranscriptomeSAM GeneCounts \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 10 \
--outSAMunmapped Within \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outFilterMatchNmin 16 \
--alignSJDBoverhangMin 1000 \
--alignIntronMax 1 \
--outWigType wiggle \
--outWigStrand Stranded \
--outWigNorm RPM \
--readFilesIn ${READS} \
--outFileNamePrefix ${SUBDIR}_ \
--readFilesCommand zcat

Finally, I've then tried running UMItools on the output.bam files:

umi_tools dedup -I ${DIR}/${SAMPLE}_Aligned.sortedByCoord.out.bam --output-stats=deduplicated -S ${SAMPLE}_deduplicated.bam

However, I'm getting the following output, which looks like no UMIs are being detected?

# job started at Thu Oct 12 08:58:59 2023 on ddy101 -- de8c3b66-9549-4b9e-9668-28e66c09721e
# pid: 42040, system: Linux 3.10.0-1160.83.1.el7.x86_64 #1 SMP Wed Jan 25 16:41:43 UTC 2023 x86_64
# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# filter_umi                              : None
# gene_tag                                : None
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_tlen                             : False
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : False
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stats                                   : deduplicated
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='/data/WHRI-GenomeCentre/shares/Projects/NGS_Projects/RNA_Sequencing/Paredes-Esquivel_Ursula/GC-UP-10453/Analysis/STAR_UMI/T1499_Aligned.sortedByCoord.out.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='T1499_deduplicated.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False
2023-10-12 08:58:59,453 INFO total_umis 0
2023-10-12 08:58:59,453 INFO #umis 0
Traceback (most recent call last):
  File "/data/home/hmy961/.conda/envs/umitools/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/data/home/hmy961/.conda/envs/umitools/lib/python3.10/site-packages/umi_tools/umi_tools.py", line 66, in main
    module.main(sys.argv)
  File "/data/home/hmy961/.conda/envs/umitools/lib/python3.10/site-packages/umi_tools/dedup.py", line 310, in main
    read_gn = umi_methods.random_read_generator(
  File "/data/home/hmy961/.conda/envs/umitools/lib/python3.10/site-packages/umi_tools/umi_methods.py", line 187, in __init__
    self.fill()
  File "/data/home/hmy961/.conda/envs/umitools/lib/python3.10/site-packages/umi_tools/umi_methods.py", line 220, in fill
    self.refill_random()
  File "/data/home/hmy961/.conda/envs/umitools/lib/python3.10/site-packages/umi_tools/umi_methods.py", line 191, in refill_random
    self.random_umis = np.random.choice(
  File "numpy/random/mtrand.pyx", line 950, in numpy.random.mtrand.RandomState.choice
ValueError: 'a' cannot be empty unless no samples are taken

Any help in how best to process these custom libraries and perform the UMI deduplication would be brilliant. Many thanks James

IanSudbery commented 1 year ago

Could you possibly share a few lines of the input BAMs?

jamesboot commented 1 year ago

Hi, thanks for your quick reply. I was about to post a few lines of the BAM input file and I realised the problem, there was some file corruption. Not sure how I missed it... I re-ran the deduplication on some corrected BAM files but I'm now getting a new error:

AssertionError: not all umis are the same length(!):  30 - 31

I can see in the output logs that the UMIs are now being detected but looks like they're not the same length for some reason, so I need to revisit my trimming and adding the UMIs to the header.

Apologies for the inconvenience, I'll close this issue!