CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

umi_tools count ends with TypeError: '<' not supported between instances of 'int' and 'str' #485

Closed yuifu closed 3 years ago

yuifu commented 3 years ago

Hi! Thank you for developing UMI-tools!

I encountered TypeError: '<' not supported between instances of 'int' and 'str' while using the count subcommand.

The command:

umi_tools count \
  --extract-umi-method=tag --umi-tag=RX \
  --paired \
  --per-gene --gene-tag=XT --assigned-status-tag=XS \
  -I test_umi-tools/UMTR_mES010_RDX_T18NS_A10R15_B50U75_P5P7Uc17_UN05_merged.bam.assigned_sorted.bam \
  -S test_umi-tools/counts.tsv.gz

The traceback shows that the error occurred in sam_methods.py:

# UMI-tools version: 1.1.1
# output generated by count --extract-umi-method=tag --umi-tag=RX --per-gene --gene-tag=XT --assigned-status-tag=XS -I test_umi-tools/UMTR_mES010_RDX_T18NS_A10R15_B50U75_P5P7Uc17_UN05_merged.bam.assigned_sorted.bam -S test_umi-tools/counts.tsv.gz
# job started at Tue Jul 20 00:24:03 2021 on 83b9bbee8942 -- fce493b0-a2fd-4e27-b6a7-059d3792186f
# pid: 7, system: Linux 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64
# assigned_tag                            : XS
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# filter_umi                              : None
# gene_tag                                : XT
# gene_transcript_map                     : None
# get_umi_method                          : tag
# ignore_tlen                             : False
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_unmapped                         : False
# paired                                  : False
# per_cell                                : False
# per_contig                              : False
# per_gene                                : True
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='test_umi-tools/UMTR_mES010_RDX_T18NS_A10R15_B50U75_P5P7Uc17_UN05_merged.bam.assigned_sorted.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='test_umi-tools/counts.tsv.gz' encoding='ascii'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# wide_format_cell_counts                 : False
2021-07-20 00:24:03,766 INFO command: count --extract-umi-method=tag --umi-tag=RX --per-gene --gene-tag=XT --assigned-status-tag=XS -I test_umi-tools/UMTR_mES010_RDX_T18NS_A10R15_B50U75_P5P7Uc17_UN05_merged.bam.assigned_sorted.bam -S test_umi-tools/counts.tsv.gz
Traceback (most recent call last):
  File "/usr/local/bin/umi_tools", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/umi_tools/count.py", line 143, in main
    for bundle, key, status in bundle_iterator(inreads):
  File "/usr/local/lib/python3.8/site-packages/umi_tools/sam_methods.py", line 465, in __call__
    do_output, out_keys = self.check_output()
  File "/usr/local/lib/python3.8/site-packages/umi_tools/sam_methods.py", line 292, in check_output
    out_keys = sorted(self.reads_dict.keys())
TypeError: '<' not supported between instances of 'int' and 'str'

My BAM file looks like this:

210308_NS500723_0175_AH72NMBGXH:3:21608:19611:2819      97      chr1    957169  60      70M     =       957180  81      AGTAGGCTCCGGTCATTCTCCTGCAGGAACTTGTAGAACTCGGGGTCTCTGTCCTTCAGCCGAGAGAGCT    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEAEEEAEEEEEEEEE  MC:Z:70M        MD:Z:2C6T60     RG:Z:21030.3    NH:i:1  NM:i:2  MQ:i:60 UQ:i:68 AS:i:-10 QX:Z:E6EE/AA/EE  RX:Z:GTAGCAGTGG XS:Z:Assigned   XN:i:1  XT:Z:ENSG00000188976.11
210308_NS500723_0175_AH72NMBGXH:1:21310:5705:15260      101     chr1    957173  0       *       =       957173  0       GTCCAGTAGGCTCCGGTCATTCTCCTGCAGGAACTTGTAGAACTCGGGGTCTCTGTCCTTCAGCCGAGAG    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE  MC:Z:70M        RG:Z:21030.1    MQ:i:60 QX:Z:EE6AAEAEEA RX:Z:CCCTGGGGTA XS:Z:Unassigned_Unmapped
210308_NS500723_0175_AH72NMBGXH:1:21310:5705:15260      153     chr1    957173  60      70M     =       957173  0       GGCTCCGGTCATTCTCCTGCAGGAACTTGTAGAACTCGGGGTCTCTGTCCTTCAGCCGAGAGAGCTGGTC    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA  MD:Z:5T64       RG:Z:21030.1    NH:i:1  NM:i:1  UQ:i:36 AS:i:-5 QX:Z:EE6AAEAEEA RX:Z:CCCTGGGGTA   XS:Z:Assigned   XN:i:1  XT:Z:ENSG00000188976.11

Is there any way I can handle the above error? (This may not be relevant, but when I mistakenly passed --gene-tag=RX, UMIs were counted for each RX value. I don't know why.)

IanSudbery commented 3 years ago

Do you know how far count gets before it gives this error? I can't immediately see anything wrong.

yuifu commented 3 years ago

@IanSudbery Thank you for your reply. It turns out that UMI-tools has nothing to do with it.

This is probably because, in some records in the BAM file, the XT tag was written twice: once as an integer and once as a string. The clash arose from a conflict between the Picard and featureCounts conventions for the XT tag.

I apologize for the confusion.
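
For anyone who hits the same traceback, here is a minimal sketch of the diagnosis described above. It is not part of UMI-tools. The function names (find_mixed_type_tags, sort_fails, make_record) and the integer tag XT:i:42 are hypothetical, made up for illustration. The sketch assumes the records are available as SAM text lines (for example from samtools view) and looks for optional tags that appear with more than one SAM type code. It also shows why Python 3 cannot sort a mix of int and str keys, which is the error raised by sorted(self.reads_dict.keys()) in the traceback.

```python
from collections import defaultdict


def find_mixed_type_tags(sam_lines):
    """Return {tag: [type codes]} for optional tags seen with more than one SAM type."""
    types_seen = defaultdict(set)
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional fields follow the 11 mandatory SAM columns, as TAG:TYPE:VALUE.
        for field in fields[11:]:
            parts = field.split(":", 2)  # the value itself may contain ':'
            if len(parts) != 3:
                continue
            tag, type_code, _value = parts
            types_seen[tag].add(type_code)
    return {tag: sorted(codes) for tag, codes in types_seen.items() if len(codes) > 1}


def sort_fails(keys):
    """Return True if sorting the keys raises TypeError (e.g. mixed int and str)."""
    try:
        sorted(keys)
    except TypeError:
        return True
    return False


def make_record(*tags):
    """Build a minimal, hypothetical SAM line carrying the given optional tags."""
    mandatory = ["read1", "0", "chr1", "1", "60", "1M", "*", "0", "0", "A", "E"]
    return "\t".join(mandatory + list(tags))


# Hypothetical records: one string-typed XT (gene ID), one integer-typed XT.
records = [
    make_record("RX:Z:GTAGCAGTGG", "XS:Z:Assigned", "XT:Z:ENSG00000188976.11"),
    make_record("RX:Z:CCCTGGGGTA", "XT:i:42"),
]

print(find_mixed_type_tags(records))
print(sort_fails([42, "ENSG00000188976.11"]))
```

If a tag such as XT is reported with both Z and i types, renaming or dropping one of the conflicting tags before running umi_tools count should avoid mixed-type gene keys.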