CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Output of `umi_tools extract` not compatible with `umi_tools count_tab` #651

Open eachanjohnson opened 3 weeks ago

eachanjohnson commented 3 weeks ago

The command `umi_tools extract` results in read names being suffixed with the pattern `_[cell barcode]_[UMI]`. See the docs here for an example.

However, `umi_tools count_tab` expects read names suffixed with the pattern `_[UMI]_[cell barcode]`. See the docs here.

As a result, pipelines naively expecting to use the output of `umi_tools extract` for `umi_tools count_tab` (after e.g. a `cut | sort` manipulation) will have incorrect output.

This does not seem to be simply a documentation error. On this line, `umi_tools count_tab` counts the barcodes using `sam_methods.get_gene_count_tab()`, which by default uses the `sam_methods.get_cell_umi_read_string()` function, returning the tuple `(read_id.split(sep)[-1].encode('utf-8'), read_id.split(sep)[-2].encode('utf-8'))`. For read names output by `extract`, this corresponds to (UMI, cell barcode). But this tuple is then unpacked here as `cell, umi = bc_getter(read_id)`, so the cell barcode and UMI end up swapped.
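To make the mismatch concrete, here is a minimal sketch. It uses a stand-in function that only mirrors the tuple described above, not the actual UMI-tools source, and the read name and barcode sequences are made up for illustration.

```python
def get_cell_umi_read_string(read_id, sep="_"):
    # Stand-in mirroring the tuple described above; NOT the library source.
    # Returns (last field, second-to-last field), both as bytes.
    parts = read_id.split(sep)
    return (parts[-1].encode("utf-8"), parts[-2].encode("utf-8"))


# Hypothetical read name in the extract format: <name>_<cell barcode>_<UMI>
read_id = "READ1_ACGTACGT_TTTTGGGG"

# Unpacked the same way as described for count_tab
cell, umi = get_cell_umi_read_string(read_id)

print("cell:", cell)  # holds the UMI field of the read name
print("umi:", umi)    # holds the cell barcode field of the read name
```

With a read name in the `extract` format, `cell` receives the last field (the UMI) and `umi` receives the second-to-last field (the cell barcode).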

Apologies if I've missed a step, and this behaviour is intended. I thought I should point it out to save others some trouble in future.

IanSudbery commented 2 weeks ago

Thanks for this, it does indeed seem that you are correct. @TomSmithCGAT - any thoughts? Did we switch the order at some point and forget to propagate it through to count_tab?