CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI-tool 1.1.5 not working with --per-gene --per-contig --gene-transcript-map #646

Open pclavell opened 1 month ago

pclavell commented 1 month ago

Hello, I ran this code with UMI-tools 1.0.0 to deduplicate based on UMI + gene mapping (mapping to a pantranscriptome with several transcripts per gene), and it worked:

    umi_tools group \
        --method adjacency \
        --edit-distance-threshold=$EDIT_DISTANCE \
        --per-contig \
        --per-gene \
        --gene-transcript-map gencodev44_transcript_map.tsv \
        -I $QUERY \
        --group-out "$NAME"_percontig.tsv \
        --log "$NAME"_percontig.log

With 1.0.0, the gene column of the --group-out file showed the gene ID, but with 1.1.5 it only repeats the transcript ID.

EDIT: I've just installed version 1.0.0 and it works with exactly the same code and inputs, so the behaviour changed somewhere between 1.0.0 and 1.1.5.
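For anyone who wants to check whether their own output shows the same symptom, here is a minimal Python sketch that counts rows where the gene column simply repeats the contig (transcript) ID. It assumes the --group-out file is tab-separated with a header containing columns named `contig` and `gene`; adjust the column names if your file differs. The sample rows are made up for illustration and only reuse IDs from this thread.

```python
import csv
import io

# Hypothetical group-out rows (illustration only, not real output).
# Row 2 shows the symptom: the gene column repeats the transcript ID.
SAMPLE_GROUP_OUT = (
    "read_id\tcontig\tgene\n"
    "read1\tENST00000473358.1\tENSG00000243485.5\n"
    "read2\tENST00000469289.1\tENST00000469289.1\n"
)


def count_rows_with_transcript_as_gene(handle, contig_col="contig", gene_col="gene"):
    """Return (suspicious, total): rows whose gene column equals the contig column.

    The column names are assumptions about the group-out header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    suspicious = 0
    total = 0
    for row in reader:
        total += 1
        if row[gene_col] == row[contig_col]:
            suspicious += 1
    return suspicious, total


suspicious, total = count_rows_with_transcript_as_gene(io.StringIO(SAMPLE_GROUP_OUT))
print(f"{suspicious} of {total} rows repeat the transcript ID in the gene column")
```

To run it on a real file, pass `open("NAME_percontig.tsv", newline="")` instead of the `io.StringIO` sample.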

IanSudbery commented 1 month ago

Can you include a snippet of your gencodev44_transcript_map.tsv file?

pclavell commented 1 month ago

It is tab-separated:

ENSG00000290825.1	ENST00000456328.2
ENSG00000223972.6	ENST00000450305.2
ENSG00000227232.5	ENST00000488147.1
ENSG00000278267.1	ENST00000619216.1
ENSG00000243485.5	ENST00000473358.1
ENSG00000243485.5	ENST00000469289.1
ENSG00000284332.1	ENST00000607096.1
ENSG00000237613.2	ENST00000417324.1
ENSG00000237613.2	ENST00000461467.1
ENSG00000268020.3	ENST00000606857.1
ENSG00000290826.1	ENST00000642116.1
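As a quick sanity check on a map file in this layout, the following Python sketch reads gene/transcript pairs (gene ID in the first column, transcript ID in the second, as in the snippet above) and builds lookups in both directions. This is not UMI-tools code; the function name is hypothetical, and skipping blank lines and `#` comment lines is an assumption made for convenience.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the snippet above: gene ID <TAB> transcript ID.
SAMPLE_MAP = (
    "ENSG00000290825.1\tENST00000456328.2\n"
    "ENSG00000243485.5\tENST00000473358.1\n"
    "ENSG00000243485.5\tENST00000469289.1\n"
)


def load_gene_transcript_map(handle):
    """Parse a two-column TSV into transcript->gene and gene->transcripts lookups."""
    transcript_to_gene = {}
    gene_to_transcripts = defaultdict(set)
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        # Assumption: blank lines and '#' comment lines are ignored.
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 2:
            raise ValueError(f"line {line_no}: expected two tab-separated columns")
        gene, transcript = row[0], row[1]
        transcript_to_gene[transcript] = gene
        gene_to_transcripts[gene].add(transcript)
    return transcript_to_gene, dict(gene_to_transcripts)


tx2gene, gene2tx = load_gene_transcript_map(io.StringIO(SAMPLE_MAP))
print(tx2gene["ENST00000469289.1"])
print(sorted(gene2tx["ENSG00000243485.5"]))
```

A file that was accidentally saved with spaces instead of tabs would fail the two-column check here, which is one thing worth ruling out before suspecting the tool.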

TomSmithCGAT commented 3 weeks ago

Ah, I see what's happened here. #577 fixed an issue with group but didn't cover the --gene-transcript-map use case, where the implications of the fix weren't obvious. We don't have tests covering that option either, so it wasn't picked up! 🤦

I'll try and issue a patch today/tomorrow.

Note to self: Add switch back to using read tag for gene id when using tx2gene map here: https://github.com/CGATOxford/UMI-tools/blame/9ce3a70a8b35ff9a066d73716680136be71cc70d/umi_tools/group.py#L289-L292. Also add a test to cover!

TomSmithCGAT commented 3 weeks ago

@pclavell - Could you please try installing the ts_debug_issue646 branch to check whether this resolves the issue? You can install it with e.g.:

    pip install https://github.com/CGATOxford/UMI-tools/archive/ts_debug_issue646.zip