CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI-tools 1.1.5 not working with --per-gene --per-contig --gene-transcript-map #646

Open pclavell opened 5 months ago

pclavell commented 5 months ago

Hello, I ran this command with UMI-tools 1.0.0 to deduplicate based on UMI + gene mapping (mapping to a pantranscriptome with several transcripts per gene), and it worked:

    umi_tools group \
        --method adjacency \
        --edit-distance-threshold=$EDIT_DISTANCE \
        --per-contig \
        --per-gene \
        --gene-transcript-map gencodev44_transcript_map.tsv \
        -I $QUERY \
        --group-out "$NAME"_percontig.tsv \
        --log "$NAME"_percontig.log

The group-out output used to show the gene ID in the gene column, but now it only repeats the transcript ID.

EDIT: I've just installed version 1.0.0 and it works with exactly the same code and inputs, so something changed between 1.0.0 and 1.1.5.
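For anyone wanting to check whether their own output shows this symptom, here is a minimal Python sketch. It assumes the group-out file is tab-separated with a header row containing columns named `contig` and `gene`, and that the contig is the transcript ID when reads are mapped to a transcriptome; adjust the column names if your file differs. The helper name and the example rows are made up for illustration.

```python
import csv
import io


def count_unmapped_gene_rows(handle, gene_col="gene", contig_col="contig"):
    """Count rows whose gene column just repeats the contig (transcript) ID.

    Hypothetical helper; the column names are assumptions about the
    group-out header and are not taken from the issue.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    repeated = 0
    for row in reader:
        total += 1
        if row[gene_col] == row[contig_col]:
            repeated += 1
    return repeated, total


# Made-up rows for illustration only; not real UMI-tools output.
example = io.StringIO(
    "read_id\tcontig\tgene\n"
    "r1\tENST00000473358.1\tENSG00000243485.5\n"
    "r2\tENST00000469289.1\tENST00000469289.1\n"
)
repeated, total = count_unmapped_gene_rows(example)
print(f"{repeated} of {total} rows repeat the transcript ID in the gene column")
```

With a real file, pass `open("NAME_percontig.tsv", newline="")` instead of the `StringIO` object. If every row is counted as repeated, the map is probably not being applied.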

IanSudbery commented 5 months ago

Can you include a snippet of your gencodev44_transcript_map.tsv file?

pclavell commented 5 months ago

It is tab separated

ENSG00000290825.1	ENST00000456328.2
ENSG00000223972.6	ENST00000450305.2
ENSG00000227232.5	ENST00000488147.1
ENSG00000278267.1	ENST00000619216.1
ENSG00000243485.5	ENST00000473358.1
ENSG00000243485.5	ENST00000469289.1
ENSG00000284332.1	ENST00000607096.1
ENSG00000237613.2	ENST00000417324.1
ENSG00000237613.2	ENST00000461467.1
ENSG00000268020.3	ENST00000606857.1
ENSG00000290826.1	ENST00000642116.1
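As a quick sanity check on a map like this one, the sketch below loads it into a transcript-to-gene lookup. It assumes two tab-separated columns per line, gene ID first and transcript ID second, as in the snippet above; the function name is hypothetical and this is not how UMI-tools itself reads the file.

```python
import csv
import io


def load_transcript_to_gene(handle):
    """Return a dict mapping transcript ID -> gene ID.

    Assumes two tab-separated columns per line: gene ID, then transcript ID.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene_id, transcript_id = row[0], row[1]
        mapping[transcript_id] = gene_id
    return mapping


# Three lines copied from the snippet above.
example = io.StringIO(
    "ENSG00000243485.5\tENST00000473358.1\n"
    "ENSG00000243485.5\tENST00000469289.1\n"
    "ENSG00000237613.2\tENST00000417324.1\n"
)
tx2gene = load_transcript_to_gene(example)
print(tx2gene["ENST00000469289.1"])  # ENSG00000243485.5
```

Two of the three transcripts here share a gene, which is the case where the gene column of the output should differ from the transcript ID.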

TomSmithCGAT commented 5 months ago

Ah, I see what's happened here. #577 fixed an issue with group but didn't cover the --gene-transcript-map use case, where the implications of the fix were not obvious. We don't have tests covering that option either, so it wasn't picked up! 🤦

I'll try and issue a patch today/tomorrow.

Note to self: Add switch back to using read tag for gene id when using tx2gene map here: https://github.com/CGATOxford/UMI-tools/blame/9ce3a70a8b35ff9a066d73716680136be71cc70d/umi_tools/group.py#L289-L292. Also add a test to cover!

TomSmithCGAT commented 5 months ago

@pclavell - Could you please try installing the ts_debug_issue646 branch to check whether it resolves the issue? You can install it with, e.g.:

    pip install https://github.com/CGATOxford/UMI-tools/archive/ts_debug_issue646.zip

IanSudbery commented 3 months ago

Any update on this?

pclavell commented 3 months ago

I'm sorry, I missed the last comment. I just ran it with version 1.0.0. This step is now buried in the middle of a snakemake pipeline full of temporary intermediate files, and the inputs have been archived, so testing this would mean everything would have to be recovered and rerun. If you really need it tested, I could try in the coming weeks, but I am a little bit swamped atm. Thank you