fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License
311 stars 67 forks source link

GroupReadsByUMI sorted input #941

Closed FerriolCalvet closed 11 months ago

FerriolCalvet commented 11 months ago

I am using the fgbio suite of tools for processing some duplex sequencing data, and I am seeing that the GroupReadsByUMI step is among the ones that take more time to run, and it cannot be parallelized. I realised that one of the first things that the program does if it detects that the input is not template-coordinate sorted is to start sorting it, so I decided to sort it outside with samtools, in such a way that I can at least parallelize the sorting procedure. But even after providing the input in this format it still complains and sorts again the entire file... I wonder if there is another way in which I could sort the input file or there is any parameter or anything that influences this.

I am attaching some information:

the header of the BAM file provided as input

@HD     VN:1.6  SO:unsorted     GO:query        SS:unsorted:template-coordinate
...
@PG     ID:samtools.4   PN:samtools     PP:samtools.3   VN:1.16.1       CL:samtools sort --template-coordinate --write-index -@ 4 -o K_16_1_A_1.sorted.bam -T K_16_1_A_1.sorted K_16_1_A_1.filtered.bam

the command I use to run GroupReadsByUMI

fgbio \
    -Xmx8g \
    --tmp-dir=. \
    GroupReadsByUmi \
    --edits 1                     --min-map-q 10 \
    --strategy Paired \
    --input K_16_1_A_1.sorted.bam \
    --output K_16_1_A_1_umi-grouped.bam \
    --family-size-histogram K_16_1_A_1_umi_histogram.txt

the first lines of the log file

[2023/10/01 05:53:38 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 2.1.0 as fcalvet...
[2023/10/01 05:53:38 | GroupReadsByUmi | Info] Filtering the input.
[2023/10/01 05:53:39 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2023/10/01 05:53:39 | SamWriter | Info] Seen many non-increasing record positions. Printing Read-names as well.
[2023/10/01 05:54:09 | SamWriter | Info] Wrote      2,000,000 records.  Elapsed time: 00:00:30s.  Time for last 2,000,000:   30s.  Last read position: chr1:7,759,584.  Last read name: A01001:442:H7HGGDSX5:2:2564:18991:33098
[2023/10/01 05:54:36 | SamWriter | Info] Wrote      4,000,000 records.  Elapsed time: 00:00:57s.  Time for last 2,000,000:   27s.  Last read position: chr1:11,107,412.  Last read name: A01001:442:H7HGGDSX5:2:1349:20509:14810
[2023/10/01 05:54:58 | SamWriter | Info] Wrote      6,000,000 records.  Elapsed time: 00:01:19s.  Time for last 2,000,000:   21s.  Last read position: chr1:11,108,064.  Last read name: A01001:442:H7HGGDSX5:4:1244:15085:7106
nh13 commented 11 months ago

@FerriolCalvet my apologies if I am missing it, but I don't see any sorting message in the logs. Can you show me where that's occuring?

FerriolCalvet commented 11 months ago

Sorry I saw that it had the same format as in a previous case where I provided an unsorted input, and I was a bit confused. But I rechecked that again and now I see that it is directly writing it without sorting. Apologies for the confusion, and thank you for your reply!