still having trouble matching --preset ISOSEQ settings #623

Closed: mkostich closed this issue 9 months ago

mkostich commented 9 months ago

Sorry, I could not figure out how to reopen issue 622, to which this one is directly related.

Operating system

Debian GNU/Linux 11 (bullseye)

Package name

> pbmm2 --version
pbmm2 1.13.0
  pbmm2    : 1.13.0 (commit v1.13.0-2-gbcd99f5)
  pbbam    : 2.4.99 (commit v2.4.0-23-g59248fe)
  pbcopper : 2.3.99 (commit v2.3.0-28-ga9b1ffa)
  boost    : 1.81
  htslib   : 1.17
  minimap2 : 2.26
  zlib     : 1.2.13

Describe the bug

Comparing:

pbmm2 align -j 32 -u --sort -k 15 -w 5 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000 GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a1.bam


pbmm2 align -j 32 --preset ISOSEQ --sort GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a2.bam

With the explicit parameters I get:

> grep -v '^@' test4a1.sam | cut -d $'\t' -f 1,5,6,16
transcript/47119   30  2S420=140D70=757D24=1X127=659D159=92D119=1X76=177D135=237D139=172D146=206D99=546D231=177S   NM:i:2988
transcript/58376   7   3S114=1X90=1X5=1X218=140D70=757D153=659D162=88D120=1X76=177D100=4S  NM:i:1825
transcript/52355   30  370S92=1X87=560D98=624D71=1X3=1X10=49D34=1X9=1X3=1X90=1X6=1X48=1X156=1X511= NM:i:1243
transcript/28505   60  214S90=1X87=560D98=624D71=1X3=1X10=49D34=1X9=1X3=1X90=1X6=1X48=1X156=1X1258=1312D436=1X216=1X83=2S  NM:i:2557
transcript/51064   60  49=4D8=1X241=1X50=1X535=1X47=1X44=1X705=    NM:i:10

While with --preset ISOSEQ I get:

> grep -v '^@' test4a2.sam | cut -d $'\t' -f 1,5,6,16
transcript/47119   60  423=140N69=757N24=1X127=659N159=92N119=1X78=177N136=237N137=172N147=206N99=546N227=5598N154=4429N25=    NM:i:2
transcript/58376   17  1S116=1X90=1X5=1X219=140N69=757N153=659N159=88N2=1X120=1X78=177N102=    NM:i:5
transcript/52355   57  115=1227N102=18549N153=4120N92=1X91=560N96=624N69=1X3=1X14=49N30=1X9=1X3=1X90=1X6=1X48=1X156=1X511= NM:i:10
transcript/28505   60  1S109=1227N102=22822N92=1X91=560N96=624N69=1X3=1X14=49N30=1X9=1X3=1X90=1X6=1X48=1X156=1X1260=1312N434=1X216=1X85=  NM:i:12

Note that for intron-containing alignments the edit distance (NM) is much larger, and the MAPQ is generally lower, with explicit parameters than with --preset ISOSEQ. For those sequences, with --preset ISOSEQ the CIGAR strings contain many N (skipped region) operations, while the corresponding locations with explicit parameters instead have deletions (D) of similar number and length. For the last sequence the results are identical. One distinguishing feature of the last sequence's mapping is that it does not involve large gaps/introns (hint).
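The relationship between the two outputs can be checked directly from the CIGAR strings. The following is a minimal Python sketch, not part of pbmm2: the helper names `op_lengths` and `edit_distance` are made up for illustration. It assumes extended CIGARs that use `=`/`X` rather than `M`, and that NM counts mismatched, inserted, and deleted bases but not `N` skips. The two CIGAR strings are copied from the transcript/58376 lines above.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def op_lengths(cigar):
    """Return the total number of bases per CIGAR operation code."""
    totals = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    return totals


def edit_distance(cigar):
    """Edit distance implied by an =/X CIGAR: X + I + D (N is not counted).

    Assumption: the CIGAR has no M operations, since M hides mismatches.
    """
    totals = op_lengths(cigar)
    return totals["X"] + totals["I"] + totals["D"]


# transcript/58376, explicit parameters (test4a1) vs --preset ISOSEQ (test4a2)
explicit = ("3S114=1X90=1X5=1X218=140D70=757D153=659D162=88D"
            "120=1X76=177D100=4S")
preset = ("1S116=1X90=1X5=1X219=140N69=757N153=659N159=88N"
          "2=1X120=1X78=177N102=")

for label, cigar in (("explicit", explicit), ("preset", preset)):
    totals = op_lengths(cigar)
    print(label, "D:", totals["D"], "N:", totals["N"],
          "NM:", edit_distance(cigar))
```

For this read, the explicit-parameter alignment has 1821 deleted bases and no N bases (implied NM 1825), while the preset alignment has 1821 N bases and no deletions (implied NM 5). Both values match the reported NM tags, so the NM difference is accounted for by the same gaps being reported as D rather than N.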

As far as I can tell from the documentation, what I can glean from the code, and answers previously provided in this forum, the two commands should produce comparable, if not identical, outputs. I would like to figure out what I am doing wrong (am I missing a parameter?), or point out something the code may be doing wrong, or encourage more explicit documentation on this subject. Please note, this behavior has been reproduced with pbmm2 from the official smrttools 11.1 and 12.0 releases as well.

Error message

The runs complete without reporting an error.

To Reproduce

First command is:

pbmm2 align -j 32 -u --sort -k 15 -w 5 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000 GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a1.bam

Second command is:

pbmm2 align -j 32 --preset ISOSEQ --sort GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a2.bam

GRCh38.primary_assembly.genome.fa can be downloaded here:

The gather.test3.fasta file, which contains 5 sequences, is attached. I renamed it gather.test3.fasta.txt in order to upload it (weird thing w/ my work laptop).

Expected behavior

Using the explicit parameters that the documentation lists as equivalent to a preset should result in similar or identical behavior to using that preset. Thanks for your help!!!

gather.test3.fasta.txt

armintoepfer commented 9 months ago

You can't reproduce it without setting the --preset option, as pointed out in 622. The preset sets those minimap2 parameters that aren't exposed to the CLI:

But from the code you can also see that you can override the preset defaults: