PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
247 stars 43 forks source link

still having trouble matching --preset ISOSEQ settings #623

Closed mkostich closed 9 months ago

mkostich commented 9 months ago

Sorry I could not figure out how to reopen issue 622, to which this is directly related.

Operating system Debian GNU/Linux 11 (bullseye)

Package name

> pbmm2 --version
pbmm2 1.13.0
Using:
  pbmm2    : 1.13.0 (commit v1.13.0-2-gbcd99f5)
  pbbam    : 2.4.99 (commit v2.4.0-23-g59248fe)
  pbcopper : 2.3.99 (commit v2.3.0-28-ga9b1ffa)
  boost    : 1.81
  htslib   : 1.17
  minimap2 : 2.26
  zlib     : 1.2.13

Conda environment

> conda list
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
bam2fastx                 3.0.0                h9ee0642_0    bioconda
bax2bam                   0.0.11                        0    bioconda
boltons                   23.0.0             pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py39h27cfd23_1003
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.19.1               hd590300_0    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
certifi                   2023.7.22          pyhd8ed1ab_0    conda-forge
cffi                      1.15.0           py39hd667e15_1
charset-normalizer        2.0.4              pyhd3eb1b0_0
colorama                  0.4.4              pyhd3eb1b0_0
conda                     23.7.4           py39hf3d152e_0    conda-forge
conda-content-trust       0.1.1              pyhd3eb1b0_0
conda-package-handling    1.8.1            py39h7f8727e_0
cryptography              36.0.0           py39h9ce1e76_0
htslib                    1.17                 h6bc39ce_1    bioconda
idna                      3.3                pyhd3eb1b0_0
isoseq                    4.0.0                h9ee0642_0    bioconda
isoseq3                   4.0.0                h9ee0642_0    bioconda
jsonpatch                 1.32               pyhd8ed1ab_0    conda-forge
jsonpointer               2.4              py39hf3d152e_2    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
ld_impl_linux-64          2.35.1               h7274673_9
libcurl                   7.87.0               h6312ad2_0    conda-forge
libdeflate                1.19                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  he6710b0_2
libgcc-ng                 13.2.0               h807b86a_2    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_2    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
lima                      2.7.1                h9ee0642_0    bioconda
llvm-openmp               16.0.6               h4dfa4b3_0    conda-forge
ncurses                   6.3                  h7f8727e_2
openssl                   1.1.1w               hd590300_0    conda-forge
packaging                 23.1               pyhd8ed1ab_0    conda-forge
pbbam                     2.4.0                h8db2425_0    bioconda
pbccs                     6.4.0                h9ee0642_0    bioconda
pbcopper                  2.3.0                hfce7173_0    bioconda
pbmm2                     1.13.0               h9ee0642_0    bioconda
pbtk                      3.1.0                h9ee0642_0    bioconda
pip                       21.2.4           py39h06a4308_0
pluggy                    1.3.0              pyhd8ed1ab_0    conda-forge
pycosat                   0.6.3            py39h27cfd23_0
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 22.0.0             pyhd3eb1b0_0
pysocks                   1.7.1            py39h06a4308_0
python                    3.9.12               h12debd9_0
python_abi                3.9                      2_cp39    conda-forge
readline                  8.1.2                h7f8727e_1
requests                  2.27.1             pyhd3eb1b0_0
ruamel.yaml               0.17.32          py39hd1e30aa_0    conda-forge
ruamel.yaml.clib          0.2.7            py39h72bdee0_1    conda-forge
ruamel_yaml               0.15.100         py39h27cfd23_0
setuptools                61.2.0           py39h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.38.2               hc218d9a_0
tk                        8.6.11               h1ccaba5_0
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
tqdm                      4.63.0             pyhd3eb1b0_0
tzdata                    2022a                hda174b7_0
urllib3                   1.26.8             pyhd3eb1b0_0
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

Describe the bug Comparing:

pbmm2 align -j 32 -u --sort -k 15 -w 5 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000 GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a1.bam

To:

pbmm2 align -j 32 --preset ISOSEQ --sort GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a2.bam

With the explicit parameters I get:

> grep -v '^@' test4a1.sam | cut -d $'\t' -f 1,5,6,16
transcript/47119   30  2S420=140D70=757D24=1X127=659D159=92D119=1X76=177D135=237D139=172D146=206D99=546D231=177S   NM:i:2988
transcript/58376   7   3S114=1X90=1X5=1X218=140D70=757D153=659D162=88D120=1X76=177D100=4S  NM:i:1825
transcript/52355   30  370S92=1X87=560D98=624D71=1X3=1X10=49D34=1X9=1X3=1X90=1X6=1X48=1X156=1X511= NM:i:1243
transcript/28505   60  214S90=1X87=560D98=624D71=1X3=1X10=49D34=1X9=1X3=1X90=1X6=1X48=1X156=1X1258=1312D436=1X216=1X83=2S  NM:i:2557
transcript/51064   60  49=4D8=1X241=1X50=1X535=1X47=1X44=1X705=    NM:i:10

While with --preset ISOSEQ I get:

> grep -v '^@' test4a2.sam | cut -d $'\t' -f 1,5,6,16
transcript/47119   60  423=140N69=757N24=1X127=659N159=92N119=1X78=177N136=237N137=172N147=206N99=546N227=5598N154=4429N25=    NM:i:2
transcript/58376   17  1S116=1X90=1X5=1X219=140N69=757N153=659N159=88N2=1X120=1X78=177N102=    NM:i:5
transcript/52355   57  115=1227N102=18549N153=4120N92=1X91=560N96=624N69=1X3=1X14=49N30=1X9=1X3=1X90=1X6=1X48=1X156=1X511= NM:i:10
transcript/28505   60  1S109=1227N102=22822N92=1X91=560N96=624N69=1X3=1X14=49N30=1X9=1X3=1X90=1X6=1X48=1X156=1X1260=1312N434=1X216=1X85=NM:i:12

Note that edit distance is longer, and mapq is lower for intron-containing alignments with explicit parameters than it is with --preset ISOSEQ. For those sequences, with --preset ISOSEQ we have many insertions in cigar string, while corresponding locations with explicit parameters instead have similar number/position of deletions. For the last sequence the results are identical. One distinguishing feature of the last sequence mapping is that it does not involve large gaps/introns (hint).

As far as I can tell from the documentation, what I can glean from code, and answers previously provided in this forum, the two commands should produce very comparable, if not identical outputs. I would like to figure out what I am doing wrong (am I missing a parameter?), or point out something the code may be doing wrong, or encourage more explicit documentation on this subject. Please note, this behavior has been reproduced with pbmm2 from the official smrttools 11.1 (https://downloads.pacbcloud.com/public/software/installers/smrtlink_11.1.0.166339.zip) and 12.0 (https://downloads.pacbcloud.com/public/software/installers/smrtlink_12.0.0.177059.zip) releases as well.

Error message

The runs complete without reporting an error.

To Reproduce

First command is:

pbmm2 align -j 32 -u --sort -k 15 -w 5 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000 GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a1.bam

Second command is:

pbmm2 align -j 32 --preset ISOSEQ --sort GRCh38.primary_assembly.genome.fa gather.test3.fasta test4a2.bam

GRCh38.primary_assembly.genome.fa can be downloaded here: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz.

The gather.test3.fasta file, which I renamed gather.test3.fasta.txt in order to upload (weird thing w/ my work laptop), which contains 5 sequences, is attached.

Expected behavior

Using the parameter equivalency to presets provided in documentation should result in similar or identical behavior to using the corresponding preset. Thanks for your help!!! gather.test3.fasta.txt

armintoepfer commented 9 months ago

You can't reproduce it without setting the --preset option, as pointed out in 622. The preset sets those minimap2 parameters that aren't exposed to CLI: https://github.com/PacificBiosciences/pbmm2/blob/f7cbb14ababd959a29fa8410f5e79876111768d4/src/MM2Helper.cpp#L300

But from the code you can also see that you can override the preset defaults: https://github.com/PacificBiosciences/pbmm2/blob/f7cbb14ababd959a29fa8410f5e79876111768d4/src/MM2Helper.cpp#L335-L361