PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
247 stars 43 forks source link

trouble matching --preset ISOSEQ settings #622

Closed mkostich closed 9 months ago

mkostich commented 9 months ago

Operating system Debian GNU/Linux 11 (bullseye)

Package name

> pbmm2 --version
pbmm2 1.13.0
Using:
  pbmm2    : 1.13.0 (commit v1.13.0-2-gbcd99f5)
  pbbam    : 2.4.99 (commit v2.4.0-23-g59248fe)
  pbcopper : 2.3.99 (commit v2.3.0-28-ga9b1ffa)
  boost    : 1.81
  htslib   : 1.17
  minimap2 : 2.26
  zlib     : 1.2.13

Conda environment

> conda list
# packages in environment at /opt/conda:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                  2_kmp_llvm    conda-forge
bam2fastx                 3.0.0                h9ee0642_0    bioconda
bax2bam                   0.0.11                        0    bioconda
boltons                   23.0.0             pyhd8ed1ab_0    conda-forge
brotlipy                  0.7.0           py39h27cfd23_1003
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.19.1               hd590300_0    conda-forge
ca-certificates           2023.7.22            hbcca054_0    conda-forge
certifi                   2023.7.22          pyhd8ed1ab_0    conda-forge
cffi                      1.15.0           py39hd667e15_1
charset-normalizer        2.0.4              pyhd3eb1b0_0
colorama                  0.4.4              pyhd3eb1b0_0
conda                     23.7.4           py39hf3d152e_0    conda-forge
conda-content-trust       0.1.1              pyhd3eb1b0_0
conda-package-handling    1.8.1            py39h7f8727e_0
cryptography              36.0.0           py39h9ce1e76_0
htslib                    1.17                 h6bc39ce_1    bioconda
idna                      3.3                pyhd3eb1b0_0
isoseq                    4.0.0                h9ee0642_0    bioconda
isoseq3                   4.0.0                h9ee0642_0    bioconda
jsonpatch                 1.32               pyhd8ed1ab_0    conda-forge
jsonpointer               2.4              py39hf3d152e_2    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
ld_impl_linux-64          2.35.1               h7274673_9
libcurl                   7.87.0               h6312ad2_0    conda-forge
libdeflate                1.19                 hd590300_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  he6710b0_2
libgcc-ng                 13.2.0               h807b86a_2    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_2    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
lima                      2.7.1                h9ee0642_0    bioconda
llvm-openmp               16.0.6               h4dfa4b3_0    conda-forge
ncurses                   6.3                  h7f8727e_2
openssl                   1.1.1w               hd590300_0    conda-forge
packaging                 23.1               pyhd8ed1ab_0    conda-forge
pbbam                     2.4.0                h8db2425_0    bioconda
pbccs                     6.4.0                h9ee0642_0    bioconda
pbcopper                  2.3.0                hfce7173_0    bioconda
pbmm2                     1.13.0               h9ee0642_0    bioconda
pbtk                      3.1.0                h9ee0642_0    bioconda
pip                       21.2.4           py39h06a4308_0
pluggy                    1.3.0              pyhd8ed1ab_0    conda-forge
pycosat                   0.6.3            py39h27cfd23_0
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 22.0.0             pyhd3eb1b0_0
pysocks                   1.7.1            py39h06a4308_0
python                    3.9.12               h12debd9_0
python_abi                3.9                      2_cp39    conda-forge
readline                  8.1.2                h7f8727e_1
requests                  2.27.1             pyhd3eb1b0_0
ruamel.yaml               0.17.32          py39hd1e30aa_0    conda-forge
ruamel.yaml.clib          0.2.7            py39h72bdee0_1    conda-forge
ruamel_yaml               0.15.100         py39h27cfd23_0
setuptools                61.2.0           py39h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.38.2               hc218d9a_0
tk                        8.6.11               h1ccaba5_0
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
tqdm                      4.63.0             pyhd3eb1b0_0
tzdata                    2022a                hda174b7_0
urllib3                   1.26.8             pyhd3eb1b0_0
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.6                h166bdaf_0    conda-forge
yaml                      0.2.5                h7b6447c_0
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hfc55251_0    conda-forge

Describe the bug Comparing:

pbmm2 align -j 16 -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5\
 -G 200000 --sort genome.fasta reads.hifi.bam output.bam

to:

pbmm2 align -j $ncpus --preset ISOSEQ --sort genome.fasta reads.hifi.bam output.bam

I matched equivalent parameters per the help returned from pbmm2 align --help:

ISOSEQ : -k 15 -w 5  -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000

But I find the results are very different. This should produce a gapped alignment (cDNA vs genomic assembly), and with --preset ISOSEQ I see many N (skipped) in cigar string, as expected. MAPQ and edit distance look good. However, with the explicit parameter set (no --preset used, but the -k 15 -w 5 ... instead), I see similar CIGAR, but with Ds (deletions) in place of N, much lower MAPQ and much larger edit distances. A typical example of output with --preset ISOSEQ:

transcript/47119        16      chr1    14407   60\
      423=140N69=757N24=1X127=659N159=92N119=1X78=177N136=237N137=172N147=206N99=546N227=5598N154=4429N25=\
    *       0       14939\
   CTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGG\
...
AGGAGTCATGGTGCCTGCGAGCCGCCCTCCCGGAAGCTC  *       ib:Z:0,2,2\
      im:Z:m64039_201204_055021/60163887/ccs,m64039_201204_055021/158862130/ccs       is:i:2  ic:i:56\
 zm:i:47119      RG:Z:e4927d21   mg:f:99.8962    NM:i:2

While with the explicit parameter list (no --preset), the same transcript/genome combo yields:

transcript/47119        16      chr1    14409   30\
      2S420=140D70=757D24=1X127=659D159=92D119=1X76=177D135=237D139=172D146=206D99=546D231=177S\
       *       0       4733\
   CTGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGG\
...
AGGAGTCATGGTGCCTGCGAGCCGCCCTCCCGGAAGCTC  *       ib:Z:0,2,2\
      im:Z:m64039_201204_055021/60163887/ccs,m64039_201204_055021/158862130/ccs       is:i:2  ic:i:56\
 zm:i:47119      RG:Z:e4927d21   mg:f:99.3736    NM:i:2988

Please note the very similar CIGARs, except the D in place of N in the second case.

Are the parameters I'm passing incorrect or missing something?

Error message Completes without complaint; just many fewer decent hits returned.

To Reproduce Steps to reproduce the behavior. Providing a minimal test dataset on which we can reproduce the behavior will generally lead to quicker turnaround time!

Please try with test data of your choice. I've seen it with various HiFi cDNA datasets, from both mouse and human, so don't think there is anything particular to my data.

Expected behavior A clear and concise description of what you expected to happen.

Expect very similar results, given the equivalency of the --preset ISOSEQ option listed by pbmm2 align --help. Am I missing something?

Thanks for your time!

mkostich commented 9 months ago

BTW, tried w/ https://downloads.pacbcloud.com/public/software/installers/smrtlink_11.1.0.166339.zip and the equivalent parameters listed by the pbmm2 align --help there, which differs in that it adds -L 0.5 to --preset ISOSEQ parameter equivalence list. So command: pbmm2 align -j 16 -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 200000 -g 2000 -C 5 -G 200000 -L 0.5 --sort genome.fasta reads.hifi.bam mmap.bam and, unfortunately, find the same issue. I suspect the Ds vs Ns in CIGAR (mentioned in original issue report) may be a hint. Thanks for any help you can provide.

armintoepfer commented 9 months ago

Why are you trying to manually configure pbmm2 instead of using --preset? The source code is available, so you can go and figure it out that it's due to this line

mkostich commented 9 months ago

Thanks for your help Armin. Users of a pipeline incorporating pbmm2 asked for the option to change some parameters to be able to better match very small exons. They are sophisticated users who need to be able to make adjustments for specific cases. This similar situation can be handled using minimap2 by tweaking settings. So I exposed the pbmm2 mapping parameters in the option file for the pipeline, hoping to enable the same performance, but wanted to keep the default settings for those parameters equivalent to --preset ISOSEQ, because that is the option I used until receiving the user request. Unfortunately my attempt to set the defaults by following the documentation failed testing. I took a quick look at the code but could not find the right line, and I could not afford to spend too much time looking for it. I hope you don't mind me reporting the mismatch between the documentation and the actual behavior (might benefit others as well), or asking for some guidance. In any case, thanks again for the help. I really do appreciate it. Best regards, Mitch