PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

isoseq3 refine filters 99.9% of reads due to concatemer detection #569

Closed mpmargolis closed 1 year ago

mpmargolis commented 1 year ago

Summary

The isoseq3 refine step filters more than 99.9% of the lima reads, although manual inspection of the lima .bam file indicates little to no internal priming or concatemerization, and the vast majority of reads have polyA tails. There are no barcodes to remove, so only the primers.fa file is passed to refine as accessory input. Please let me know what I can do to resolve the issue. Thank you in advance!

Operating system

CentOS Linux 7 (Core)

Package name

isoseq3 refine

Conda environment

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bam2fastx                 1.3.0                he1c1bb9_8    bioconda
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
ca-certificates           2022.12.7            ha878542_0    conda-forge
curl                      7.88.1               hdc1c0ab_0    conda-forge
htslib                    1.9                  h244ad75_9    bioconda
isoseq3                   3.8.2                h9ee0642_0    bioconda
keyutils                  1.6.1                h166bdaf_0    conda-forge
krb5                      1.20.1               h81ceb04_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libcurl                   7.88.1               hdc1c0ab_0    conda-forge
libdeflate                1.3                  h516909a_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libnghttp2                1.51.0               hff17c54_0    conda-forge
libnsl                    2.0.0                h7f98852_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               hf14f497_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
lima                      2.7.1                h9ee0642_0    bioconda
ncurses                   6.3                  h27087fc_1    conda-forge
openssl                   3.0.8                h0b41bf4_0    conda-forge
pbbam                     1.0.6                hc16d5b3_1    bioconda
pbcopper                  1.3.0                h3e4de3e_0    bioconda
pbskera                   0.1.0                hdfd78af_0    bioconda
pip                       23.0.1             pyhd8ed1ab_0    conda-forge
python                    3.8.16          he550d4f_1_cpython    conda-forge
readline                  8.1.2                h0f457ee_0    conda-forge
setuptools                67.4.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
trim_isoseq_polya         0.0.3                h7c8eefc_0    bioconda
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge

Describe the Issue

I have bulk MAS-IsoSeq reads that have been processed from movie.hifi_reads.bam as follows:

skera split movie.hifi_reads.bam  \
            MAS_adapters.fa \
            movie.segmented.hifi_reads.bam
lima movie.segmented.hifi_reads.bam \
            primers.fa \
            movie.fl.bam \
            --isoseq \
            --peek-guess
isoseq3 refine movie.fl.10X_5p--10X_3p.bam \
            primers.fa \
            movie.fltnc.bam \
            --require-polya \
            --verbose \
            --log-level INFO \
            --log-file movie.refine.log

isoseq3 refine JSON summary

Of the almost 19 million input segmented reads, only about 19k survive chimeric filtering, and none survive polyA trimming.

{
    "_comment": "Created by pbcopper v2.1.0",
    "attributes": [
        {
            "id": "sample_name",
            "name": "Sample Name",
            "value": "unknown species"
        },
        {
            "id": "num_reads_fl",
            "name": "Full-Length Reads",
            "value": 18876077
        },
        {
            "id": "num_reads_flnc",
            "name": "Full-Length Non-Chimeric Reads",
            "value": 19172
        },
        {
            "id": "num_reads_flnc_polya",
            "name": "Full-Length Non-Chimeric Reads with Poly-A Tail",
            "value": 0
        }
    ],
    "dataset_uuids": [],
    "id": "isoseq3_refine",
    "plotGroups": [],
    "tables": [],
    "title": "Iso-Seq Refine Report",
    "uuid": "ffe8c923-b676-43d1-a221-4f41706a3cb4",
    "version": "1.0.1"
}
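
For reference, the counts in this report can be turned into retention percentages with a few lines of shell. This is only a sketch: `pct` is a hypothetical helper name (it is not part of isoseq3), and the three values are copied by hand from the report rather than parsed from the JSON file.

```shell
# pct NUM DEN: print NUM as a percentage of DEN with four decimals.
# Hypothetical helper for illustration; not part of any PacBio tool.
pct() {
    LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.4f\n", 100 * n / d }'
}

fl=18876077       # num_reads_fl          (Full-Length Reads)
flnc=19172        # num_reads_flnc        (Full-Length Non-Chimeric Reads)
flnc_polya=0      # num_reads_flnc_polya  (FLNC reads with poly-A tail)

echo "FLNC retained:         $(pct "$flnc" "$fl")%"
echo "FLNC + polyA retained: $(pct "$flnc_polya" "$fl")%"
```

The first figure is the share of reads that pass concatemer detection; the second is the share that also pass `--require-polya`.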

isoseq3 refine log output

>|> 20230224 19:14:10.139 -|- INFO -|- ParsePositionalArgs -|- 0x2b390020d4c0|| -|- Input Barcode file: input/accessory_files/primers.fa
>|> 20230224 19:14:10.140 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Primer prefixes used to detect concatemers:
>|> 20230224 19:14:10.141 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- CTACACGACGCTCTTCCGATCT
>|> 20230224 19:14:10.141 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- AAGCAGTGGTATCAACGCAGAG
>|> 20230224 19:14:10.141 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- AGATCGGAAGAGCGTCGTGTAG
>|> 20230224 19:14:10.141 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- CTCTGCGTTGATACCACTGCTT
>|> 20230224 19:14:10.142 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Output FLNC bam: mid/iso_refined/PNP_01/PNP_01.fltnc.bam
>|> 20230224 19:14:10.142 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Output summary json: mid/iso_refined/PNP_01/PNP_01.fltnc.filter_summary.report.json
>|> 20230224 19:24:56.575 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Input reads       : 18876077
>|> 20230224 19:24:56.575 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Output reads      : 0
>|> 20230224 19:24:56.575 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Output bases      : 0
>|> 20230224 19:24:56.575 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Filtered RQ reads : 0
>|> 20230224 19:24:56.576 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Missing RQ reads  : 0
>|> 20230224 19:24:56.576 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Run Time          : 10m 46s
>|> 20230224 19:24:56.576 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- CPU Time          : 1h 31m
>|> 20230224 19:24:56.576 -|- INFO -|- Runner -|- 0x2b390020d4c0|| -|- Peak RSS          : 0.266628 GB

primers.fa

>10X_5p
CTACACGACGCTCTTCCGATCT
>10X_3p
AAGCAGTGGTATCAACGCAGAG
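
The refine log lists four "primer prefixes" although primers.fa contains only two sequences. The extra two appear to be the reverse complements of the primers.fa entries, which can be checked with standard tools. A minimal sketch, where `revcomp` is a hypothetical helper name and the input is assumed to be uppercase A/C/G/T only:

```shell
# revcomp SEQ: print the reverse complement of an uppercase DNA string.
# Hypothetical helper for illustration; assumes only A, C, G and T occur.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGT' 'TGCA'
}

revcomp CTACACGACGCTCTTCCGATCT   # 10X_5p from primers.fa
revcomp AAGCAGTGGTATCAACGCAGAG   # 10X_3p from primers.fa
```

The two printed sequences can be compared against the third and fourth prefixes in the log output.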

Representative lima read

The di:# tags from skera and the bt/bl:Z:[primer] tags from lima are present. The read is clear of primer sequences and has a polyA tail.

m64381e_220729_121957/1/ccs/696_1516 4 * 0 255 * * 0 0 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCCAGGTAGATTGACCAAACCACAGCATAATGACCCACTGTGAGAATAATGAACAAGAGTAATGCCAGCTCAGCATTGCTCATTTTTCTCACCCGCCTGTAGTAGAATACAGGCTGTCGCCAATCTGGAAGTCCATTGATCAGAATATCATCATACCTCTGCCTTCGTTCATCATCCTTTAAAACTTCATAAATGGCCACCAATTGTCTAAACTGAGTTTCTGCATTTTCATCTTTATTCTTGTCTGGATGTAAAGTTAGTGAAAGCTTACGATATGCTTTTCTGATGTCTGCAGATGATGCATCCTGCTGCACCCCGAGGAACTGGTAGAAGTTGAGCTGCACCTCCTCCACTAAGTCAAACAACTCCAGGTCTCCGCTCTCCCAGCCGCGCGCCGGCGCCACGGCGGCCAGCAGCAGCAGCAGCAGCCACAGCAGCGGCGTCCGCGGCGGCGGCGGCGGGAACGGCACCAGCCCGAGCTGGCGGCGTCCAGGAAGCTGCGCCGGCTGGGAGCAAGGAGCCGTCATCGCGCTGGGCTCGGAAAGGTCACCCGCCGCGCAGCTCCGTTGGCCGAGAGCTGGGACGTGGCGGGCGGCGCTGGCTGTGGGGAACAGCGCCTGTCAGTGAAAAGCGCGGGCAGGCGCACCGGAGCGGCCCGCCAGGTGGCTGGCCCCAGACAGAGCGCGGAGGCGGCGGGAGCCGGCTGCCGGACGGGCGGGTGGGTAGGCGGGCGGGGCCGCAGCCAGCGCTACGTTCCGAAGACCCTCGCCCCCAGGCCTACACCCCATGTA ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~?~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~l~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~f~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~k~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ML:B:C,0,154,0,55,132,1,0,0,0,0,0,65,35,1,0,2,0,1,212,193,180,6,2,9,50,3,0,4,0,0,2,7,0,0,4,59,73,0,0,11,47,7,0,5,0,0,0,0,0,0,0,0,0,0,120,0,0,11,1,1,1,16 MM:Z:C+m?,24,4,13,19,12,17,6,0,0,1,0,2,0,11,0,1,0,0,0,0,0,0,5,1,0,3,1,3,1,0,2,3,1,0,3,1,1,0,0,0,3,3,0,1,2,0,2,8,0,0,0,1,2,0,0,0,0,1,3,1,1,3; 
ac:B:i,24,0,24,0 bx:B:i,23,22 di:i:1 dl:i:1 dr:i:2 ds:B:C,130,164,108,101,102,116,145,133,165,108,97,98,101,108,161,49,162,113,101,205,2,161,162,113,115,205,2,145,164,113,117,97,108,176,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,163,115,101,113,176,65,67,84,84,71,84,65,65,71,67,84,71,84,67,84,65,165,114,105,103,104,116,145,133,165,108,97,98,101,108,161,50,162,113,101,205,6,18,162,113,115,205,6,2,164,113,117,97,108,176,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,163,115,101,113,176,65,67,84,67,84,71,84,67,65,71,71,84,67,67,71,65 ec:f:24.7954 ma:i:0 np:i:23 rq:f:0.999861 sn:B:f,13.6805,20.1716,4.76774,8.82862 we:i:10366278 ws:i:18 zm:i:1 qs:i:696 qe:i:1516 bc:B:S,0,1 bq:i:100 cx:i:12 bl:Z:TCTACACGACGCTCTTCCGATCT bt:Z:CTCTGCGTTGATACCACTGCTT ql:Z:~~~~~~~~~~~~~~~~~~~~~~+ qt:Z:~~~~~~~~~~~~~~~~~~~~~~ ls:B:C,134,162,98,99,164,48,45,45,49,162,98,113,100,164,108,101,97,100,132,164,99,53,109,99,147,2,0,3,164,113,53,109,99,147,32,1,5,162,113,108,183,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,43,162,115,113,183,84,67,84,65,67,65,67,71,65,67,71,67,84,67,84,84,67,67,71,65,84,67,84,166,110,101,115,116,101,100,192,164,112,53,109,99,1,165,116,114,97,105,108,132,164,99,53,109,99,145,14,164,113,53,109,99,145,2,162,113,108,182,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,126,162,115,113,182,67,84,67,84,71,67,71,84,84,71,65,84,65,67,67,65,67,84,71,67,84,84 RG:Z:7e958e55/0--1
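
The manual inspection described above can be approximated on a larger sample of reads. The sketch below is not how isoseq3 refine detects concatemers; it only counts exact string matches. The helper names `count_primer_reads` and `count_polyt_head` are hypothetical, the sequences in the demo are toy data, and the commented pipeline assumes `samtools` is installed (it is not in the conda environment listed above). Note that the representative read starts with a run of T, which is presumably the polyA tail seen in reverse-complement orientation.

```shell
# count_primer_reads P1 [P2 ...]: read one sequence per line on stdin and
# print how many lines contain at least one of the given strings exactly.
count_primer_reads() {
    pattern=$(printf '%s|' "$@")          # join arguments as "P1|P2|"
    grep -c -E "${pattern%|}" || true     # drop trailing "|"; keep exit status 0
}

# count_polyt_head MIN: print how many stdin lines start with at least
# MIN consecutive T bases.
count_polyt_head() {
    grep -c -E "^T{$1,}" || true
}

# Toy demo: the first and third lines contain a primer or its reverse
# complement, the second starts with a poly-T run.
printf '%s\n' \
    AAAACTACACGACGCTCTTCCGATCTGGG \
    TTTTTTTTTTCCAGG \
    GGAGATCGGAAGAGCGTCGTGTAGAA |
    count_primer_reads CTACACGACGCTCTTCCGATCT AGATCGGAAGAGCGTCGTGTAG

# On real data (assumes samtools; column 10 of SAM output is the sequence):
#   samtools view movie.fl.10X_5p--10X_3p.bam | cut -f 10 | head -n 10000 |
#       count_polyt_head 20
```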

armintoepfer commented 1 year ago

Have you figured it out on your own? If not, let us know.