isoseq collapse creates empty output

hitowie commented 10 months ago

Operating system Which operating system and version are you using?

Linux HPC cluster

Package name Which package / tool is causing the problem? Which version are you using, use tool --version. Have you updated to the latest version conda update package? Have you updated the complete env by running conda update --all? Have you ensured that your channel priorities are set up according to the bioconda recommendations at https://bioconda.github.io/#set-up-channels?

isoseq version 4.0.0 conda version 4.7.10

Conda environment What is the result of conda list? (Try to paste that between triple backticks.)

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
ca-certificates           2023.08.22           h06a4308_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
isoseq                    4.0.0                h9ee0642_0    bioconda
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
lima                      2.7.1                h9ee0642_0    bioconda
ncurses                   6.3                  h7f8727e_2  
openssl                   1.1.1w               h7f8727e_0  
pbccs                     6.4.0                h9ee0642_0    bioconda
pbmm2                     1.10.0               h9ee0642_0    bioconda
pbpigeon                  1.0.0                hdfd78af_0    bioconda
pip                       19.3.1                   py27_0  
python                    2.7.18               ha1903f6_2  
readline                  8.1.2                h7f8727e_1  
samtools                  1.3.1                         0    bioconda
setuptools                44.0.0                   py27_0  
sqlite                    3.38.5               hc218d9a_0  
tk                        8.6.12               h1ccaba5_0  
wheel                     0.37.1             pyhd3eb1b0_0  
zlib                      1.2.12               h7f8727e_2

Describe the bug A clear and concise description of what the bug is.

I have followed the CLI workflow for Clustering with files after ccs, demultiplex and barcode removal.
$ isoseq refine <fl.bam> <primers.fasta> <flnc.bam>

$ isoseq cluster

> Then moved on to the [CLI workflow for Classification](https://isoseq.how/classification/workflow.html).

$ pbmm2 align --preset ISOSEQ --sort

$ isoseq collapse --do-not-collapse-extra-5exons

> The output files for <collapsed.gff|fasta|txt> are all empty but did not produce an error.

**Error message**
Paste the error message / stack.
>collapse log-file doesn't contain any errors but the output files are empty

|> 20231101 15:15:28.110 -|- INFO -|- ParseInputFiles -|- 0x2ae44a3134c0|| -|- Input ccs is refined.bc1001--bc1001.Q40.flnc.bam |> 20231101 15:15:28.110 -|- INFO -|- ParseInputFiles -|- 0x2ae44a3134c0|| -|- Output prefix is collapsed.bc1001--1001.Q40 |> 20231101 15:15:28.110 -|- INFO -|- CollapseCaller -|- 0x2ae44a3134c0|| -|- Start of computation |> 20231101 15:15:28.110 -|- INFO -|- Run -|- 0x2ae44a3134c0|| -|- Num threads: 48 |> 20231101 15:15:28.125 -|- DEBUG -|- Run -|- 0x2ae44a3134c0|| -|- Transcripts do not contain quality values, will not output collapsed.bc1001--1001.Q40.fastq! |> 20231101 15:15:28.125 -|- INFO -|- Run -|- 0x2ae44a3134c0|| -|- Data will be loaded in batches of size: 4096.000000 MB |> 20231101 15:15:28.131 -|- WARN -|- FlncInfoByName -|- 0x2ae44a3134c0|| -|- Read group information not found for m54363_231019_004600/37880333/ccs |> 20231101 15:15:28.134 -|- INFO -|- Run -|- 0x2ae44a3134c0|| -|- Done. |> 20231101 15:15:28.136 -|- INFO -|- CollapseCaller -|- 0x2ae44a3134c0|| -|- End of computation |> 20231101 15:15:28.136 -|- INFO -|- Runner -|- 0x2ae44a3134c0|| -|- Run Time: 27ms 245us |> 20231101 15:15:28.136 -|- INFO -|- Runner -|- 0x2ae44a3134c0|| -|- CPU Time: 29ms 214us |> 20231101 15:15:28.136 -|- INFO -|- Runner -|- 0x2ae44a3134c0|| -|- Peak RSS: 0.00336838 GB

To Reproduce Steps to reproduce the behavior. Providing a minimal test dataset on which we can reproduce the behavior will generally lead to quicker turnaround time!

Can provide files upon request.

Expected behavior A clear and concise description of what you expected to happen.

Any output generated to <collapsed.fasta|gff|txt>.

jmattick commented 10 months ago

Hi @hitowie,

I noticed you have a warning in the logs. Please confirm that your input is correct. Read group information not found for m54363_231019_004600/37880333/ccs

Also we have recently updated our docs for thecluster2 tool which is a new replacement for cluster here.

If that does not help please upload a minimal reproducible dataset.

hitowie commented 10 months ago

Thank you. I've ran the pipeline again with cluster2 with the same result. I've uploaded my dataset to reproduce the issue.

From what I can tell inspecting all <.bam> files with samtools view -H <.bam> | grep '^@RG' and samtools view <.bam> | grep 'm54363_231019_004600/37880333/ccs' the RG exists.

armintoepfer commented 10 months ago

Thank you for uploading your dataset. We are going to investigate and will inform you.

jmattick commented 10 months ago

Hi @hitowie. It looks like you did not remove the primers with lima before refine. I see that the previous lima step was using Sequel_384_barcodes_v1.barcodeset.xml. If you have a double demux design please run lima again to remove the isoseq primers. Otherwise, undo demultiplexing and use the primers.fasta. $ lima movieX.hifi_reads.bam primers.fasta movieX.fl.bam --isoseq --peek-guess https://isoseq.how/clustering/cli-workflow.html#step-2---primer-removal-and-demultiplexing

hitowie commented 10 months ago

In what way do the target-specific primer sequences influence the downstream analysis? I did targeted sequencing from cDNA with custom primers.

I realized that I provided wrong primer sequences. Here is the correct primers.fasta file:

>5p
GCAGTCGAACATGTAGCTGACTCAGGTCACTTGTACCAGCTACGGACTGC
>3p
TGGATCACTTGTGCAAGCATCACATCGTAGCGTCCACTCTCTGACATGGG

Now I've ran lima again to remove those primers. $ lima <demux.bam> <primers.fasta> <fl.bam> --isoseq --peek-guess --log-level TRACE --log-file lima.log

Here's the fl.lima.summary

ZMWs input                (A) : 3089
ZMWs above all thresholds (B) : 3087 (99.94%)
ZMWs below any threshold  (C) : 2 (0.06%)

ZMW marginals for (C):
Below min length              : 0 (0.00%)
Below min score               : 0 (0.00%)
Below min end score           : 0 (0.00%)
Below min passes              : 0 (0.00%)
Below min score lead          : 0 (0.00%)
Below min ref span            : 0 (0.00%)
Without SMRTbell adapter      : 0 (0.00%)
Undesired 5p--5p pairs        : 1 (50.00%)
Undesired 3p--3p pairs        : 1 (50.00%)

ZMWs for (B):
With different pair           : 3087 (100.00%)
Coefficient of correlation    : 0.00%

ZMWs for (A):
Allow diff pair               : 3089 (100.00%)
Allow same pair               : 3089 (100.00%)

Reads for (B):
Above length                  : 3087 (100.00%)
Below length                  : 0 (0.00%)

and fl.lima.counts

IdxFirst        IdxCombined     IdxFirstNamed   IdxCombinedNamed        Counts  MeanScore
0       1       5p      3p      3087    99

However, the generated fl.5p--3p.bam file appears empty. Am I missing any filtering options for the lima primer removal step or why are those 3087 reads not written to BAM?

jmattick commented 10 months ago

Hi @hitowie. When I do the following steps I get a non-empty collapsed gff.

lima-undo lima.bc1001--bc1001.Q40.fl.bam undo.bam
lima undo.bam correct_primers.fasta redo.lima.bam --isoseq --peek-guess
isoseq refine redo.lima.5p--3p.bam correct_primers.fasta refine.bam
isoseq cluster2 refine.bam cluster2.bam
pbmm2 align --preset ISOSEQ --sort cluster2.bam mouse_GRCm39.fasta mapped.bam
isoseq collapse --do-not-collapse-extra-5exons mapped.bam collapsed.gff

If this does not work for you, we will have an updated version on Bioconda soon.

hitowie commented 10 months ago

Thank you! Primer removal helped. I just had to clear some disk space to write the BAM files. I was able to run isoseq succesfully. Issue is closed.

PacificBiosciences / pbbioconda

isoseq collapse creates empty output #616