Can't split CCS bam file during 'preproc:split_sample_amplicons:SSA' step

sh940202123 commented 2 years ago

Hi, thank's for the hard work in this pipeline. Everything works fine with the test dataset, but it reported error with real pacbio subread.bam file.

Here is the command which I use.

➜ nextflow run main.nf -resume -profile preproc,docker -c PGx-2D6-R10.preproc.config
N E X T F L O W  ~  version 21.04.3                                                                                                                    
Launching `main.nf` [grave_faggin] - revision: bcbedb7c28                                                                                              

------- PLASTER: pre-processing -------                                                                                                                

executor >  local (2)                                                                                                                                  
[3d/a0ede2] process > preproc:prep_ref:wget                  [100%] 1 of 1, cached: 1 ✔                                                                
[66/be061e] process > preproc:prep_ref:mmi                   [100%] 1 of 1, cached: 1 ✔                                                                
[c0/9249f5] process > preproc:pb_ccs:ccs (92)                [100%] 100 of 100, cached: 100 ✔                             
[09/1fc07e] process > preproc:pb_ccs:merge                   [100%] 1 of 1, cached: 1 ✔
[a5/694c74] process > preproc:extract_ccs_failed             [  0%] 0 of 1
executor >  local (2)                                                       
[3d/a0ede2] process > preproc:prep_ref:wget                  [100%] 1 of 1, cached: 1 ✔
[66/be061e] process > preproc:prep_ref:mmi                   [100%] 1 of 1, cached: 1 ✔                                
[c0/9249f5] process > preproc:pb_ccs:ccs (92)                [100%] 100 of 100, cached: 100 ✔
[09/1fc07e] process > preproc:pb_ccs:merge                   [100%] 1 of 1, cached: 1 ✔
[-        ] process > preproc:extract_ccs_failed             -              
[99/5e159d] process > preproc:pb_lima:lima (CCS)             [100%] 1 of 1, cached: 1
[-        ] process > preproc:pb_lima:merge_smry             -                                                                                         
[4b/5f1598] process > preproc:pb_mm2 (CCS:false)             [100%] 2 of 2, cached: 2
[0b/1728b8] process > preproc:annotate_samples:AS (CCS:true) [100%] 1 of 1, cached: 1
[78/ceefa6] process > preproc:annotate_amplicons (CCS:true)  [100%] 2 of 2, cached: 2
[4a/6209cc] process > preproc:pb_mm2_2 (CCS:true)            [100%] 2 of 2, cached: 2
[c3/6f3792] process > preproc:split_sample_amplicons:SSA (1) [100%] 1 of 1, failed: 1
[-        ] process > preproc:index_bam                      -              
[a1/0e2595] process > preproc:alignment_stats (CCS:true)     [100%] 2 of 2, cached: 2
[-        ] process > preproc:pre_processing_report          -
Error executing process > 'preproc:split_sample_amplicons:SSA (1)'

Caused by:                            
  Missing output file(s) `*LB-*.SM-*.AM-*.bam` expected by process `preproc:split_sample_amplicons:SSA (1)`

Command executed:

  samtools index preproc-run.CCS.true.sm_am_annot.mm2.bam
  samtools view -u preproc-run.CCS.true.sm_am_annot.mm2.bam |
      bam_split_sample_amplicons.py - --lb-tag preproc-run

Command exit status:
  0                                                                         

Command output:
  (empty)                             

Command error:                                                              
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.

Work dir:
  /home/jerry/pipline/PLASTER/work/c3/6f3792b6923c7f8456df14316a0b60

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

And it didn't output any split bam files under the work path.

➜ ls -1a work/c3/6f3792b6923c7f8456df14316a0b60/                                                                                                (base) 
./
../
.command.begin
.command.err
.command.log
.command.out
.command.run
.command.sh
.command.trace
.exitcode
preproc-run.CCS.true.sm_am_annot.mm2.bam@
preproc-run.CCS.true.sm_am_annot.mm2.bam.bai

bin/bam_split_sample_amplicons.py file, line 35-55

def main(in_bam, lb_tag, prefix):
out_bams = {}
counts = {}
with AlignmentFile(in_bam, 'rb', check_sq=False) as af:
    header_dict = af.header.to_dict()
    q = Queue(5)
    BamWriteThread(q).start()
    for read in af:
        if get_tag(read, 'PP') == '1':
            read.set_tag('RG', '1')
            sm_am = (read.get_tag('SM'), read.get_tag('AM'))
            print("name")
            if sm_am not in out_bams:
                out_bams[sm_am] = bam_file(prefix, lb_tag, *sm_am, header_dict)
                counts[sm_am] = 0
            q.put((out_bams[sm_am], read))
            counts[sm_am] += 1
    q.join()
for k, bam in out_bams.items():
    bam.close()
    print(counts[k], file=open(bam.filename.decode() + '.count', 'wt'))

It seems only the reads with PP tag & equal 1 will be processed & output to bam file. But the input bam file preproc-run.CCS.true.sm_am_annot.mm2.bam didn't found any read have PP tag with "1" value. Do I need to process my subread before using this pipeline? Thank you for the time to check this error for me.

Best, Jerry

Additional Info

PGx-2D6-R10.preproc.config file

params {
subreads_bam = '@PROJECT_DIR@/PGx-2D6-R10/preproc/m54126_211116_085604.subreads.bam'
barcodes_fasta = '@PROJECT_DIR@/PGx-2D6-R10/preproc/barcodes.fasta'
amplicons_json = '@PROJECT_DIR@/PGx-2D6-R10/preproc/amplicons.json'
ref_fasta = 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz'
}

PGx-2D6-R10 folder

PGx-2D6-R10/preproc
├── amplicons.json
├── barcodes.fasta
├── m54126_211116_085604.subreads.bam
└── m54126_211116_085604.subreads.bam.pbi

amplicons.json file

{
"CYP2D6": {
"chrom": "chr22",
"start": 42125398,
"end": 42131503,
"strand": "-",
"fwd_primer": "TGTGAATATTGTCTTTGTGTGGGTG",
"rvs_primer": "CAGGACTCAGGTAATCATATGCTCA"
},
"CYP2D7": {
"chrom": "chr22",
"start": 42137550,
"end": 42145176,
"strand": "-",
"fwd_primer": "TGTGAATATTGTCTTTGTGTGGGTG",
"rvs_primer": "CAGGACTCAGGTAATCATATGCTCA"
}
}

part of barcodes.fasta file, total 96 barcodes included

>bc1001
CACATATCAGAGTGCG
>bc1002
ACACACAGACTGTGAG
>bc1003
ACACATCTCGTGAGAG
>bc1004
CACGCACACACGCGCG
>bc1005
CACTCGACTCTCGCGT

jemunro commented 2 years ago

The PP ("Proper Pair") tag is used to indicate the gene-specific primers are present and in the correct orientation. I note that you are using the same primer specification as the publication. Was your data generated with these primers? If not then none of then reads will have the PP tag set.

sh940202123 commented 2 years ago

It works after change to proper primer sequence, thanks~~:) But I got another error at last process "preproc:annotate_amplicons (SR:false)"

[10/c212ce] process > preproc:index_bam (bc1066:CYP2D6)      [100%] 14 of 14, cached: 14                                                         [4/45]
[29/38be26] process > preproc:alignment_stats (CCS:true)     [100%] 2 of 2, cached: 2                                                                  
[-        ] process > preproc:pre_processing_report          -                                                                                         
Error executing process > 'preproc:annotate_amplicons (SR:false)'                                                                                      

Caused by:                                                                                                                                             
  Process `preproc:annotate_amplicons (SR:false)` terminated with an error exit status (1)

Command executed:

  samtools view -u preproc-run.SR.lima.no_bc.mm2.bam |                      
      bam_annotate_amplicons.py - \                                         
          --window 500 \                                                    
          --max-dist 2 \                                                    
          --amplicons amplicons.json \
          --out preproc-run.SR.false.sm_am_annot.bam

Command exit status:
  1                                   

Command output:                                                             
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File "/home/jerry/pipline/PLASTER/bin/bam_annotate_amplicons.py", line 177, in <module>
      main(args.in_bam, args.amplicons, args.window, args.max_dist, args.out)
    File "/home/jerry/pipline/PLASTER/bin/bam_annotate_amplicons.py", line 127, in main
      mp = match_pattern(read.seq[:window], amp[primer], max_dist=max_dist, refine=True)
    File "/home/jerry/pipline/PLASTER/bin/bam_annotate_amplicons.py", line 35, in match_pattern
      idx, min_dist = min(enumerate(ds), key=operator.itemgetter(1))
  ValueError: min() arg is an empty sequence

Work dir:
  /home/jerry/pipline/PLASTER/work/b9/fae13008ad31d1b0258dff26330289

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

It seems it can't annotate by the primer which I use. Are these primers not compatible with this pipeline?

jemunro commented 2 years ago

Hi Jerry,

Any primers should be compatible.

I think you might be running into some kind of edge case bug. Are you able to check a few things for me so I can figure out what is going wrong?

In the working directory /home/jerry/pipline/PLASTER/work/b9/fae13008ad31d1b0258dff26330289, can you run a few commands and let me know the output (samtools required):

samtools view preproc-run.SR.lima.no_bc.mm2.bam | wc -l
samtools view preproc-run.SR.false.sm_am_annot.bam | wc -l
cat amplicons.json

Cheers

sh940202123 commented 2 years ago

Hi Jemunro,

Here is the output, and it seems not all reads are processed.

➜ samtools view preproc-run.SR.lima.no_bc.mm2.bam | wc -l                                                                                   (samtools) 
3520764

➜ samtools view preproc-run.SR.false.sm_am_annot.bam | wc -l                                                                                (samtools) 
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
188849

➜ cat amplicons.json                                                                                                                        (samtools) 
{
  "CYP2D6": {
    "chrom": "chr22",
    "start": 42125398,
    "end": 42131503,
    "strand": "-",
    "fwd_primer": "GCAGTCGAACATGTAGCTGACTCAGGTCACATGGCAGCTGCCATACAATCCACCTG",
    "rvs_primer": "TGGATCACTTGTGCAAGCATCACATCGTAGCGACTGAGCCCTGGGAGGTAGGTAG"
  },
  "CYP2D7": {
    "chrom": "chr22",
    "start": 42137550,
    "end": 42145176,
    "strand": "-",
    "fwd_primer": "TGTGAATATTGTCTTTGTGTGGGTG",
    "rvs_primer": "CAGGACTCAGGTAATCATATGCTCA"
  }
}

jemunro commented 2 years ago

Hi Jerry,

A couple of things:

Double check the start and end coordinates of your primers. Based on a ncbi blast it looks like your fwd_primer is closer to 42132626 (end).
Do the primers contain adaptor/universal barcoding sequence? If so you should remove that and just retain the portion that matches the genomic sequence.
Do your primers amplify both CYP2D6 and CYP2D7 (as is the case in the sample dataset and the publication)? If not I suggest you delete the CYP2D7 entry from amplicons.json.

Could you try ammending the above points and let me know if that fixes the issue.

Cheers

jemunro commented 2 years ago

It would also be very helpful if you could share part of the BAM file that is causing the issue, e.g. with:


cd /home/jerry/pipline/PLASTER/work/b9/fae13008ad31d1b0258dff26330289
(samtools view -H preproc-run.SR.lima.no_bc.mm2.bam; samtools view preproc-run.SR.lima.no_bc.mm2.bam.bam | head -n188899 | tail -n100) | samtools view -b -o sample.bam`

sh940202123 commented 2 years ago

Hi Jemunro,

It works after changing the primer sequence of amplicon.json file. Thanks for your timely help!

We amplified extra bases of CYP2D6 from both upstream & downstream.
Yes, the first 29 bases are pacbio universal adaptor sequences.
We only sequencing CYP2D6 in this datasets. So I removed CYP2D7 info from preproc/amplicons.json, and it runs normally.

Then I tried to run typing process and it have similar issues as pre-process. Do I need to extend the coordinate of typing/amplicon.json file as same as preproc process? Can I remove CYP2D7 & fusion gene info in typing/amplicon.json as same as preproc process?

preproc/amplicon.json

{
  "CYP2D6": {
    "chrom": "chr22",
    "start": 42126037,
    "end": 42132626,
    "strand": "-",
    "fwd_primer": "CATGGCAGCTGCCATACAATCCACCTG",
    "rvs_primer": "CGACTGAGCCCTGGGAGGTAGGTAG"
  }
}

typing/amplicon.json

{
  "CYP2D6": {
    "chrom": "chr22",
    "start": 42125398,
    "end": 42131503,
    "strand": "-",
    "fusion": "CYP2D7",
    "pharmvar_gene": "CYP2D6",
    "pharmvar_ver": "4.2.6.1",
    "vep_feature":"ENST00000645361"
  },
  "CYP2D7": {
    "chrom": "chr22",
    "start": 42137550,
    "end": 42145176,
    "strand": "-"
  }
}

Best, By Jerry

jemunro commented 2 years ago

Hi Jerry,

Thats right, based on you preprocessing amplicons.json, you should use the following amplicons.json for the typing component:

{
  "CYP2D6": {
    "chrom": "chr22",
    "start": 42126037,
    "end": 42132626,
    "strand": "-",
    "fwd_primer": "CATGGCAGCTGCCATACAATCCACCTG",
    "rvs_primer": "CGACTGAGCCCTGGGAGGTAGGTAG"
    "pharmvar_gene": "CYP2D6",
    "pharmvar_ver": "4.2.6.1",
    "vep_feature":"ENST00000645361"
  }
}

Note that you could also use this for the preprocessing component; pharmvar_gene, pharmvar_ver and vep_feature are only used by the typing component and would be ignored at the preprocessing stage.

jemunro commented 2 years ago

Hi Jerry, just cheking if this is working for you now so I can close out this issue?

bahlolab / PLASTER

Can't split CCS bam file during 'preproc:split_sample_amplicons:SSA' step #11