PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License
251 stars 45 forks source link

no bam output after running isoseq primer removal #7

Closed zhenzhenyang-psu closed 6 years ago

zhenzhenyang-psu commented 6 years ago

Dear Armin, Thanks for developing isoseq3 tool to analyze the pacBio IsoSeq data. I have two samples (wild type and knock out of a gene of interest) for which I generated pacBio isoSeq data to investigate alternative splicing.

For wt samples, the primer removal turns out to be good. However, for the knock out sample, I didn't get any bam output. Yet for the wild type sample, I got the bam output "wt.demux.primer_5p--primer_3p.bam".

The commands I ran for both samples are similar: (1) For knock out sample: lima -j 10 --isoseq --dump-clips --no-pbi ../1_ccs/ko.bam primers.fasta ko.demux.bam

(2) for wild type sample: lima -j 10 --isoseq --dump-clips --no-pbi ../1_ccs/wt.bam primers.fasta wt.demux.bam

The summary output for wild type (wt.demux.lima.summary) is like below: ZMWs input (A) : 240617 ZMWs above all thresholds (B) : 179301 (75%) ZMWs below any threshold (C) : 61316 (25%)

ZMW marginals for (C): Below min length : 1582 (3%) Below min score : 0 (0%) Below min end score : 36519 (60%) Below min passes : 393 (1%) Below min score lead : 0 (0%) Below min ref span : 21377 (35%) Without adapter : 393 (1%) Undesired 5p--5p pairs : 13252 (22%) Undesired 3p--3p pairs : 27381 (45%) Undesired no hit : 393 (1%)

ZMWs for (B): With different barcodes : 179301 (100%) Coefficient of correlation : 0%

ZMWs for (A): Allow diff barcode pair : 240224 (100%) Allow same barcode pair : 240224 (100%)

Reads for (B): Above length : 179301 (100%) Below length : 0 (0%)

The output for knock out sample (ko.demux.lima.summary) is like below: ZMWs input (A) : 44194 ZMWs above all thresholds (B) : 0 (0%) ZMWs below any threshold (C) : 44194 (100%)

ZMW marginals for (C): Below min length : 587 (1%) Below min score : 0 (0%) Below min end score : 43616 (99%) Below min passes : 578 (1%) Below min score lead : 0 (0%) Below min ref span : 43559 (99%) Without adapter : 578 (1%) Undesired 5p--5p pairs : 22716 (51%) Undesired 3p--3p pairs : 16185 (37%) Undesired no hit : 578 (1%)

ZMWs for (B): Coefficient of correlation : -nan%

ZMWs for (A): Allow diff barcode pair : 43616 (99%) Allow same barcode pair : 43616 (99%)

Reads for (B): Above length : 0 (-nan%) Below length : 0 (-nan%)

I noticed that the number of above length contigs in knockout is 0. Does this mean that it is a failed run in the knockout sample?

Also, I wonder what "reads for (A) and reads for (B) mean in the summary output.

As you are the expert who wrote the software, do you have an idea of what this means?

Many thanks for your time in reading this email.

I look forward to hearing from you.

Best,

Zhenzhen

armintoepfer commented 6 years ago

Dear Zhenzhen,

thank you for using lima and isoseq3. Indeed, the WT sample looks good. The culprit for KO sample is this line:

Below min end score : 43616 (99%)

It looks like it could not find the primers. I suspect that during sample prep, primer were not added.

I recommend that you run your KO sample through CCS, just like you would do after lima. Once you have the CCS reads, try to find the primers by aligning them them against the begin and end of each CCS reads. That will give us further clarification, if you either used different primers for your KO sample or primers are missing.

Let me know how that works for you.

zhenzhenyang-psu commented 6 years ago

Dear Armin, Thanks for your reply.

I will use bam2fastx package to convert my bam file to fasta sequences, and then align against themselves to identify the possible primers.

Thanks for your insightful idea.

Another thing I noticed is that my input bam file for wild type is 11G, whereas for knockout sample is 6.4G. I am thinking whether it means it is a possible failure of the run?

I will run the analysis and let you know.

Many thanks,

Zhenzhen

armintoepfer commented 6 years ago

Troubleshooting your individual sequencing run is beyond what I can offer here. Please refer to the person that ran the sequencing of your sample for further run metrics.

zhenzhenyang-psu commented 6 years ago

Hi Armin, I examined the reads after running ccs for both wild type and knockout samples. By selecting and multiple aligning the first and last 25 nucleotides of the reads for knockout sample and wild type sample, I found there are some consensus primer sequences in wild type sample, but not in knock out.

So you are right!

I have contacted the person who ran the experiment, she said that she added primers. So somehow after sequencing in knockout sample, the reads in the knockout sample didn't contain primer sequences as in wild type sample. I dodn't know how this happened.

Regardless, I was wondering if there is a way to run "lima isoseq" without removing primer sequences.

The current isoseq3 command for this step requires the user to provide primers.fasta. For the knockout sample, I only provided one single nucleotide for 5' and 3'primer to make it run.

primer_5p A primer_3p G

It generated "ko.demux.primer_5p--primer_3p.bam" successfully. However, after running following clustering step, it didn't generate "ko.unpolished.bam" for knockout sample. Yet it didn't report any error.

I am not sure if no primer sequences are detected in the output sequencing reads is related to spurious chemistry.

As it is really expensive for us to rerun the sample, do you think it is possible for you to fix the code for cases where no primer sequences can be used while running the isoseq3 pipeline?

Really really sorry to bother you.

Thanks,

Look forward to your response.

Zhenzhen

armintoepfer commented 6 years ago

Hi Zhenzhen,

our software is for samples that were correctly prepared. There is no way to fake your way through lima. The only thing you can try, use the ccs data directly in isoseq3 cluster. You might get something out, but it will be subsubsuboptimal. This is beyond the official protocol and outside performance guidelines.

I'm closing this issue, because your sample is physically missing primers, which is unsupported.

zhenzhenyang-psu commented 6 years ago

Okay. I used "isoseq3 cluster" directly on the ccs bam, it still failed.

thanks!