bonsai-team / Porechop_ABI

Adapter trimmer for Oxford Nanopore reads using ab initio method
GNU General Public License v3.0
33 stars 3 forks source link

will porechop_abi perfrom unnecessary trimming even in absence of Adapters? #10

Closed Rohit-Satyam closed 11 months ago

Rohit-Satyam commented 11 months ago

Hi @rchikhi @chadisaad

I am planning to use Porechop_abi in building an automated pipeline. However, I have a concern to share. Does Porechop_abi performs unnecessary trimming even in absence of adapters. I am asking this because I ran RabitQCPlus on my fastq files and it reported no adapter but then when I run porechop_abi using -abi argument, it finds some adapters ab-initio and removes a lot of reads. Now, I was thinking porechop_abi like fastp which performs trimming only if adapter presence is observed. I am still new with handling of MinIon data and I don't know a reliable method to detect the presence of adapters.

qbonenfant commented 11 months ago

Hi @Rohit-Satyam, I am the main developper of Porechop_ABI, @rchikhi and @chadisaad are not working on this project (afaik). If you want to contact someone involved in the project, and currently working in the Bonsai team, i invite you to check the corresponding author of our paper.

Short answer: just like the original Porechop, Porechop_abi should not trimm your dataset if no adapters are present. If -abi find something, you may want to align it against the reference genome of the sequenced organism.

But as always, it is not that simple.

Long Answer:

RabbitQCplus: From what I read from the supplementary materials of the RabbitQCplus paper, and line 185 in main.cpp on their master branch, I think the trimming part is implemented for NGS datasets only. fprintf(stderr, "WARNING : the TGS module does not support trimming and will not produce output files, so ignore the -o and -O parameter.\n");

This may explain why you do not see any adapter output. Even if they did, the current adapter strategy they use may not be applicable for long read anyway. They use a simple k-mer frequency based seed-and-extend strategy, which may not work very well on noisy reads such as ONT.

Our ABI core module uses a similar method, but with more complex (thus slower) algorithms, data structures and post processing ( 01*0 seeds, mixed assembly methods, frequency-based termination detection, consensus over multiple run, bi-directionnal debruijn graphs...). Since RabbitQCplus focus seems to be on speed and efficiency, this may not be of interest for the authors.

On a side note, I am not convinced fastp really can trimm ONT adapters either, and seeing the number of opened issues referring to adapters, i have doubts about it being reliable for this task.

Excessive Trimming Despite all of our efforts, there is one case where Porechop_ABI may want to trimm sequence with no adapters: over represented sequences. This may append if your dataset contains a lot of highly expressed RNA (ex: ribsomal RNA, constituve proteins such as actine, tubuline, or plant storage proteins like zein_alpha, etc...), or very short genomes (virus, plasmids) that can be fully covered by a single read.

What should you do ? In the case of Porechop_ABI, i would suggest you align the inferred adapter against the reference genome of the sequenced organism, or the NR database of the NCBI. It may help you to check the quality of your adapter. A good example of a (to be confirmed) great case of over-represented sequence interferring with the adapter end detection can be seen in my reply on issue #9.

Is there a reliable method for trimming ? I have been out of the academic world for almost a year now, and may be wrong about this: To this day, I am not sure there is a 100% reliable method to detect the presence of adapter sequence in ONT dataset, at least not until ONT will pubicly release all adapter sequences. Despite their effort to integrate trimming in their basecalling pipeline, ONT let the trimming part to an RNN, which is just a blackbox for end users. Dedicated QC and post processing tools are still required to ensure good data quality.

One of the tool i saw the most in ONT read processing pipelines is Porechop, on which we based our tool. Despite its quirk, the fact it was never published, and is even unsupported since 2018, Porechop is still in use today.

I won't make an exhaustive list of ONT trimming tools, but i would suggest picking one that match the goal you fixed for your pipeline (data quality, speed, ...), as long as you are transparent on what pre-processing you performed in your publications. In the end, this is what really matters.

Rohit-Satyam commented 11 months ago

Apologies regarding confusion about the developer. I think I was in haste of raising a query because I observed the following:

We have viral ONT data for Foot and Mouth Disease Virus and I observed the following in the logs of my sample:

Rebuild adapters:

Start
Consensus_1_start_(100.0%)
CCTTGTACTTCGTTCGGTTATCAGATTTGGGTGTTTAACCCCTCATCTTGTGAAGTTGTTTCGGGT

End
Consensus_1_end_(100.0%)
ACCGGCGTCTCTTTGAACCCTTTCAGGGTCTCTTTGAGATTCAAGCTACAGATCACTTTACCTGCGTTGGGTGAACGCCGTGTGCGGTGACGCATAATCCCTC

Building consensus adapter objects
The inference of adapters sequence is done.

and

                                        Best               
                                        read       Best    
                                        start      read end
  Set                                   %ID        %ID     
  SQK-NSK007                                86.2       79.2
  Rapid                                    100.0        0.0
  RBK004_upstream                           89.7        0.0
  SQK-MAP006                                75.9       77.3
  SQK-MAP006 short                          84.0       80.0
  PCR adapters 1                            81.8       78.3
  PCR adapters 2                            81.8       77.3
  PCR adapters 3                            78.3       76.9
  1D^2 part 1                               74.2       75.0
  1D^2 part 2                               79.4       73.3
  cDNA SSP                                  68.2       71.8
  Barcode 1 (reverse)                       76.0       77.8
  Barcode 2 (reverse)                       76.9       75.0
  Barcode 3 (reverse)                       75.0       75.0
  Barcode 4 (reverse)                       75.0       76.0
  Barcode 5 (reverse)                       76.0       78.6
  Barcode 6 (reverse)                       76.0       79.2
  Barcode 7 (reverse)                       75.0       76.9
  Barcode 8 (reverse)                       80.0       80.0
  Barcode 9 (reverse)                       75.0       76.9
  Barcode 10 (reverse)                      74.1       78.6
  Barcode 11 (reverse)                      80.0       76.0
  Barcode 12 (reverse)                      77.8       80.0
  Barcode 1 (forward)                       76.9       77.8
  Barcode 2 (forward)                       79.2       77.8
  Barcode 3 (forward)                       76.0       79.2
  Barcode 4 (forward)                       83.3       80.0
  Barcode 5 (forward)                       77.8       75.0
  Barcode 6 (forward)                       75.0       76.9
  Barcode 7 (forward)                       79.2       76.0
  Barcode 8 (forward)                       76.0       76.0
  Barcode 9 (forward)                       79.2       77.8
  Barcode 10 (forward)                      76.0       76.9
  Barcode 11 (forward)                      79.2       80.0
  Barcode 12 (forward)                      76.0       80.8
  Barcode 13 (forward)                      79.2       77.8
  Barcode 14 (forward)                      73.1       76.9
  Barcode 15 (forward)                      73.1       79.2
  Barcode 16 (forward)                      79.2       83.3
  Barcode 17 (forward)                      76.0       76.0
  Barcode 18 (forward)                      75.0       77.8
  Barcode 19 (forward)                      76.9       80.0
  Barcode 20 (forward)                      76.0       76.0
  Barcode 21 (forward)                      80.8       80.0
  Barcode 22 (forward)                      79.2       79.2
  Barcode 23 (forward)                      76.0       83.3
  Barcode 24 (forward)                      75.0       80.0
  Barcode 25 (forward)                      76.9       79.2
  Barcode 26 (forward)                      76.0       81.5
  Barcode 27 (forward)                      75.0       75.0
  Barcode 28 (forward)                      76.0       76.0
  Barcode 29 (forward)                      73.1       76.0
  Barcode 30 (forward)                      76.0       81.5
  Barcode 31 (forward)                      76.0       81.5
  Barcode 32 (forward)                      73.1       80.0
  Barcode 33 (forward)                      76.0       76.0
  Barcode 34 (forward)                      80.0       79.2
  Barcode 35 (forward)                      73.1       77.8
  Barcode 36 (forward)                      76.0       76.9
  Barcode 37 (forward)                      76.9       80.0
  Barcode 38 (forward)                      76.9       80.0
  Barcode 39 (forward)                      75.0       80.8
  Barcode 40 (forward)                      73.1       80.0
  Barcode 41 (forward)                      77.8       80.0
  Barcode 42 (forward)                      76.0       80.8
  Barcode 43 (forward)                      80.0       76.0
  Barcode 44 (forward)                      77.8       76.9
  Barcode 45 (forward)                      76.9       77.8
  Barcode 46 (forward)                      79.2       77.8
  Barcode 47 (forward)                      79.2       76.0
  Barcode 48 (forward)                      76.9       76.0
  Barcode 49 (forward)                      76.0       76.9
  Barcode 50 (forward)                      75.0       81.5
  Barcode 51 (forward)                      76.9       77.8
  Barcode 52 (forward)                      77.8       80.0
  Barcode 53 (forward)                      80.0       80.8
  Barcode 54 (forward)                      76.9       80.8
  Barcode 55 (forward)                      80.0       84.0
  Barcode 56 (forward)                      77.8       76.9
  Barcode 57 (forward)                      76.0       76.0
  Barcode 58 (forward)                      74.1       76.9
  Barcode 59 (forward)                      79.2       80.8
  Barcode 60 (forward)                      74.1       80.0
  Barcode 61 (forward)                      76.9       79.2
  Barcode 62 (forward)                      80.0       76.0
  Barcode 63 (forward)                      79.2       77.8
  Barcode 64 (forward)                      76.9       76.9
  Barcode 65 (forward)                      77.8       77.8
  Barcode 66 (forward)                      76.0       80.0
  Barcode 67 (forward)                      76.9       76.9
  Barcode 68 (forward)                      73.1       77.8
  Barcode 69 (forward)                      76.9       77.8
  Barcode 70 (forward)                      74.1       76.9
  Barcode 71 (forward)                      75.0       79.2
  Barcode 72 (forward)                      76.9       80.0
  Barcode 73 (forward)                      75.0       76.9
  Barcode 74 (forward)                      76.9       80.0
  Barcode 75 (forward)                      73.1       76.0
  Barcode 76 (forward)                      75.0       79.2
  Barcode 77 (forward)                      76.0       76.0
  Barcode 78 (forward)                      75.0       76.0
  Barcode 79 (forward)                      76.0       80.0
  Barcode 80 (forward)                      76.9       76.0
  Barcode 81 (forward)                     100.0      100.0
  Barcode 82 (forward)                      75.0       80.0
  Barcode 83 (forward)                      75.0       79.2
  Barcode 84 (forward)                      83.3       79.2
  Barcode 85 (forward)                      76.9       76.9
  Barcode 86 (forward)                      73.1       77.8
  Barcode 87 (forward)                      77.8       77.8
  Barcode 88 (forward)                      76.0       75.0
  Barcode 89 (forward)                      79.2       79.2
  Barcode 90 (forward)                      76.9       76.0
  Barcode 91 (forward)                      75.0       77.8
  Barcode 92 (forward)                      76.0       76.9
  Barcode 93 (forward)                      76.9       76.7
  Barcode 94 (forward)                      76.9       76.9
  Barcode 95 (forward)                      75.0       80.8
  Barcode 96 (forward)                      75.0       76.9
  Consensus_1_start_(100.0%)_adapter        97.1        0.0
  Consensus_1_end_(100.0%)_adapter           0.0       98.1

Trimming adapters from read ends
               Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA
                        BC81: CCTCATCTTGTGAAGTTGTTTCGG
                    BC81_rev: CCGAAACAACTTCACAAGATGAGG
  Consensus_1_start_(100.0%): CCTTGTACTTCGTTCGGTTATCAGATTTGGGTGTTTAACCCCTCATCTTGTGAAGTTGTTTCGGGT

    Consensus_1_end_(100.0%): ACCGGCGTCTCTTTGAACCCTTTCAGGGTCTCTTTGAGATTCAAGCTACAGATCACTTTACCTGCGTTGGGTGAACGCCGTGTGCGGTGACGCATAATCCCTC


When I blast these two sequences which here are being considered as Consensus_1_start and Consensus_1_end adapters, I get a hit against FMD Virus for Consensus_1_end_ (Consensus_1_start was too small to get any hit)

image

I thought of using more advanced porechop_abi rather than Porechop for the same reason of it not being updated since 2018. Since I have handful of samples, let me see what impact does using porechop_abi will have on my assemblies compared to when I skip this step. Thanks.

qbonenfant commented 11 months ago

Hi, It seems your dataset will be quite challenging for our tool.

Porechop_ABI will struggle to separate adapters from biological sequences on viral datasets. Also, because ONT reads are not complete most of the time, it is normal that the inferred end adapter looks like FMD virus DNA, because it probably is. I won't enter in a lengthy explanation about why this happens here, but if you want more details, you can check our supplementary materials.

For your start adapter, despite not matching anything relevant on the NR database, it seems to align pretty well on the known adapters from Porechop database.

Here is a very crude alignment between Start_Consensus_1 and the Rapid Barcoding Kit adapters + barcode 81:

Start consensus
CCTTGTACTTCGTTCGGTTATCAGATTTGGGTGTTTAACC CCTCATCTTGTGAAGTTGTTTCGGGT
-AATGTACTTCGTTCAGTTA-CGGC-TTGGGTGTTTAACC CCTCATCTTGTGAAGTTGTTTCGG
RBK-004                                  Barcode 81

Again, Porechop_ABI do not like barcoded dataset. We adjusted our method to cope with mixed adapter contents, but this is still far from the ideal case.

Here it seems to have worked just fine, and what you have looks like a legitimate adapter. Still, I won't be surprised if you found a slightly different adapter on your next run.

If you want to remove only relevant sequences from this kind of datasets, i am afraid some kind of sanity check (manual or not) will be mandatory.

Rohit-Satyam commented 11 months ago

I think I will skip this step for now until I find a better way to deal with these residual barcodes.