bonsai-team / Porechop_ABI

Adapter trimmer for Oxford Nanopore reads using ab initio method
GNU General Public License v3.0

0% Killed #9

Closed nstepi62 closed 1 year ago

nstepi62 commented 1 year ago

I ran the workflow with the standard command:

```
porechop_abi --ab_initio -i fastq_pass/combined1.fastq.gz -o fastq_trimmed/trimmed.fastq
```

which resulted in:

```
Consensus_1start(100.0%)_adapter   98.0   0.0
Consensus_1end(100.0%)_adapter      0.0  98.7

Trimming adapters from read ends
  SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT
  SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT
  Consensus_1start(100.0%): TATGTAACCTACTTGGTTCAGTTACGTATTGCTAATGATACGGCGACCACCGAGATCTACACAAGCGACTACACTCTTTCCCTACACGACGCTCTT
  Consensus_1end(100.0%): TAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACTGTAGATCTCGTATGCCGTCTTCTGCTTAGCAATACGTA

6,260 / 14,028,394 (0.0%)
Killed
```

And there is no output file. Since there is neither a low-frequency nor a poor-consensus warning, I wonder what the issue might be. This was run on WSL2 (on Windows 11) in a fresh conda environment, on short-read Nanopore data (300 bp). Expected length of reads without adapters: 170-200 bp.

Any suggestions what the issue might be?

qbonenfant commented 1 year ago

Hi, there are several reasons a Linux process can be killed like this, but my guess would be that your system ran out of memory.

The ABI core module should not have any issue processing 14M reads of this size, since it is written in C++ with (mostly) efficient data structures. The crash seems to happen during the trimming phase, which uses the original Porechop code (in Python). The Python part of Porechop tries to store the whole dataset in RAM as text (strings). This is a known bottleneck in the Porechop_ABI workflow and is not easy to fix.
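For a rough sense of scale, a back-of-envelope estimate (all per-read figures here are my illustrative assumptions, not measured values) already lands near or above what a default WSL2 VM is allowed to use, since WSL2 caps the Linux VM's memory unless you raise the limit in `.wslconfig`:

```python
# Back-of-envelope RAM estimate for holding the whole dataset as
# Python strings. Per-read sizes below are rough assumptions.
reads = 14_028_394              # read count from the report above
read_len = 300                  # bp, per the issue description
bytes_per_read = 2 * read_len   # sequence + quality line, 1 byte/char
overhead = 100                  # assumed per-read Python object overhead
total_gb = reads * (bytes_per_read + overhead) / 1e9
print(f"~{total_gb:.1f} GB")    # → ~9.8 GB
```

That is sequence data alone, before alignment buffers and the rest of the process, so an out-of-memory kill is plausible.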

What are the solutions?

1) Rewrite the Python part of Porechop that handles the reads. Since it represents at least half of the codebase, this is not really practical.

2) Split your dataset into smaller chunks (2 to 5M reads should be fine), process them sequentially, then merge the trimmed files. This is the easiest way: it can be automated quite easily and should yield exactly the same results if you use the same adapters across all runs.

To use the inferred adapters you already have, you can save them in a custom adapter file (see https://github.com/bonsai-team/Porechop_ABI#custom-adapters), and they will be added to the existing database for the current run.
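The split-and-merge step itself is easy to script. A minimal Python sketch (the helper names `split_fastq`/`merge_fastq`, file names, and chunk size are illustrative, not part of Porechop_ABI):

```python
import gzip
from pathlib import Path

def split_fastq(path, reads_per_chunk, out_prefix):
    """Split a plain or gzipped FASTQ file into chunks of at most
    `reads_per_chunk` reads each. Returns the list of chunk paths.
    FASTQ records are assumed to be exactly 4 lines each."""
    opener = gzip.open if str(path).endswith(".gz") else open
    chunks, out = [], None
    with opener(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % (4 * reads_per_chunk) == 0:  # start a new chunk
                if out is not None:
                    out.close()
                chunk_path = Path(f"{out_prefix}_{len(chunks):03d}.fastq")
                chunks.append(chunk_path)
                out = open(chunk_path, "w")
            out.write(line)
    if out is not None:
        out.close()
    return chunks

def merge_fastq(chunk_paths, merged_path):
    """Concatenate trimmed chunks back into one FASTQ
    (plain concatenation is safe for FASTQ)."""
    with open(merged_path, "w") as out:
        for chunk in chunk_paths:
            with open(chunk) as fh:
                out.write(fh.read())
```

Each chunk can then be passed to porechop_abi in a loop, using the same custom adapter file for every run, and the trimmed outputs merged at the end.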

But... before you proceed, I would suggest you manually check your adapters:

After testing the inferred adapters you sent (blastn against the NR database), I saw a lot of hits on viral sequences like SARS-CoV-2, but only after base 30.

Here are the first hits of Consensus_1_start: [blast_hit screenshot]

If you tried to trim a viral dataset (or a very small genome), this would suggest part of the inferred adapter actually comes from the source organism. Porechop_ABI will struggle to separate adapters from the organism's DNA/RNA on highly covered sequences, since we use a frequency-based approach.

When working on such organisms, we recommend aligning the adapters against the reference genome (if available) to avoid excessive trimming.

nstepi62 commented 1 year ago

Thank you so much for your advice, I'm working on it :)

qbonenfant commented 1 year ago

You're welcome. If you need any help implementing the dataset splitting/merging, I can add a Python script to the tools folder for this purpose (it's quite empty right now).

nstepi62 commented 1 year ago

Hi again. Since you mentioned my question as an example of overrepresentation in issue #10, I wanted to clarify: the sequenced genome is human material, which aligned reasonably well with the human genome (tried with nanoEM, https://github.com/yos-sk/nanoEM) before trimming. To improve the alignment I wanted to get rid of the adapters. There is a combination of Illumina, PCR-amplification and Nanopore adapters, so I think the strong coronavirus match somehow corresponds to adapters derived from viral sequences. I don't have the sequences for all the adapters since they are still protected by the respective companies.

You were right that the 0% Killed issue was due to size; I initially used one large file with all FASTQ files merged together, and have now switched to a loop that trims the single FASTQ files one after the other. So thanks a lot for the offer to help with splitting/merging; at least for me, I've found a solution.

I had some issues with nanoEM, so now I'm trying Bismark. Still struggling, but I'm optimistic that it will align in the end; that's another story, though. Thank you again for your help!

qbonenfant commented 1 year ago

Glad to have helped you.

Regarding the combination of Illumina, PCR-amplification and Nanopore adapters: Porechop_ABI should not have too much trouble processing this kind of adapter stack. One of our test sets contains 3 adapters end to end and is trimmed just fine. As long as they are the same throughout the whole dataset, and your sampling window is large enough to capture them all, it should work.

Anyway, since your problem is solved, I will close this issue now. Have a nice day.