avierstr / amplicon_sorter

Sorts amplicons from Nanopore sequencing data based on similarity
31 stars 7 forks source link

amplicon_sorter

Amplicon sorter is a tool for reference-free sorting of ONT sequenced amplicons based on their similarity in sequence and length and for building solid consensus sequences. The limit for separating closely related species within a sample is currently around 95 - 96%.

For more detailed explanation, please read Amplicon_sorter_manual.pdf.

(Only works on Linux because it uses multiprocessing)

Requirements:

Citation:

Vierstraete, A. R., & Braeckman, B. P. (2022). Amplicon_sorter: A tool for reference-free amplicon sorting based on sequence similarity and for building consensus sequences. Ecology and Evolution, 12, e8603. https://doi.org/10.1002/ece3.8603

Keywords

amplicon sequencing, MinION, Oxford Nanopore Technologies, consensus, reference free, biodiversity, DNA barcoding, metabarcoding, metagenetics, PCR, sorting

Options:

-i, --input: Input file in fastq or fasta format. Also a folder can be given as input and will be scanned for .fasta or .fastq files to process. Make sure the input file(s) is (are) named as .fasta or .fastq because it replaces the extension in parts of the script.

-o, --outputfolder: Save the results in the specified outputfolder. Default = same folder as the inputfile in a subfolder with the name of the input file.

-min, --minlength: Minimum readlenght to process. Default=300

-max, --maxlength: Maximum readlenght to process. Default=No limit

-maxr, --maxreads: Maximum number of reads to process. Default=10000

-ar, --allreads: Use all reads from the inputfile between length limits. This argument is still limited with --maxreads to have a hard limit for large files.

-np, --nprocesses: Number of processors to use. Default=1

-sfq, --save_fastq: Save the results also in fastq files (fastq files will not contain the consensus sequence)

-ra, --random: Takes random reads from the inputfile. The script does NOT compare al sequences with each other, it compares batches of 1.000 with each other. You can use this option and sample reads several times and compare them with other reads in other batches. So it is possible to have an inputfile with 10.000 reads and sample random 20.000 reads from that inputfile. The script will run 20 batches of 1.000 reads. This way, the chance to find more reads with high similarity is increasing when there are a lot of different amplicons in the sample. No need to do that with samples with 1 or 2 amplicons.

-aln, --alignment: option to save the alignment that is used to create the consensus (max 200 reads, fasta format). Can be interesting to check how the consensus is created.

-amb, --ambiguous: option to save the consensus with ambiguous nucleotides, e. g. to find SNP positions (this is still a bit experimental, sometimes errors at the very beginning and end of the consensus).

Less important options:

-a, --all: Compare all selected reads with each other. Only advised for a small number of reads (< 10000) because it is time-consuming. (In contrast with the default settings where it compares batches of 1.000 with each other)

-ldc, --length_diff_consensus: Length difference (in %) allowed between consensuses to COMBINE groups based on the consensus sequence (value between 0 and 200). Default=8.0. This can be interesting if you have amplicons of different length, the shorter ones are nested sequence of the longer ones and you want to combine those in one group.

-sg, --similar_genes: Similarity to sort genes in groups (value between 50 and 100). Default=80.0

-ssg, --similar_species_groups: Similarity to CREATE species groups (value between 50 and 100). Default=Estimate

-ss, --similar_species: Similarity to ADD sequences to a species group (value between 50 and 100). Default=85.0

-sc, --similar_consensus: Similarity to COMBINE groups based on the consensus sequence (value between 50 and 100). Default=96.0

-ho, --histogram_only: Only makes a read length histogram. Can be interesting to see what the minlength and maxlength setting should be.

-so, --species_only: Only creates species groups and sort to species level. This can only be done if the whole script has run once without this option. This is to save time if you play with the --similar_species and/or --similar_species_groups parameters. It is using the same --maxreads, --minlength, --maxlength data that is produced in the first part of the script, so those 3 parameters are ignored here.

How it works (in short):

  1. The script reads the inputfile and can optionally create a read length histogram of all the reads in the file.
  2. It starts processing a selection of the reads (based on minlength, maxlenght, maxreads). It saves the result in .group files that contain reads of the same gene (e.g. group_1 with 18S reads, group_2 with COI reads, group_3 with ITS reads).
  3. It processes the group files to sort out the genes to species or genus level and saves this to different files. (e.g. file_1_1.fasta is 18S from species1, file1_2.fasta is 18S from species2, ...)
  4. Each outputfile contains at the end the consensus file of the sequences in the file.
  5. Files are produced per group with all consensus sequences for that group. A file is produced with all consensus sequences of all groups.
  6. Files are produced with 'unique' sequences (script does not find a group where it belongs to according to the settings)

amplicon_sorter

Workflow:

Filter your inputfile for reads >= Q12 with NanoFilt (https://github.com/wdecoster/nanofilt) or other quality filtering software. Use that Q12 inputfile for Amplicon_sorter. (Lower quality reads can be used but will result in longer processing time and a lower percentage of reads that will assigned to a species.

Copy the Amplicon_sorter.py script in the same folder as your inputfile.

Command examples:

Process several files in inputfolder:

python3 amplicon_sorter.py -i infolder -min 650 -max 1200 -ar -maxr 100000 -np 8:

Process all files in 'infolder' with length between 650 and 1200 bp, use all reads available, with a maximum of 100000 reads if more are available, process on 8 cores. The result will be saved in the 'infolder' in subfolders with the same name as the inputfiles.

Produce a read length histogram of your inputfile:

python3 amplicon_sorter.py -i infile.fastq –o outputfolder -min 650 -max 750 -ho:

produce the readlength histogram of infile.fastq in folder outputfolder. This gives you the information on the number of reads between 650 and 750 bp.

Sample with one species amplicon of 750 bp:

python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 700 -max 800 -maxr 1000

process infile.fastq with default settings, save in folder outputfolder, run on 8 cores, minimum length of reads = 700, max length of reads = 800, use 1000 reads. This will sample the first 1000 reads between 700 and 800 bp of the inputfile. If you add the -ra (random) option to the command line, it will sample 1000 random reads between 700 and 800 bp.

Sample with 2 species: an amplicon of 700 bp and one of 1200 bp:

python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 650 -max 1250 -maxr 2000

Metagenetic sample with several amplicons between 600 and 3000 bp, unknown number of species, 30000 reads in the inputfile:

python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 550 -max 3050 -maxr 30000

Metagenetic sample with several amplicons between 600 and 3000 bp, unknown number of species, 30000 reads in the inputfile, one low abundant species (< 2% reads):

python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 550 -max 3050 -ra -maxr 600000

By random sampling 20x the maximum number of reads, it is possible to find low abundant species.

Parameters depending on the basecaller and flow cell settings:

Guppy v5.xx has a High Accuracy (HAC) and Super Accuracy (SupHAC) option to do the basecalling and sequencing is possible on a 9.4.1 and R10 type of flow cell.

If you are working with species that are more than 95 – 96% similar, it is important to change or finetune some settings of Amplicon_sorter:

Todo:

Release notes:

2024/02/20:

2023/06/19:

2023/03/24:

2023/03/12:

2022/03/28:

2021/12/24: (version of the publication of March 2022 (https://doi.org/10.1002/ece3.8603)

2021/12/19:

2021/12/01:

2021/11/16:

2021/09/21:

2021/09/11:

2021/08/08:

2021/07/13:

2021/05/28:

2021/05/19:

2021/05/13:

2021/05/06:

2020/5/20:

2020/5/6:

2020/4/27:

2020/4/17:

2020/4/3:

2020/3/12:

2020/3/5: