Amplicon sorter is a tool for reference-free sorting of ONT-sequenced amplicons based on their similarity in sequence and length, and for building solid consensus sequences. The limit for separating closely related species within a sample is currently around 95-96% similarity.
For more detailed explanation, please read Amplicon_sorter_manual.pdf.
2 versions:
Requirements:
edlib:
pip: python3 -m pip install edlib
or conda: conda install bioconda::python-edlib
(for Win64: pip: python -m pip install edlib
or conda: conda install conda-forge::edlib)
biopython:
pip: pip install biopython
or conda: conda install biopython
or in Linux: sudo apt-get install python3-biopython
matplotlib:
pip: pip install matplotlib
or conda: conda install matplotlib
or in Linux: sudo apt-get install python3-matplotlib
Vierstraete, A. R., & Braeckman, B. P. (2022). Amplicon_sorter: A tool for reference-free amplicon sorting based on sequence similarity and for building consensus sequences. Ecology and Evolution, 12, e8603. https://doi.org/10.1002/ece3.8603
GNU GPL 3.0
amplicon sequencing, MinION, Oxford Nanopore Technologies, consensus, reference free, biodiversity, DNA barcoding, metabarcoding, metagenetics, PCR, sorting
-i, --input
: Input file in fastq or fasta format. A folder can also be given as input; it will be scanned for .fasta or .fastq files to process. Make sure the input files have the extension .fasta or .fastq, because the script replaces the extension in parts of its processing.
-o, --outputfolder
: Save the results in the specified outputfolder. Default = same folder as the inputfile in a subfolder with the name of the input file.
-min, --minlength
: Minimum read length to process. Default=300
-max, --maxlength
: Maximum read length to process. Default=No limit
-maxr, --maxreads
: Maximum number of reads to process. Default=10000
-ar, --allreads
: Use all reads from the inputfile between the length limits. This argument is still limited by --maxreads, to keep a hard limit for large files.
-np, --nprocesses
: Number of processors to use. Default=1
-sfq, --save_fastq
: Save the results also in fastq files (fastq files will not contain the consensus sequence)
-ra, --random
: Takes random reads from the inputfile. The script does NOT compare all sequences with each other; it compares batches of 1000 reads with each other. With this option you can sample reads several times, so that they are compared with other reads in other batches. It is therefore possible to have an inputfile with 10000 reads and randomly sample 20000 reads from it: the script will then run 20 batches of 1000 reads. This increases the chance of finding reads with high similarity when the sample contains many different amplicons. There is no need to do this for samples with 1 or 2 amplicons.
-aln, --alignment
: option to save the alignment that is used to create the consensus (max 200 reads, fasta format). Can be interesting to check how the consensus is created.
-amb, --ambiguous
: option to save the consensus with ambiguous nucleotides, e.g. to find SNP positions (this is still a bit experimental; errors sometimes occur at the very beginning and end of the consensus).
-a, --all
: Compare all selected reads with each other. Only advised for a small number of reads (< 10000) because it is time-consuming. (By default, the script compares batches of 1000 reads with each other.)
-ldc, --length_diff_consensus
: Length difference (in %) allowed between consensuses to COMBINE groups based on the consensus sequence (value between 0 and 200). Default=8.0. This can be interesting if you have amplicons of different lengths, where the shorter ones are nested sequences of the longer ones and you want to combine them in one group.
-sg, --similar_genes
: Similarity to sort genes in groups (value between 50 and 100). Default=80.0
-ssg, --similar_species_groups
: Similarity to CREATE species groups (value between 50 and 100). Default=Estimate
-ss, --similar_species
: Similarity to ADD sequences to a species group (value between 50 and 100). Default=85.0
-sc, --similar_consensus
: Similarity to COMBINE groups based on the consensus sequence (value between 50 and 100). Default=96.0
-ho, --histogram_only
: Only makes a read length histogram. Can be interesting to see what the minlength and maxlength setting should be.
-mac, --macOS
: Option to try if amplicon_sorter crashes on a Mac with an M1 processor (I have not received confirmation from users whether this works).
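To make the similarity thresholds in the option list more concrete, here is a minimal sketch of how a percent similarity between two reads could be computed and compared against the default cut-offs. It is an illustration under assumptions, not the script's own code: Amplicon_sorter relies on edlib for alignment, whereas this sketch uses a plain Python edit distance so that it runs without extra packages. The normalisation by the length of the longer read and the function names `edit_distance` and `percent_similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (a stand-in for an edlib alignment)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def percent_similarity(a: str, b: str) -> float:
    """Assumed definition: 100 * (1 - distance / length of the longer read)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


# Default values quoted in the option list.
SIMILAR_GENES = 80.0       # -sg
SIMILAR_SPECIES = 85.0     # -ss
SIMILAR_CONSENSUS = 96.0   # -sc

read_1 = "ACGTACGTACGTACGTACGT"
read_2 = "ACGTACGTACGAACGTACGT"   # one substitution in 20 bases
similarity = percent_similarity(read_1, read_2)
print(f"{similarity:.1f}% similar")
print("passes -sg default:", similarity >= SIMILAR_GENES)
print("passes -sc default:", similarity >= SIMILAR_CONSENSUS)
```

With one mismatch in 20 bases, the two toy reads pass the gene-level threshold but fall below the default consensus threshold, which is the kind of boundary case the 95-96% separation limit refers to.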
Filter your inputfile for reads >= Q12 with NanoFilt (https://github.com/wdecoster/nanofilt) or other quality filtering software. Use that Q12 inputfile for Amplicon_sorter. (Lower quality reads can be used, but this will result in a longer processing time and a lower percentage of reads that will be assigned to a species.)
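For readers who want to see what a ">= Q12" filter means in practice, the following is a minimal sketch of a mean-quality filter. It is not a replacement for NanoFilt. It assumes Phred+33 encoded, 4-line FASTQ records and averages the per-base error probabilities before converting back to a Q score, which is one common convention; the function names `mean_quality` and `filter_fastq` are hypothetical.

```python
import math


def mean_quality(quality_string: str) -> float:
    """Mean Phred quality of a read, averaged over error probabilities."""
    probabilities = [10 ** (-(ord(ch) - 33) / 10) for ch in quality_string]
    mean_probability = sum(probabilities) / len(probabilities)
    return -10 * math.log10(mean_probability)


def filter_fastq(lines, min_quality=12.0):
    """Yield 4-line FASTQ records whose mean quality is >= min_quality."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Skip records with an empty quality line.
            if record[3] and mean_quality(record[3]) >= min_quality:
                yield record
            record = []


# Toy input: two made-up reads, one well above and one below Q12.
example = [
    "@read_high", "ACGT", "+", "5555",   # '5' encodes Q20
    "@read_low", "ACGT", "+", "&&&&",    # '&' encodes Q5
]
kept = [record[0] for record in filter_fastq(example)]
print(kept)
```

To apply this to a real file, the lines of an opened FASTQ file can be passed to `filter_fastq` in place of the `example` list.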
Copy the Amplicon_sorter.py script into the same folder as your inputfile.
Process several files in inputfolder:
python3 amplicon_sorter.py -i infolder -min 650 -max 1200 -ar -maxr 100000 -np 8
: Process all files in 'infolder', keeping reads with a length between 650 and 1200 bp; use all available reads up to a maximum of 100000, and run on 8 cores. The results will be saved in 'infolder', in subfolders with the same names as the inputfiles.
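The folder behaviour described above (scan for .fasta or .fastq files, save each result in a subfolder named after the inputfile) can be sketched as follows. This is only an illustration of the naming rule, not the script's own code; the function name `plan_outputs` and the example file names other than BC103.fasta are hypothetical.

```python
from pathlib import Path
import tempfile


def plan_outputs(input_path, output_folder=None):
    """Map each .fasta/.fastq input file to the subfolder its results go to."""
    input_path = Path(input_path)
    if input_path.is_dir():
        files = sorted(p for p in input_path.iterdir()
                       if p.suffix in {".fasta", ".fastq"})
    else:
        files = [input_path]
    plan = {}
    for file in files:
        # Default: next to the inputfile; otherwise inside the given outputfolder.
        base = Path(output_folder) if output_folder else file.parent
        plan[file] = base / file.stem   # subfolder named after the inputfile
    return plan


with tempfile.TemporaryDirectory() as folder:
    for name in ("BC103.fasta", "BC104.fastq", "notes.txt"):
        (Path(folder) / name).touch()
    plan = plan_outputs(folder)
    for source, target in plan.items():
        print(source.name, "->", target.name)
```

Files with other extensions (here notes.txt) are ignored, which is why the inputfiles need a .fasta or .fastq extension.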
Produce a read length histogram of your inputfile:
python3 amplicon_sorter.py -i infile.fastq -o outputfolder -min 650 -max 750 -ho
: Produce the read length histogram of infile.fastq in the folder outputfolder. This shows you the number of reads between 650 and 750 bp.
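As a rough idea of the information a read length histogram provides when choosing -min and -max, here is a small text-only sketch. The script itself draws the histogram with matplotlib; this version only counts. The bin size of 50 bp, the inclusive length bounds, the made-up read lengths and the function names are all assumptions for illustration.

```python
from collections import Counter


def length_histogram(read_lengths, bin_size=50):
    """Count reads per length bin; keys are the lower edge of each bin."""
    return Counter((length // bin_size) * bin_size for length in read_lengths)


def count_in_range(read_lengths, min_length, max_length):
    """Number of reads with min_length <= length <= max_length."""
    return sum(min_length <= length <= max_length for length in read_lengths)


lengths = [420, 655, 690, 702, 710, 748, 751, 1190]   # made-up read lengths
for lower_edge, count in sorted(length_histogram(lengths).items()):
    print(f"{lower_edge:>5}-{lower_edge + 49:<5} {'#' * count}")
print("reads between 650 and 750 bp:", count_in_range(lengths, 650, 750))
```

A peak in such a histogram suggests where the amplicon of interest lies, and the -min and -max values can then be set slightly outside that peak.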
Sample with one species amplicon of 750 bp:
python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 700 -max 800 -maxr 1000
: Process infile.fastq with default settings, save in the folder outputfolder, run on 8 cores, minimum read length = 700, maximum read length = 800, use 1000 reads. This will sample the first 1000 reads between 700 and 800 bp of the inputfile. If you add the -ra (random) option to the command line, it will sample 1000 random reads between 700 and 800 bp.
Sample with 2 species: an amplicon of 700 bp and one of 1200 bp:
python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 650 -max 1250 -maxr 2000
Metagenetic sample with several amplicons between 600 and 3000 bp, unknown number of species, 30000 reads in the inputfile:
python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 550 -max 3050 -maxr 30000
Metagenetic sample with several amplicons between 600 and 3000 bp, unknown number of species, 30000 reads in the inputfile, one low abundant species (< 2% reads):
python3 amplicon_sorter.py -i infile.fastq -o outputfolder -np 8 -min 550 -max 3050 -ra -maxr 600000
By randomly sampling 20x the number of reads in the inputfile (20 x 30000 = 600000), it becomes possible to find low abundant species.
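The arithmetic behind this oversampling can be sketched as below. The sketch assumes that random reads are drawn with replacement (so a read can appear in several batches) and uses the batch size of 1000 reads mentioned for the --random option; the function name `oversample` and the read identifiers are hypothetical, and the real script may draw its samples differently.

```python
import random


def oversample(read_ids, factor, batch_size=1000, seed=0):
    """Sample factor * len(read_ids) reads with replacement, split into batches."""
    rng = random.Random(seed)
    total = factor * len(read_ids)
    sampled = rng.choices(read_ids, k=total)
    return [sampled[i:i + batch_size] for i in range(0, total, batch_size)]


reads = [f"read_{i}" for i in range(30000)]   # 30000 reads in the inputfile
batches = oversample(reads, factor=20)
print("value for -maxr:", sum(len(batch) for batch in batches))
print("number of batches:", len(batches))
```

Because each read ends up in about 20 different batches on average, reads from a rare amplicon get many more chances to be compared with each other than in a single pass over the file.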
Guppy v5.xx offers High Accuracy (HAC) and Super Accuracy (SupHAC) basecalling options, and sequencing is possible on R9.4.1 and R10 flow cells.
If you are working with species that are more than 95-96% similar, it is important to change or fine-tune some settings of Amplicon_sorter:
--similar_species_groups
: this is used to create species groups. The script looks for the highest similarities between species and uses those to create species groups. The better the basecaller or flow cell, the higher this value can be.
--similar_consensus
: this parameter is used to merge species groups if their consensus sequences are more than 96% (default for HAC) similar. When you increase this value, you will get more groups from the same species that are not merged. When you decrease this value, closely related species may be merged into one group. For SupHAC and/or R10 data, this value can be increased to 98%.
2024/10/16:
2024/10/13:
2024/10/07:
2024/02/20:
2023/06/19:
-aln, --alignment to save the alignment used to create the consensus.
-amb, --ambiguous to save the consensus with ambiguous nucleotides, e.g. to find SNP positions (this is still a bit experimental).
-so, --species_only command line option and better cleanup of temporary files.
2023/03/24:
-i, --input possibilities. A file or folder can be input. When it is a file, only that file will be processed. When it is a folder, it will scan the folder for .fasta or .fastq files and process them all. Keep in mind that all files in a folder will be processed with the same options.
-o, --outputfolder option. By default it will save the data in the same folder as the inputfolder, in a subfolder that has the same name as the inputfile. (Results from BC103.fasta will automatically be saved in the folder BC103.) This is done for all files in an input folder that are processed. When another outputfolder is given as option, the results will be saved in a subfolder in that folder with the name of the inputfile.
-h, --help display (Thanks russellsmithies for noticing).
2023/03/12:
-ar, --allreads. This option is still limited by -maxr, --maxreads to have a hard limit.
-ldc, --length_diff_consensus: Length difference (in %) allowed between consensuses to COMBINE groups based on the consensus sequence. This can be interesting if you have amplicons of different lengths, where the shorter ones are nested sequences of the longer ones and you want to combine them in one group.
2022/03/28:
2021/12/24: (version of the publication of March 2022, https://doi.org/10.1002/ece3.8603)
2021/12/19:
2021/12/01:
-ho, --histogram_only is now optional, no longer default.
2021/11/16:
--similar_species_groups: the program estimates the value from the dataset instead of using the default value of 0.93.
2021/09/21:
2021/09/11:
--similar_consensus
--all
2021/08/08:
--similar_species_groups from 0.92 to 0.93
2021/07/13:
2021/05/28:
2021/05/19:
2021/05/13:
2021/05/06:
2020/5/20:
--maxreads.
2020/5/6:
(-ho --histogram_only) option.
2020/4/27:
2020/4/17:
(--histogram_only).
(--similar_species_groups).
(--species_only) to play with the --similar_species and --similar_species_groups parameters without having to start all over.
2020/4/3:
(-o --outputfolder).
2020/3/12:
(--random).
2020/3/5:
(--save_fastq).