SebastianMeyer1989 / UltraPlexer

The UltraPlexer is a k-mer-based tool that assigns non-barcoded long reads generated with Oxford Nanopore technology to isolates by matching them to barcoded short reads generated with Illumina technology.
MIT License

UltraPlexer

Introduction

UltraPlexing is a highly effective method for multiplexed long-read sequencing in the context of hybrid microbial genome assembly. Ultraplexing removes the need for molecular barcodes and assigns each long read to the short-read assembly graph it is most compatible with. While maintaining excellent assembly quality, Ultraplexing enables at least a doubling of the maximum number of samples per flow cell on the Nanopore platform, and a reduction in reagent costs and hands-on-time by a factor of 2.

To apply the UltraPlexing approach, simply pool equal amounts of DNA from the samples you want to multiplex, generate long-read sequencing data, and use the UltraPlexing algorithm to demultiplex the data. Short-read sequencing data for the same samples need to be available at the time of analysis. If possible, optimize the DNA extraction and library preparation processes for read length, as the ability of the UltraPlexing algorithm to assign long reads to isolates improves with increasing read length.

A preprint with accuracy evaluations can be found at: https://www.biorxiv.org/content/10.1101/680827v2

Overview of the Ultraplexing approach


Long reads are generated in simple pooled sequencing runs. The Ultraplexing algorithm determines the most likely source genome for each long read by carrying out a comparison between the read and the de Bruijn graphs of the sequenced sample genomes, inferred from short-read data. Hybrid assembly of sample-specific long and short reads enables the recovery of complete bacterial genomes.

Program Requirements and Installation

The program was tested on the following operating systems:

The following programming languages and packages need to be installed:

Please modify UltraPlexer.pl so that it contains the correct path to your installation of Cortex (line 16). The algorithm expects to find Cortex binaries for k = 31 with 20, 40 and 60 colors.

Alternative to installing prerequisites:

We generated and tested a Docker image containing the OS "Ubuntu 18.04", the above-mentioned prerequisites and all data provided in this GitHub repository. Instead of installing all required programming languages and packages yourself, you should be able to run the UltraPlexer algorithm in a container created from this Docker image.

The Docker image can be found at: https://hub.docker.com/r/diltheygenomicslab/ultraplexer

Use the command docker pull diltheygenomicslab/ultraplexer:0.1.2 to get the current version.

The Dockerfile for this image can be found at: https://osf.io/4m9vh/

Running the UltraPlexer

Known issue: If the Ultraplexing run produces an error like

could not allocate hash table of size 209715200

Error: Giving up - unable to allocate memory for the hash table

and subsequently stops because it cannot find a certain file, please try adding the option "--allSamples_cortex_height 20" to the first UltraPlexer call. This error occurs if the algorithm cannot allocate the (estimated) amount of memory needed to store all calculated data. The option reduces the memory allocated by the UltraPlexer, which in most cases solves the problem.

1. Classify long-reads:

perl UltraPlexer.pl --prefix prefix1 --action classify --samples_file /path/to/samplefile/samplefile1.txt --longReads_FASTQ /path/to/longreads/longreads1.fastq

2. Generate human-readable classified output from classification file:

perl UltraPlexer.pl --prefix prefix1 --action generateCallFile --samples_file /path/to/samplefile/samplefile1.txt

3. Generate human-readable random output from classification file (as a comparison):

perl UltraPlexer.pl --prefix prefix1 --action generateCallFile --samples_file /path/to/samplefile/samplefile1.txt --classificationSource random
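The classify and generateCallFile calls above can be assembled by a small wrapper, which makes it easy to add the "--allSamples_cortex_height" option from the known issue when needed. The following is a minimal sketch only: the function name ultraplexer_commands is hypothetical and not part of the repository, it assumes that paths contain no whitespace, and it prints the commands instead of running them.

```shell
# Hypothetical helper (not part of UltraPlexer): print the classify and
# generateCallFile commands for one run. An optional fourth argument is passed
# on as --allSamples_cortex_height (see the known issue above).
# Assumption: prefix and paths contain no whitespace.
ultraplexer_commands() {
    run_prefix=$1
    samples=$2
    long_reads=$3
    height=${4:-}

    classify="perl UltraPlexer.pl --prefix $run_prefix --action classify --samples_file $samples --longReads_FASTQ $long_reads"
    if [ -n "$height" ]; then
        classify="$classify --allSamples_cortex_height $height"
    fi

    echo "$classify"
    echo "perl UltraPlexer.pl --prefix $run_prefix --action generateCallFile --samples_file $samples"
}

# Print both commands for a run that uses the reduced hash table height.
ultraplexer_commands prefix1 /path/to/samplefile/samplefile1.txt /path/to/longreads/longreads1.fastq 20
```

Printing the commands first lets you inspect them before execution; piping the output to sh would run them in order.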

Input

perl UltraPlexer.pl = The UltraPlexing algorithm.

--prefix prefix1 = Your chosen prefix for the UltraPlexer run.

--action classify / generateCallFile = The command to classify the long-reads (classify) or to generate an output file from the classified data (generateCallFile).

--samples_file /path/to/samplefile/samplefile1.txt = A tab-separated file containing the isolate ID, the path to the illumina_R1.fastq file, and the path to the illumina_R2.fastq file. One line per isolate.

Example:
Isolate_1     /Data/isolate_1_R1.fastq     /Data/isolate_1_R2.fastq
Isolate_17    /Data/isolate_17_R1.fastq    /Data/isolate_17_R2.fastq
MRSA_H4       /Data/MRSA_H4_R1.fastq       /Data/MRSA_H4_R2.fastq
Benjamin      /Data/Benjamin_R1.fastq      /Data/Benjamin_R2.fastq
…

--longReads_FASTQ /path/to/longreads/longreads1.fastq = A FASTQ file containing the long reads to be classified.

--classificationSource random = The command to generate a random assignment of long reads to isolates (useful for benchmarking).
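Since a malformed samples file is an easy mistake to make, it can help to check the file before starting a run. The sketch below is an illustration, not part of the repository: check_samples_file is a hypothetical helper that verifies that each non-empty line has exactly three tab-separated fields and that both FASTQ files exist. Whether UltraPlexer itself tolerates blank lines is not known; this helper simply skips them.

```shell
# Hypothetical helper (not part of UltraPlexer): sanity-check a samples file.
# Prints one message per problem and returns non-zero if any problem was found.
check_samples_file() {
    samples_file=$1
    rc=0
    line_no=0
    while IFS= read -r line || [ -n "$line" ]; do
        line_no=$((line_no + 1))
        if [ -z "$line" ]; then
            continue    # blank lines are skipped by this helper
        fi
        # Count tab-separated fields: isolate ID, R1 path, R2 path.
        n_fields=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')
        if [ "$n_fields" -ne 3 ]; then
            echo "line $line_no: expected 3 tab-separated fields, found $n_fields"
            rc=1
            continue
        fi
        r1=$(printf '%s\n' "$line" | cut -f2)
        r2=$(printf '%s\n' "$line" | cut -f3)
        for f in "$r1" "$r2"; do
            if [ ! -f "$f" ]; then
                echo "line $line_no: file not found: $f"
                rc=1
            fi
        done
    done < "$samples_file"
    return "$rc"
}
```

A typical use would be to call check_samples_file /path/to/samplefile/samplefile1.txt and start the classify step only if it returns successfully.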

Output

(shown for the example prefix mixed_bacteria_10x):

mixed_bacteria_10x.classification_k19.done = This flag file is produced when the UltraPlexer has finished running correctly.

mixed_bacteria_10x.classification_k19 = This file is produced by the classify command and contains intermediate read classification data.

mixed_bacteria_10x.classification_k19.called_kmers = This file is produced by the generateCallFile command and contains, for each read, the isolate it has been assigned to, and a quality metric.

Example:
Read_1       MRSA_H4      0.886902934417435
Read_332     Isolate_1    0.906668691485839
Read_336     Benjamin     0.895056000007794
Read_4100    Isolate_1    0.912532884787109
…

mixed_bacteria_10x.classification_k19.called_random = This file is produced when specifying the --classificationSource random option. It contains a random allocation of reads to isolates.
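The output files described above lend themselves to a quick inspection before the reads are split. The following sketch assumes the file naming shown above (suffix .classification_k19) and a calling table with whitespace-separated columns for read ID, isolate ID and quality metric. Both function names are hypothetical helpers and not part of the repository.

```shell
# Hypothetical helper: succeed if the flag file of a finished run exists.
classification_done() {
    [ -e "$1.classification_k19.done" ]
}

# Hypothetical helper: for each isolate, print the number of assigned reads and
# the mean of the third column (the quality metric), tab-separated and sorted
# by isolate ID. Lines with fewer than three fields (such as a trailing "…")
# are ignored.
summarize_calls() {
    LC_ALL=C awk '
        NF >= 3 { n[$2]++; sum[$2] += $3 }
        END {
            for (iso in n) {
                printf "%s\t%d\t%.4f\n", iso, n[iso], sum[iso] / n[iso]
            }
        }
    ' "$1" | LC_ALL=C sort
}
```

For the example prefix, classification_done mixed_bacteria_10x followed by summarize_calls mixed_bacteria_10x.classification_k19.called_kmers would show whether the run finished and how the reads are distributed over the isolates.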

Creating fastq-files for further hybrid assemblies

(exemplary for prefix mixed_bacteria_10x)

Call:

perl create_kmer_based_fastq_for_real_data.pl mixed_bacteria_10x.classification_k19.called_kmers path/to/longreads/longreads1.fastq mixed_bacteria_10x

Input:

perl create_kmer_based_fastq_for_real_data.pl = The script that produces fastq files from the calling table.

mixed_bacteria_10x.classification_k19.called_kmers = The calling table from the UltraPlexer run.

path/to/longreads/longreads1.fastq = The used long-read file.

mixed_bacteria_10x = Prefix for the run.

Output:

mixed_bacteria_10x-isolate1-predicted_reads.fastq = A fastq file named after the run (mixed_bacteria_10x) and the isolate ID (isolate1), ending with “predicted_reads.fastq”.
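To illustrate what this step does conceptually, the sketch below splits a long-read FASTQ file into one file per isolate according to a calling table. It is not the repository's create_kmer_based_fastq_for_real_data.pl and may differ from it in detail. It assumes four-line FASTQ records, that the read ID in the calling table equals the FASTQ header up to the first whitespace without the leading "@", and that the calling table is not empty. The function name split_fastq_by_call is hypothetical.

```shell
# Hypothetical sketch (NOT the repository's Perl script): write each called
# read to <prefix>-<isolate>-predicted_reads.fastq.
# Arguments: calling table, long-read FASTQ, output prefix.
# Reads that do not appear in the calling table are dropped; existing output
# files with the same names are overwritten.
split_fastq_by_call() {
    awk -v prefix="$3" '
        # First file: calling table (read ID, isolate ID, metric).
        NR == FNR {
            if (NF >= 2) { isolate[$1] = $2 }
            next
        }
        # Second file: every fourth line, starting at line 1, is a header.
        FNR % 4 == 1 {
            id = substr($1, 2)
            if (id in isolate) {
                out = prefix "-" isolate[id] "-predicted_reads.fastq"
            } else {
                out = ""
            }
        }
        # Copy all four lines of a record to the file chosen at its header.
        out != "" { print > out }
    ' "$1" "$2"
}
```

Called as split_fastq_by_call mixed_bacteria_10x.classification_k19.called_kmers path/to/longreads/longreads1.fastq mixed_bacteria_10x, this would produce files following the naming scheme described above.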

Example Run

In the following, we describe by example how to simulate reads, create a read pool, run the Ultraplexing algorithm and assemble the assigned reads, based on three random plasmids. The necessary data can be found in the folder “Example1”:

After downloading the folder "Example1", switch to it in a terminal and run the commands in the order described below (assuming that all requirements are met).

If you don't want to simulate data, or the simulation is not possible on your device, you can skip steps 1 and 2 (simulation and creation of the long-read pool) and use the long-read pool and other required data we provide. To do so, download the zip file "Example1_simulated_example_data.zip" from

https://uni-duesseldorf.sciebo.de/s/oHFl3FCArhPhHb5

and copy the contents of the unzipped folder "Example1_simulated_example_data" (Sim_Pipeline, example1_plasmid_ids_and_pathways.txt, example1_plasmid_read_pool.fastq and example1_plasmid_stats.txt) into your folder "Example1". Before continuing the pipeline from step 3 ("Ultraplexing: ..."), open "example1_plasmid_ids_and_pathways.txt" and change the absolute paths of the simulated short-read files (contained in the folder "Sim_Pipeline") to the paths matching your storage location.
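Instead of editing the paths by hand, the directory prefix can be replaced with a short script. The sketch below assumes that "example1_plasmid_ids_and_pathways.txt" follows the tab-separated samples file format (isolate ID, R1 path, R2 path), which is how it is used in step 3. The function name rewrite_sample_paths and the directory names in the usage note are hypothetical.

```shell
# Hypothetical helper: replace a leading directory prefix in columns 2 and 3
# of a tab-separated samples file and print the result to standard output.
# Arguments: old prefix, new prefix, samples file.
# Assumption: the prefixes contain no backslashes (awk -v interprets escapes).
rewrite_sample_paths() {
    awk -F'\t' -v OFS='\t' -v old="$1" -v new="$2" '
        {
            for (i = 2; i <= 3; i++) {
                # Only touch paths that start with the old prefix.
                if (index($i, old) == 1) {
                    $i = new substr($i, length(old) + 1)
                }
            }
            print
        }
    ' "$3"
}
```

For example, rewrite_sample_paths /old/location /home/user/Example1 example1_plasmid_ids_and_pathways.txt > fixed_paths.txt writes a corrected copy; after checking it, use it in place of the original file.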

1. Simulation (59s runtime, 1 CPU, <1 GB memory used)

Requirements:

Call:

perl simulation_pipeline.pl example1_list_of_plasmids.txt 8500 150

Important: Before running the script "simulation_pipeline.pl", the paths to pbsim, the pbsim qc-model and wgsim in the script (lines 10-12) need to be adjusted to fit your installations.

Input:

Important: Please keep these numerical parameters as they are for the example run, since the scripts are currently still hard-coded for this mean length and coverage.

Output:

2. Creating a shuffled long-read pool (1s runtime, 1 CPU, <1 GB memory used)

Requirements:

Call:

perl create_pool.pl example1_list_of_plasmids.txt 3 10000000 example1_plasmid

Input:

Output:

3. Ultraplexing: Classification of long reads (2m32s runtime, 1 CPU, <7 GB memory used)

Requirements:

Call:

perl UltraPlexer.pl --prefix example1_plasmid --action classify --samples_file example1_plasmid_ids_and_pathways.txt --longReads_FASTQ example1_plasmid_read_pool.fastq

Input:

Output:

4. Ultraplexing: Creating the assignment table (1s runtime, 1 CPU, <1 GB memory used)

Requirements:

Call:

perl UltraPlexer.pl --prefix example1_plasmid --action generateCallFile --samples_file example1_plasmid_ids_and_pathways.txt

Input:

Output:

5. Creating fastq files for each genome (1s runtime, 1 CPU, <1 GB memory used)

Requirements:

Call:

perl create_kmer_based_fastq_for_simulations.pl example1_plasmid.classification_k19.called_kmers example1_plasmid_read_pool.fastq example1_plasmid

Input:

Output:

6. Comparing true and predicted reads (1s runtime, 1 CPU, <1 GB memory used)

(shown for Plasmid1)

Requirements:

Call:

perl parse_calling_tbl.pl example1_plasmid.classification_k19.called_kmers

Input:

Output:

7. Hybrid assembly (1-2h runtime per assembly, 2 CPUs, <2 GB memory used)

(shown for the predicted reads of Plasmid1)

Requirements:

Call:

/gpfs/project/dilthey/software/Unicycler/unicycler-runner.py --spades_path /software/SPAdes/3.11.1/ivybridge/bin/spades.py --racon_path /gpfs/project/dilthey/software/racon/bin/racon --pilon_path /software/pilon/1.22/pilon-1.22.jar -t 2 -1 Sim_Pipeline/Plasmid1_l8500_c150/Plasmid1_l8500_c150-filtered_R1.fastq -2 Sim_Pipeline/Plasmid1_l8500_c150/Plasmid1_l8500_c150-filtered_R2.fastq -l example1_plasmid-Plasmid1-predicted_reads.fastq -o example1_plasmid-Plasmid1-predicted_reads_unicycler

Important: The paths to unicycler, spades, racon and pilon need to be replaced to fit your installations.
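When several isolates have to be assembled, the call above can be generated once per isolate. The following is a sketch under stated assumptions: it takes the short-read paths from the tab-separated samples file, expects the long reads in files named <prefix>-<isolate>-predicted_reads.fastq, and assumes an executable called unicycler on the PATH, so the --spades_path, --racon_path and --pilon_path options from the call above are left out. The function name unicycler_commands is hypothetical, and the commands are only printed, not run.

```shell
# Hypothetical helper: print one Unicycler command per isolate in the samples
# file. Arguments: samples file, run prefix, number of threads (default 2).
# Assumption: paths contain no whitespace.
unicycler_commands() {
    awk -F'\t' -v prefix="$2" -v threads="${3:-2}" '
        NF == 3 {
            long_reads = prefix "-" $1 "-predicted_reads.fastq"
            out_dir = prefix "-" $1 "-predicted_reads_unicycler"
            printf "unicycler -t %s -1 %s -2 %s -l %s -o %s\n", threads, $2, $3, long_reads, out_dir
        }
    ' "$1"
}
```

For the example run, unicycler_commands example1_plasmid_ids_and_pathways.txt example1_plasmid prints one command per plasmid, which can be reviewed and then executed one after another.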

Input:

Output:

8. Comparing assemblies of true and predicted reads using nucmer (1s runtime, 1 CPU, <1 GB memory used)

(shown for Plasmid1)

Requirements:

Calls:

1.

nucmer -p Plasmid1 example1_plasmid-Plasmid1-predicted_reads_unicycler/assembly.fasta example1_plasmid-Plasmid1-true_reads_unicycler/assembly.fasta

2.

mummerplot2 Plasmid1.delta --png -p Plasmid1

Input:

1.

Output:

1.

If you take a look at the resulting plots, you will see that the assemblies of the predicted reads align perfectly to the assemblies of the true reads.