UltraPlexing is a highly effective method for multiplexed long-read sequencing in the context of hybrid microbial genome assembly. Ultraplexing removes the need for molecular barcodes and assigns each long read to the short-read assembly graph it is most compatible with. While maintaining excellent assembly quality, Ultraplexing enables at least a doubling of the maximum number of samples per flow cell on the Nanopore platform, and a reduction in reagent costs and hands-on time by a factor of 2.
To apply the UltraPlexing approach, simply pool equal amounts of DNA from the samples you want to multiplex, generate long-read sequencing data, and use the UltraPlexing algorithm to demultiplex the data. UltraPlexing requires short-read sequencing data for the same samples to be available at the time of analysis. If possible, optimize the DNA extraction and library preparation processes for read length, as the ability of the UltraPlexing algorithm to assign long reads to isolates improves with increasing read length.
A preprint with accuracy evaluations can be found at: https://www.biorxiv.org/content/10.1101/680827v2
Long reads are generated in simple pooled sequencing runs. The Ultraplexing algorithm determines the most likely source genome for each long read by carrying out a comparison between the read and the de Bruijn graphs of the sequenced sample genomes, inferred from short-read data. Hybrid assembly of sample-specific long and short reads enables the recovery of complete bacterial genomes.
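As a rough intuition for the assignment step, the toy sketch below scores a read against a sample by the fraction of the read's distinct k-mers that also occur in the sample sequence. This is a deliberate simplification and not the UltraPlexer implementation: the real algorithm compares reads to de Bruijn graphs inferred from short-read data. The function name "shared_kmer_fraction", the example sequences and the choice of k = 3 are hypothetical, and reverse complements are ignored.

```shell
# shared_kmer_fraction READ SAMPLE_SEQ K
# Toy illustration only (not the UltraPlexer algorithm): prints the fraction
# of distinct k-mers of READ that also occur in SAMPLE_SEQ.
shared_kmer_fraction() {
  awk -v rd="$1" -v sample="$2" -v k="$3" 'BEGIN {
    # Collect all k-mers of the sample sequence.
    for (i = 1; i + k - 1 <= length(sample); i++)
      in_sample[substr(sample, i, k)] = 1
    # Count distinct read k-mers and how many of them the sample contains.
    for (i = 1; i + k - 1 <= length(rd); i++) {
      kmer = substr(rd, i, k)
      if (!(kmer in seen)) {
        seen[kmer] = 1
        total++
        if (kmer in in_sample) shared++
      }
    }
    printf "%.3f\n", (total ? shared / total : 0)
  }'
}

# Example: the read shares the 3-mers ACG and CGT with the sample.
shared_kmer_fraction ACGTAC ACGTTT 3
```

In this example, two of the read's four distinct 3-mers occur in the sample, so the call prints 0.500. Scoring a read against every sample in this way and picking the highest score mirrors the idea of choosing the most compatible source genome.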
The program was tested on the following operating systems:
Please modify UltraPlexer.pl so that it contains the correct path to your installation of Cortex (line 16). The algorithm expects to find Cortex binaries for k = 31 with 20, 40 and 60 colors.
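Before the first run, it can help to confirm that all three Cortex binaries are present. The sketch below assumes the binaries follow the naming pattern "cortex_var_31_c20", "cortex_var_31_c40" and "cortex_var_31_c60" and sit in a single directory; both the naming pattern and the helper name "check_cortex_binaries" are assumptions, so adjust them to match your Cortex build.

```shell
# check_cortex_binaries CORTEX_BIN_DIR
# Assumption: binaries are named cortex_var_31_c<colors>; adapt if yours differ.
# Prints one line per missing binary and returns non-zero if any is missing.
check_cortex_binaries() {
  missing=0
  for colors in 20 40 60; do
    binary="$1/cortex_var_31_c$colors"
    if [ ! -x "$binary" ]; then
      echo "missing or not executable: $binary"
      missing=1
    fi
  done
  return "$missing"
}

# Example (hypothetical path):
# check_cortex_binaries /path/to/cortex/bin
```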
We generated and tested a Docker image containing the OS "Ubuntu 18.04", the above-mentioned prerequisites and all data provided in this GitHub repository. Instead of installing all required programming languages and packages yourself, you should be able to run the UltraPlexer algorithm in a container created from this Docker image.
The docker image can be found at: https://hub.docker.com/r/diltheygenomicslab/ultraplexer
Use the command docker pull diltheygenomicslab/ultraplexer:0.1.2 to get the current version.
The Dockerfile for this image can be found at: https://osf.io/4m9vh/
Known issue: If the Ultraplexing run produces an error like
could not allocate hash table of size 209715200
Error: Giving up - unable to allocate memory for the hash table
and subsequently stops because it cannot find a certain file, please try adding the option "--allSamples_cortex_height 20" to the first UltraPlexer call. This error occurs when the algorithm cannot allocate the estimated amount of memory it needs to store all calculated data. The option reduces the memory allocated by the UltraPlexer, which solves the problem in most cases.
perl UltraPlexer.pl --prefix prefix1 --action classify --samples_file /path/to/samplefile/samplefile1.txt --longReads_FASTQ /path/to/longreads/longreads1.fastq
perl UltraPlexer.pl --prefix prefix1 --action generateCallFile --samples_file /path/to/samplefile/samplefile1.txt
perl UltraPlexer.pl --prefix prefix1 --action generateCallFile --samples_file /path/to/samplefile/samplefile1.txt --classificationSource random
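If the classify call fails with the hash table allocation error described under "Known issue", the same call can be repeated with the additional option. A minimal sketch as a shell function follows; the function name "classify_low_memory" and the environment variable "ULTRAPLEXER_PERL" are hypothetical conveniences (the latter only lets you substitute another interpreter or do a dry run with "echo"), while all UltraPlexer options are taken from this README.

```shell
# classify_low_memory PREFIX SAMPLES_FILE LONG_READS_FASTQ
# Runs the classify step with the reduced Cortex hash table height.
# ULTRAPLEXER_PERL is a hypothetical override; it defaults to "perl".
classify_low_memory() {
  "${ULTRAPLEXER_PERL:-perl}" UltraPlexer.pl \
    --prefix "$1" \
    --action classify \
    --samples_file "$2" \
    --longReads_FASTQ "$3" \
    --allSamples_cortex_height 20
}

# Example:
# classify_low_memory prefix1 /path/to/samplefile/samplefile1.txt /path/to/longreads/longreads1.fastq
```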
perl UltraPlexer.pl = The UltraPlexing algorithm.
--prefix prefix1 = Your chosen prefix for the UltraPlexer run.
--action classify / generateCallFile = The command to classify the long reads (classify) or to generate an output file from the classified data (generateCallFile).
--samples_file /path/to/samplefile/samplefile1.txt = A tab-separated file containing the isolate ID, the path to the illumina_R1.fastq file, and the path to the illumina_R2.fastq file. One line per isolate.
Isolate_1 /Data/isolate_1_R1.fastq /Data/isolate_1_R2.fastq
Isolate_17 /Data/isolate_17_R1.fastq /Data/isolate_17_R2.fastq
MRSA_H4 /Data/MRSA_H4_R1.fastq /Data/MRSA_H4_R2.fastq
Benjamin /Data/Benjamin_R1.fastq /Data/Benjamin_R2.fastq
…
--longReads_FASTQ /path/to/longreads/longreads1.fastq = A FASTQ file containing the long reads to be classified.
--classificationSource random = The command to generate a random assignment of long reads to isolates (useful for benchmarking).
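A malformed samples file is an easy mistake to make, so a quick check before classification can save a failed run. The sketch below verifies that each line has exactly three tab-separated fields and that both FASTQ paths are readable; the helper name "check_samples_file" is hypothetical and not part of the repository.

```shell
# check_samples_file SAMPLES_FILE
# Prints one message per problem; returns non-zero if any line is malformed
# or points to a FASTQ file that cannot be read.
check_samples_file() {
  status=0
  line_no=0
  tab=$(printf '\t')
  while IFS="$tab" read -r id r1 r2 extra || [ -n "$id" ]; do
    line_no=$((line_no + 1))
    # Expect exactly: isolate ID, R1 path, R2 path.
    if [ -z "$id" ] || [ -z "$r1" ] || [ -z "$r2" ] || [ -n "$extra" ]; then
      echo "line $line_no: expected isolate ID, R1 path and R2 path separated by tabs"
      status=1
      continue
    fi
    for fastq in "$r1" "$r2"; do
      if [ ! -r "$fastq" ]; then
        echo "line $line_no: cannot read $fastq"
        status=1
      fi
    done
  done < "$1"
  return "$status"
}

# Example:
# check_samples_file /path/to/samplefile/samplefile1.txt
```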
(exemplary for prefix mixed_bacteria_10x):
mixed_bacteria_10x.classification_k19.done = This flag file is produced when the UltraPlexer has finished running correctly.
mixed_bacteria_10x.classification_k19 = This file is produced by the classify command and contains intermediate read classification data.
mixed_bacteria_10x.classification_k19.called_kmers = This file is produced by the generateCallFile command and contains, for each read, the isolate it has been assigned to and a quality metric.
Read_1 MRSA_H4 0.886902934417435
Read_332 Isolate_1 0.906668691485839
Read_336 Benjamin 0.895056000007794
Read_4100 Isolate_1 0.912532884787109
…
mixed_bacteria_10x.classification_k19.called_random = This file is produced when specifying the --classificationSource random option. It contains a random allocation of reads to isolates.
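To get a quick overview of a call file, the reads per isolate and their mean quality metric can be tallied. The sketch below assumes the three whitespace-separated columns shown in the called_kmers example (read ID, isolate ID, quality metric) and isolate IDs without spaces; the helper name "summarize_calls" is hypothetical.

```shell
# summarize_calls CALL_FILE
# Prints one line per isolate: isolate ID, number of assigned reads,
# mean quality metric (tab-separated, sorted by isolate ID).
summarize_calls() {
  awk 'NF >= 3 { n[$2]++; q[$2] += $3 }
       END { for (s in n) printf "%s\t%d\t%.4f\n", s, n[s], q[s] / n[s] }' "$1" | sort
}

# Example:
# summarize_calls mixed_bacteria_10x.classification_k19.called_kmers
```

A strongly uneven read count across isolates, despite equal DNA input, may be worth a closer look before assembly.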
(exemplary for prefix mixed_bacteria_10x)
perl create_kmer_based_fastq_for_real_data.pl mixed_bacteria_10x.classification_k19.called_kmers path/to/longreads/longreads1.fastq mixed_bacteria_10x
perl create_kmer_based_fastq_for_real_data.pl = The script that produces fastq files from the calling table.
mixed_bacteria_10x.classification_k19.called_kmers = The calling table from the UltraPlexer run.
path/to/longreads/longreads1.fastq = The long-read FASTQ file that was classified.
mixed_bacteria_10x = Prefix for the run.
mixed_bacteria_10x-isolate1-predicted_reads.fastq = A fastq file named after the run (mixed_bacteria_10x) and the isolate ID (isolate1), ending with “predicted_reads.fastq”.
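After demultiplexing, counting the reads in each per-isolate FASTQ file gives a simple sanity check. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines); the helper name "count_fastq_reads" is hypothetical.

```shell
# count_fastq_reads FASTQ...
# Prints file name and read count (line count divided by 4) for each file.
# Assumption: every record occupies exactly four lines.
count_fastq_reads() {
  for fastq in "$@"; do
    lines=$(wc -l < "$fastq")
    printf '%s\t%d\n' "$fastq" "$((lines / 4))"
  done
}

# Example:
# count_fastq_reads mixed_bacteria_10x-*-predicted_reads.fastq
```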
In the following, we describe by way of example how to simulate reads, create a read pool, run the Ultraplexing algorithm and assemble the assigned reads, based on three random plasmids. The necessary data can be found in the folder “Example1”:
After downloading the folder "Example1", change into it in a terminal and run the commands in the order described below (assuming all requirements are met).
If you don't want to simulate data, or simulation is not possible on your device, you can skip steps 1 and 2 (simulation and creation of the long-read pool) and use the long-read pool and other required data we provide.
To do so, download the zip file "Example1_simulated_example_data.zip" from
https://uni-duesseldorf.sciebo.de/s/oHFl3FCArhPhHb5
and copy the content of the unzipped folder "Example1_simulated_example_data" (Sim_Pipeline, example1_plasmid_ids_and_pathways.txt, example1_plasmid_read_pool.fastq and example1_plasmid_stats.txt) into your folder "Example1". Before continuing the pipeline from step 3 ("Ultraplexing:..."), open "example1_plasmid_ids_and_pathways.txt" and change the absolute paths of the simulated short-read files (contained in the folder "Sim_Pipeline") to the paths matching your storage location.
perl simulation_pipeline.pl example1_list_of_plasmids.txt 8500 150
Important: The paths for pbsim, the pbsim qc-model and wgsim need to be replaced in the script "simulation_pipeline.pl" (lines 10-12) to fit your installations before running it.
Important: Please keep these numerical parameters for the example run as they are, since the scripts are currently still hard-coded for this mean length and coverage.
perl create_pool.pl example1_list_of_plasmids.txt 3 10000000 example1_plasmid
perl UltraPlexer.pl --prefix example1_plasmid --action classify --samples_file example1_plasmid_ids_and_pathways.txt --longReads_FASTQ example1_plasmid_read_pool.fastq
perl UltraPlexer.pl --prefix example1_plasmid --action generateCallFile --samples_file example1_plasmid_ids_and_pathways.txt
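If you use the provided example data instead of simulating your own, the short-read paths in "example1_plasmid_ids_and_pathways.txt" must point to your local "Sim_Pipeline" folder before the two commands above can run. Instead of editing the file by hand, the paths can be rewritten with a small helper. The sketch below assumes that the file is tab-separated and that every path in columns 2 and 3 contains the component "Sim_Pipeline/"; the helper name "relocate_sim_paths" is hypothetical.

```shell
# relocate_sim_paths SAMPLES_FILE NEW_BASE_DIR
# Prints the samples file with everything before "Sim_Pipeline/" in
# columns 2 and 3 replaced by NEW_BASE_DIR. Other content is left untouched.
relocate_sim_paths() {
  awk -F '\t' -v OFS='\t' -v base="$2" '{
    for (col = 2; col <= 3; col++) {
      pos = index($col, "Sim_Pipeline/")
      if (pos > 0) $col = base "/" substr($col, pos)
    }
    print
  }' "$1"
}

# Example (writes to a new file so the original stays intact):
# relocate_sim_paths example1_plasmid_ids_and_pathways.txt "$PWD" > example1_local_paths.txt
```

The output file name in the example is arbitrary; pass whichever file you create to --samples_file.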
perl create_kmer_based_fastq_for_simulations.pl example1_plasmid.classification_k19.called_kmers example1_plasmid_read_pool.fastq example1_plasmid
(exemplary for Plasmid1)
perl parse_calling_tbl.pl example1_plasmid.classification_k19.called_kmers
example1_plasmid.classification_k19.called_kmers Summary Correct_Reads: 1147 False_Reads: 47 Ratio_Correct_Reads: 0.960636515912898
Plasmid1 false: 4 Plasmid1 true: 394
Plasmid2 false: 16 Plasmid2 true: 382
Plasmid3 false: 27 Plasmid3 true: 371
Here you can see that over 96% of the simulated reads were assigned correctly. The remaining <4% are misassignments that are due to sequence homology or that do not substantially affect hybrid assemblies (as we found in our experiments).
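The summary ratio follows directly from the counts: the per-plasmid numbers add up to 394 + 382 + 371 = 1147 correct and 4 + 16 + 27 = 47 false reads, and the ratio is 1147 / (1147 + 47). A minimal sketch of that arithmetic follows; the helper name "correct_read_ratio" is hypothetical and not part of the repository.

```shell
# correct_read_ratio CORRECT_READS FALSE_READS
# Prints CORRECT / (CORRECT + FALSE) with four decimal places.
correct_read_ratio() {
  awk -v c="$1" -v f="$2" 'BEGIN {
    total = c + f
    printf "%.4f\n", (total ? c / total : 0)
  }'
}

# Example with the summary counts from the example run:
correct_read_ratio 1147 47
```

For these counts the call prints 0.9606, which matches the reported Ratio_Correct_Reads after rounding. The same helper can be applied to the per-plasmid counts.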
(exemplary for predicted reads for Plasmid1)
/gpfs/project/dilthey/software/Unicycler/unicycler-runner.py --spades_path /software/SPAdes/3.11.1/ivybridge/bin/spades.py --racon_path /gpfs/project/dilthey/software/racon/bin/racon --pilon_path /software/pilon/1.22/pilon-1.22.jar -t 2 -1 Sim_Pipeline/Plasmid1_l8500_c150/Plasmid1_l8500_c150-filtered_R1.fastq -2 Sim_Pipeline/Plasmid1_l8500_c150/Plasmid1_l8500_c150-filtered_R2.fastq -l example1_plasmid-Plasmid1-predicted_reads.fastq -o example1_plasmid-Plasmid1-predicted_reads_unicycler
Important: The paths for unicycler, spades, racon and pilon need to be replaced to fit your installations.
(exemplary for Plasmid1)
1. nucmer -p Plasmid1 example1_plasmid-Plasmid1-predicted_reads_unicycler/assembly.fasta example1_plasmid-Plasmid1-true_reads_unicycler/assembly.fasta
2. mummerplot2 Plasmid1.delta --png -p Plasmid1
If you take a look at the resulting plots, you will see that the assemblies of the predicted reads align perfectly to the assemblies of the true reads.