Notice: please update to 1.7.5 or later, which fixes a bug in the multiplicity estimation of self-loop vertices.
This toolkit assembles organelle genomes from genome skimming data.
It achieved the best overall performance on both simulated and real data and was recommended as the default for chloroplast genome assembly in a third-party comparison paper (Freudenthal et al. 2020. Genome Biology).
Please report the versions of GetOrganelle and its dependencies in your manuscript for reproducible science.
Citation: Jian-Jun Jin, Wen-Bin Yu, Jun-Bo Yang, Yu Song, Claude W. dePamphilis, Ting-Shuang Yi, De-Zhu Li. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biology 21, 241 (2020). https://doi.org/10.1186/s13059-020-02154-5
License: GPL https://www.gnu.org/licenses/gpl-3.0.html
Please also cite the dependencies if used:
GetOrganelle is currently maintained under Python 3.7.0, but is designed to be compatible with versions higher than 3.5.1 and 2.7.11. It was built for Linux and macOS. Windows Subsystem for Linux is currently not supported; we are working on this.
The easiest way to install GetOrganelle and its dependencies is using conda:
conda install -c bioconda getorganelle
You have to install Anaconda or Miniconda before using the above command. If you don't like conda, or want to follow the latest updates, you can find more installation options here (my preference).
After installation of GetOrganelle v1.7+, please download and initialize the database of your preferred organelle genome type (embplant_pt, embplant_mt, embplant_nr, fungus_mt, fungus_nr, animal_mt, and/or other_pt). Supposing you are assembling chloroplast genomes:
get_organelle_config.py --add embplant_pt,embplant_mt
If the connection keeps failing, please manually download the latest database from GetOrganelleDB and initialize it from local files.
The database will be located at ~/.GetOrganelle by default, which can be changed via the command line parameter --config-dir, or via the shell environment variable GETORG_PATH (see more here).
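As a minimal sketch of relocating the database through the environment variable, the snippet below sets GETORG_PATH before initialization. The directory name is an assumption chosen for illustration, and the initialization step is skipped when get_organelle_config.py is not on the PATH.

```shell
# Hypothetical custom location for the database; any writable directory should do.
export GETORG_PATH="${HOME}/getorganelle_db"
mkdir -p "$GETORG_PATH"

# Initialize the databases only if GetOrganelle is installed.
if command -v get_organelle_config.py >/dev/null 2>&1; then
    get_organelle_config.py --add embplant_pt,embplant_mt
else
    echo "get_organelle_config.py not found; install GetOrganelle first" >&2
fi
```

Passing --config-dir on the command line is the other route mentioned above. Which of the two wins when both are set is not covered here, so it is safer to use only one of them.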
Download a simulated Arabidopsis thaliana WGS dataset:
wget https://github.com/Kinggerm/GetOrganelleGallery/raw/master/Test/reads/Arabidopsis_simulated.1.fq.gz
wget https://github.com/Kinggerm/GetOrganelleGallery/raw/master/Test/reads/Arabidopsis_simulated.2.fq.gz
then verify the integrity of downloaded files using md5sum:
md5sum Arabidopsis_simulated.*.fq.gz
# 935589bc609397f1bfc9c40f571f0f19 Arabidopsis_simulated.1.fq.gz
# d0f62eed78d2d2c6bed5f5aeaf4a2c11 Arabidopsis_simulated.2.fq.gz
# Please re-download the reads if your md5 values do not match the above
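The comparison by eye can also be scripted. The following is a minimal sketch, assuming md5sum from GNU coreutils is available; check_md5 is a hypothetical helper name, and the digests are the ones listed above.

```shell
# check_md5 FILE EXPECTED: succeeds only when FILE's md5 digest equals EXPECTED.
check_md5() {
    actual=$(md5sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Expected digests copied from the listing above, stored as "file:digest".
for pair in \
    "Arabidopsis_simulated.1.fq.gz:935589bc609397f1bfc9c40f571f0f19" \
    "Arabidopsis_simulated.2.fq.gz:d0f62eed78d2d2c6bed5f5aeaf4a2c11"
do
    file=${pair%%:*}
    expected=${pair##*:}
    if [ ! -f "$file" ]; then
        echo "$file: not downloaded yet"
    elif check_md5 "$file" "$expected"; then
        echo "$file: OK"
    else
        echo "$file: MISMATCH, please re-download"
    fi
done
```

Writing the two digest lines to a checksum file and running md5sum -c on it is an equivalent built-in route.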
then do the fast plastome assembly (memory: ~600MB, CPU time: ~60s):
get_organelle_from_reads.py -1 Arabidopsis_simulated.1.fq.gz -2 Arabidopsis_simulated.2.fq.gz -t 1 -o Arabidopsis_simulated.plastome -F embplant_pt -R 10
You should get a running log similar to the one here and the same result as here.
Find more real data examples at GetOrganelle/wiki/Examples, GetOrganelleGallery and GetOrganelleComparison.
Find more organelle genome assembly instructions at GetOrganelle/wiki.
In most cases, all you need to do is type in one simple command as suggested in Recipes. But you are still highly recommended to read the following minimal introductions:
The green workflow in the flowchart below shows the processes of get_organelle_from_reads.py.
Input data
Currently, get_organelle_from_reads.py was written for Illumina paired-end/single-end data (fastq or fastq.gz). We recommend using adapter-trimmed raw reads without quality control.
Usually, >1G per end is enough for plastome assembly for most normal angiosperm samples,
and >5G per end is enough for mitochondrial genome assembly.
Since v1.6.2, get_organelle_from_reads.py will automatically estimate the amount of read data it needs, with no need for user assignment or data reduction (see flags --reduce-reads-for-coverage and --max-reads).
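To check how much data a read file holds before running, the bases can be counted directly. This is a minimal sketch, assuming that "G" above refers to gigabases of sequence rather than file size and that each FASTQ record spans exactly four lines; count_bases is a hypothetical helper name.

```shell
# count_bases FILE: print the total number of sequenced bases in a FASTQ file
# (plain or gzip-compressed), assuming four lines per record.
count_bases() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}
```

For example, count_bases Arabidopsis_simulated.1.fq.gz prints the base count of the forward reads, which can then be compared against the thresholds above.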
Main Options
-w
The word size, like the kmer in assembly, is crucial to the feasibility and efficiency of this process.
The best word size varies with the data and is affected by read length, read quality, base coverage, organelle DNA percentage and other factors.
By default, GetOrganelle automatically estimates a proper word size based on the data characteristics.
Although the automatically estimated word size guarantees neither the best performance nor the best result,
you do not need to adjust this value (-w) if a complete/circular organelle genome assembly is produced,
because the circular result generated by GetOrganelle is highly consistent under different options and seeds.
The automatically estimated word size may be unreliable for some animal mitogenome data due to inaccurate coverage estimation,
in which case you should fine-tune it manually.
-k
The best kmer(s) depend on a wide variety of factors too.
Although more kmer values increase the running time, you are recommended to use a wide range of kmers to benefit from the power of SPAdes.
Empirically, you should include at least one small kmer (e.g. 21) and one large kmer (e.g. 85) for a successful organelle genome assembly.
The largest kmer in the gradient may be crucial to the success rate of achieving the complete circular organelle genome.
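Because the kmer gradient is consumed by SPAdes, the constraints that SPAdes documents for its own -k option (odd values, below 128, ascending order) presumably apply here as well. The sketch below checks a list against those three rules before a long run is started; check_kmers is a hypothetical helper, and whether GetOrganelle performs the same validation itself is not stated in this document.

```shell
# check_kmers LIST: succeed if the comma-separated LIST is odd-valued,
# below 128, and strictly ascending (the SPAdes rules for -k).
check_kmers() {
    [ -n "$1" ] || { echo "empty kmer list" >&2; return 1; }
    prev=0
    for k in $(printf '%s' "$1" | tr ',' ' '); do
        case "$k" in
            0*|*[!0-9]*) echo "not a positive integer: $k" >&2; return 1 ;;
        esac
        if [ $((k % 2)) -eq 0 ]; then
            echo "kmer must be odd: $k" >&2; return 1
        fi
        if [ "$k" -ge 128 ]; then
            echo "kmer must be below 128: $k" >&2; return 1
        fi
        if [ "$k" -le "$prev" ]; then
            echo "kmers must be ascending: $k after $prev" >&2; return 1
        fi
        prev=$k
    done
    return 0
}
```

A call such as check_kmers 21,45,65,85,105 succeeds silently, while a list containing an even value or a descending step prints the reason to stderr and returns a non-zero status.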
-s
GetOrganelle takes the seed (fasta format; if not provided, the default is GetOrganelleLib/SeedDatabase/*.fasta) as the probe,
and recruits target reads in successive rounds (the extending process).
The default seed works for most samples, but using a complete organelle genome sequence of a related species as the seed would help the assembly in many cases
(e.g. degraded DNA samples, or fast-evolving animal/fungal samples; see more here).
Key Results
The key output files include
*.path_sequence.fasta, each fasta file represents one type of genome structure
*.selected_graph.gfa, the organelle-only assembly graph
get_org.log.txt, the log file
extended_K*.assembly_graph.fastg, the raw assembly graph
extended_K*.assembly_graph.fastg.extend_embplant_pt-embplant_mt.fastg, a simplified assembly graph
extended_K*.assembly_graph.fastg.extend_embplant_pt-embplant_mt.csv, a tab-format contig label file for Bandage visualization
You may delete all files other than those above if the resulting genome is complete (indicated in the log file and the name of the *.fasta).
You are expected to obtain the complete organelle genome assembly for most animal/fungal mitogenomes and plant chloroplast genomes
(see here for nuclear ribosomal DNAs) with the recommended recipes.
If GetOrganelle fails to generate the complete circular genome (producing *scaffolds*path_sequence.fasta),
please follow here to adjust your parameters for a second run.
You could also use the incomplete sequence to conduct downstream analysis.
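When many samples are processed, the file-name convention above can be used for a first triage. This is a rough sketch that relies only on the *scaffolds*path_sequence.fasta pattern mentioned here; classify_result and its three labels are hypothetical, and the log file remains the authoritative indicator of completeness.

```shell
# classify_result DIR: print a coarse status for a GetOrganelle output directory
# based only on the names of the *path_sequence.fasta files it contains.
classify_result() {
    if ls "$1"/*scaffolds*path_sequence.fasta >/dev/null 2>&1; then
        echo "incomplete"          # scaffold-level result, consider a second run
    elif ls "$1"/*path_sequence.fasta >/dev/null 2>&1; then
        echo "possibly-complete"   # no scaffold marker; confirm in get_org.log.txt
    else
        echo "no-result"           # no path sequence was written
    fi
}
```

For example, classify_result Arabidopsis_simulated.plastome reports the status of the test run from the quick start.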
The blue workflow in the chart below shows the processes of get_organelle_from_assembly.py.
Input data & Main Options
-g
The input must be a FASTG or GFA formatted assembly graph file.
If you input an assembly graph assembled from total DNA sequencing using a third-party de novo assembler (e.g. Velvet),
the assembly graph may include a large number of non-target contigs.
You may want to use --min-depth and --max-depth to greatly reduce the computational burden for target extraction.
If you input an organelle-equivalent assembly graph
(e.g. manually curated and exported using Bandage), you may use --no-slim.
Key Results
The key output files include
*.path_sequence.fasta, one fasta file represents one type of genome structure
*.fastg, the organelle related assembly graph to report for improvement and debug
*.selected_graph.gfa, the organelle-only assembly graph
get_org.log.txt, the log file
Please refer to the GetOrganelle FAQ to fine-tune the arguments, especially concerning word size, memory, and clock time.
Embryophyta
To assemble an Embryophyta plant plastid genome (plastome), e.g. using 2G raw data of 150 bp paired reads, typically I use:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o plastome_output -R 15 -k 21,45,65,85,105 -F embplant_pt
or in a draft way:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o plastome_output --fast -k 21,65,105 -w 0.68 -F embplant_pt
or in a slow and memory-economic way:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o plastome_output -R 30 -k 21,45,65,85,105 -F embplant_pt --memory-save
To assemble an Embryophyta plant mitochondrial genome (mitogenome), usually you need more than 5G raw data:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o mitochondria_output -R 20 -k 21,45,65,85,105 -P 1000000 -F embplant_mt
To assemble Embryophyta plant nuclear ribosomal RNA (18S-ITS1-5.8S-ITS2-26S):
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -o nr_output -R 10 -k 35,85,115 -F embplant_nr
Non-embryophyte
Non-embryophyte plastomes and mitogenomes can be divergent from embryophyte ones. We have not explored this very much, but many users have successfully assembled them with GetOrganelle using the default database or a customized database.
There is a built-in other_pt mode and a prepared default database for non-embryophyte plastomes. I would start with -F other_pt and similar options as in the embplant_pt part. However, there is no such built-in mode for non-embryophyte mitogenomes. Considering that the sequences may be highly divergent from embplant_mt, besides using similar options as in the embplant_mt part, I would make a pair of customized seed database and label database, then use them to run GetOrganelle following the guidance here.
Fungus
To assemble a fungus mitochondrial genome:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -R 10 -k 21,45,65,85,105 -F fungus_mt -o fungus_mt_out
To assemble fungus nuclear ribosomal RNA (18S-ITS1-5.8S-ITS2-28S):
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -R 10 -k 21,45,65,85,105 -F fungus_nr -o fungus_nr_out
Animal
To assemble an animal mitochondrial genome:
get_organelle_from_reads.py -1 forward.fq -2 reverse.fq -R 10 -k 21,45,65,85,105 -F animal_mt -o animal_mt_out
Animal nuclear ribosomal RNA will be available in the future. Issue136 is the place to follow.
There are as many available organelle types as in the From Reads section (see more by get_organelle_from_assembly.py -h), but the simplest usage is not that different. Here is an example to extract the plastid genome from an existing assembly graph (*.fastg/*.gfa; e.g. from long-read sequencing assemblies):
get_organelle_from_assembly.py -F embplant_pt -g ONT_assembly_graph.gfa
See brief illustrations of those arguments by typing in:
get_organelle_from_reads.py -h
or see the detailed illustrations:
get_organelle_from_reads.py --help
The same brief -h and verbose --help menus can be found for get_organelle_from_assembly.py.
You may also find a summary of above information here at Usage.
Please check the GetOrganelle wiki page first. If your question is specific to a run, please attach the get_org.log.txt file and the post-slimming assembly graph (assembly_graph.fastg.extend_*.fastg; this could be a Bandage-visualized *.png image to protect your data privacy).
Although older versions like 1.6.3/1.7.1/1.7.6 may be more stable, we always strongly encourage you to keep updated. GetOrganelle is actively updated with new fixes and new features, but new bugs too. So if you catch one, please do not be surprised, and report it to us. We usually respond quickly to bugs.
Find Questions & Answers at GetOrganelle Discussions: Recommended
This was previously located at GetOrganelle Issues where you may find old Q&A
Report Bugs & Issues at GetOrganelle Issues:
Please avoid duplicate and miscellaneous issues
QQ group (ID: 908302723): only for mutual help, and we will no longer reply to questions there
Do NOT write to us directly with your questions; instead, please post the questions publicly, using the above platforms (we will be informed automatically) or any other platform (inform us of it). Our emails (jianjun.jin@columbia.edu, yuwenbin@xtbg.ac.cn) are only for receiving public question alerts and private data (if applicable) associated with those public questions. When you send your private data to us, include in the email a link to where you posted the question. Our only reply emails will be a receiving confirmation, while our answers will be posted in a public place.