Crossmapper is an automated bioinformatics pipeline for asessing the rate of read crossmapping when two or more organisms are sequenced as one sample. The software can be used for planning such kind of experimental setups as dual- or multiple RNA-seq (mainly for host-pathogen, symbiont and cohabitant interaction studies), metagenomics studies, sequencing and analysis of hybrid species, allele-specific expression studies, and can be extended for the use in large sequencing facilities for resource optimization.
Based on in-silico read simulation and back-mapping to the original genomes of sequenced organisms, Crossmapper allows the users to assess the rate of incorrect unique and multimapped reads to non-corresponding genomes and thus helps to optimize the sequencing parameters such as the read lenght, paired/single-end, mapping parameters, etc., prior of performing the actual sequencing experiment.
Crossmapper is developed by:
Hrant Hovhannisyan (Comparative Genomics group, Centre for Genomic Regulation, Barcelona, Spain, contact email: grant.hovhannisyan@gmail.com) and
Ahmed Hafez (Biotechvana, Valencia, Spain, contact email: ah.hafez@gmail.com ).
Both are part of EU funded Innovative Training Network OPATHY - From Omics to Patient: Improving Diagnostics of Pathogenic Yeasts.
Crossmapper is distribued under the GNU GENERAL PUBLIC LICENSE v3.
Crossmapper is published in the Bioinformatics journal.
We have implemented the Crossmapper in Python 3.6 as an Anaconda package. Thus, the Crossmapper installation is as easy as any other Anaconda package, without a need to solve any dependency issues.
If anaconda or miniconda package managers are installed, simply run:
conda install -c gabaldonlab -c bioconda crossmapper
to install crossmapper and all its dependencies.
The examplified procedure of Crossmapper installation and optional specific enviroment creation can be found in our step-by-step tutorial.
The basic usage arguments of Crossmapper can be found by typing crossmapper -h
in the command line:
usage: crossmapper [-h] [-v] {DNA,RNA} ...
-- crossmapper Software
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
SimulationType:
{DNA,RNA} Simulation type. Choose to simulate either DNA or RNA data
DNA Simulate DNA data
RNA Simulate RNA data
Crossmapper has two running options: DNA and RNA. By running, for example, crossmapper RNA -h
, the user can see the parameters for RNA mode. Optional arguments are the same for DNA mode, but on the bottom of the help page the user can find the arguments specific only for RNA mode.
usage: crossmapper RNA [-h] -g fasta [fasta ...] [-t int] [-e float] [-d int]
[-s int]
(-N int [int ...] | -C float/int [float/int ...])
[-rlay {SE,PE,both}] [-rlen int] [-r float] [-R float]
[-X float] [-S int] [-AMB float] [-hapl] [-o PATH]
[-gb] [-rc] [-gn name [name ...]]
[--mapper-template PATH] [-max_mismatch_per_len float]
[-bact_mode] [-max_mismatch int] -a gtf [gtf ...]
[-star_tmp PATH]
optional arguments:
-h, --help show this help message and exit
-t int, --threads int
Number of cores to be used for all multicore-
supporting steps (default: 1)
-e float, --error float
Base error rate (default: 0.02)
-d int, --outer_dist int
Outer distance between the two reads. For example, in
case of 2x50 reads, d=300 and s=0 the mates will be
200 bp apart from each other. (default: 500)
-s int, --s_dev int Standard deviation of outer distance. (default: 30)
-N int [int ...], --N_read int [int ...]
The number of reads/read pairs to generate. This
parameter can not be used alongside with -C (default:
None)
-C float/int [float/int ...], --coverage float/int [float/int ...]
Generate the number of reads that reaches the
specified coverage. Coverage is calculated as:C =
N*rlen/L, where L is the length of the
genome/transcriptome (default: None)
-rlay {SE,PE,both}, --read_layout {SE,PE,both}
Specify the read configuration - single-end (SE),
paired-end (PE), or both (both). If chosen 'both', the
software will make separate analysis with each
configuration (default: SE)
-rlen int, --read_length int
Specify the read length. Choose from the possible read
lengths available for Illumina machines:
50,75,100,125,150,300. The user can either enter a
specific length, or specify a (!) COMMA-SEPARATED (no
spaces are allowed between commas) list of desired
read lengths. In the latter case, the software will
perform the analysis for all specified values
separatelly and will report mapping statistics in a
form of a graph (default: 50)
-r float, --mut_rate float
Mutation rate. (default: 0.001)
-R float, --indel_fraction float
Fraction of indels. (default: 0.015)
-X float, --indel_extend float
Probability of an indel to be extended. (default: 0.3)
-S int, --random_seed int
Seed for random generator. (default: -1)
-AMB float, --discard_ambig float
Disgard if the fraction of ambiguous bases is higher
than this number. (default: 0.05)
-hapl, --haplotype_mode
Haplotype mode. If specified, the haploid mutations
will be simulated instead of diploid. (default: False)
-o PATH, --out_dir PATH
Specify the output directory for crossmap output
files. (default: crossmap_out)
-gb, --groupBarChart Use a grouped bar chart in the output report instead
of individual bar chart. (default: False)
-rc, --reportCrossmapped
Report all cross mapped reads into csv file. (default:
False)
-gn name [name ...], --genome_names name [name ...]
Specify names of the genomes. The names will appear in
the report file. (default: None)
--mapper-template PATH, --mapper-template PATH
--mapper-template (default: None)
Required Arguments:
-g fasta [fasta ...], --genomes fasta [fasta ...]
Specify the genome files in fasta format. Enter genome
names separated by whitespace. NOTE: Keep the same
order of listing for gtf/gff files (default: None)
Mapper and annotation Arguments:
Arguments specific to STAR Mapper
-max_mismatch_per_len float, --outFilterMismatchNoverReadLmax float
From STAR manual: alignment will be output only if its
ratio of mismatches to *read* length is less than or
equal to this value: for 2x100b, max number of
mismatches is 0.04*200=8 for the paired read.
(default: 0.04)
-bact_mode, --bacterial_mode
This option prohibits spliced alignments for STAR and
it can be used for mapping bacterial data. (default:
False)
-max_mismatch int, --outFilterMismatchNmax int
From STAR manual: alignment will be output only if it
has no more mismatches than this value (default: 10)
-a gtf [gtf ...], --annotations gtf [gtf ...]
Specify the gtf/gff files. Enter the file names
separated by whitespace. NOTE: Keep the same order of
listing as for genome files (default: None)
-star_tmp PATH, --star_temp_dir PATH
Specify a full path to a local temprorary directory,
where all intermediate files of STAR will be written.
This option can be used when running Crossmaper from a
local machine in a server or cluster with SAMBA
connection. (default: ./TMPs)
We have created a Step-by-Step tutorial with detailed description of options and workflow steps performed by Crossmapper. We guide the user from the initial software installation and setup untill the interpretations of the obtained results.
To save time for users, we have precomputed the cross-mapping values for some widely sequences species, namelly human combined with each of the following species: M. musculus, A. thaliana, D. melanogaster, C. elegans, S. cerevisiae and S. typhimurium (please download the files to view correctly). We will expand these list of precomputed values in future.
For any issus with Crossmapper usage, please first refer to our Troubleshooting page. If issues still persist, the users can report bugs/issues in our Github Issues section, or alternativelly can ask questions, report bugs and suggest new features for Crossmapper in our google group.