Uauy-Lab / bioruby-polyploid-tools

Library and tools to deal with polyploid genomics

bio-polyploid-tools

Introduction

These tools are designed to deal with polyploid wheat. The first tool designs KASP primers, making them as specific as possible.

Installation

gem install bio-polyploid-tools

You need to have in your $PATH the following programs:

The code was originally developed on Ruby 2.1, 2.3 and 2.5. It may work on older versions. However, it is only actively tested on currently supported Ruby versions:

PolyMarker

To run PolyMarker with the CSS wheat contigs, you need to unzip the reference file from Ensembl.

polymarker.rb --contigs Triticum_aestivum.IWGSC2.25.dna.genome.fa --marker_list snp_list.csv --output output_folder

The snp_list file must follow the convention ID,Chromosome,SEQUENCE with the SNP inside the sequence in the format [A/T]. For an example, see test/data/short_primer_design_test.csv.

If you want to use the web interface, visit the PolyMarker webservice at TGAC

The available command line arguments are:

Usage: polymarker.rb [options]
    -c, --contigs FILE               File with contigs to use as database
    -m, --marker_list FILE           File with the list of markers to search from
    -g, --genomes_count INT          Number of genomes (default 3, for hexaploid)
    -s, --snp_list FILE              File with the list of snps to search from, requires --reference to get the sequence using a position
    -t, --mutant_list FILE           File with the list of positions with mutation and the mutation line.
                                     requires --reference to get the sequence using a position
    -r, --reference FILE             Fasta file with the sequence for the markers (to complement --snp_list)
    -o, --output FOLDER              Output folder
    -e, --exonerate_model MODEL      Model to be used in exonerate to search for the contigs
    -i, --min_identity INT           Minimum identity to consider a hit (default 90)
    -a, --arm_selection arm_selection_embl|arm_selection_morex|arm_selection_first_two
                    Function to decide the chromosome arm
    -p, --primer_3_preferences FILE  file with preferences to be sent to primer3
    -v, --variation_free_region INT  If present, avoid generating the common primer if there are homoeologous SNPs within the specified distance (not tested)
    -x, --extract_found_contigs      If present, save in a separate file the contigs with matches. Useful to debug.
    -P, --primers_to_order           If present, saves a file named primers_to_order which contains the KASP tails

Input formats

The following formats are used to define the marker sequences:

Marker list

If the option --marker_list FILE is used, the SNP and the flanking sequence are included in the file. The format contains 3 columns (the order is important): the marker ID, the chromosome, and the sequence with the SNP marked in the format [A/T].

Example:

BS00068396_51,2A,CGAAGCGATCCTACTACATTGCGTTCCTTTCCCACTCCCAGGTCCCCCTA[T/C]ATGCAGGATCTTGATTAGTCGTGTGAACAACTGAAATTTGAGCGCCACAA
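To make the column layout concrete, here is a minimal Ruby sketch that splits one marker-list line and extracts the SNP alleles. The helper parse_marker_line and the constant MARKER_SNP are hypothetical names used for illustration only; they are not part of bio-polyploid-tools, and the sketch assumes a single [X/Y] SNP per sequence.

```ruby
# Hypothetical helper, not part of bio-polyploid-tools: splits one
# marker-list line (ID,Chromosome,SEQUENCE) and extracts the SNP alleles.
MARKER_SNP = /\A([ACGTacgt]*)\[([ACGT])\/([ACGT])\]([ACGTacgt]*)\z/

def parse_marker_line(line)
  # The order of the three columns matters, so split positionally.
  id, chromosome, sequence = line.strip.split(',', 3)
  match = MARKER_SNP.match(sequence.to_s)
  raise ArgumentError, "no [X/Y] SNP found in: #{line}" unless match

  {
    id: id,
    chromosome: chromosome,
    left_flank: match[1],
    alleles: [match[2], match[3]],
    right_flank: match[4],
    snp_position: match[1].length + 1 # 1-based position within the sequence
  }
end

line = 'BS00068396_51,2A,CGAAGCGATCCTACTACATTGCGTTCCTTTCCCACTCCCAGGTCCCCCTA[T/C]ATGCAGGATCTTGATTAGTCGTGTGAACAACTGAAATTTGAGCGCCACAA'
marker = parse_marker_line(line)
puts "#{marker[:id]} on #{marker[:chromosome]}: #{marker[:alleles].join('/')} at #{marker[:snp_position]}"
```

A check like this can be useful for validating a marker list before passing it to polymarker.rb, since a malformed SNP notation raises an error immediately.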

SNP list

If the flanking sequence is unknown but the position on a reference is available, the option --snp_list can be used, and the FASTA file with the reference sequence must be provided with the option --reference. This allows the SNPs to have been discovered on a different assembly or set of contigs from the reference given in the option --contigs. The format contains the following positional columns:

Example:

IWGSC_CSS_1AL_scaff_110,C,519,A,2A

This file format can be used with snp_positions_to_polymarker.rb to produce the input for the option --marker_list.
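The idea behind that conversion can be sketched in Ruby as follows. This is not the code of snp_positions_to_polymarker.rb; the function snp_to_marker_line, the generated marker ID, the flank parameter and the toy reference are all hypothetical. The sketch assumes the columns are contig, reference allele, 1-based position, alternative allele and chromosome.

```ruby
# Hypothetical sketch; the real conversion is done by snp_positions_to_polymarker.rb.
# Assumed columns: contig, reference allele, 1-based position,
# alternative allele, chromosome.
def snp_to_marker_line(snp_line, reference, flank: 50)
  contig, ref, position, alt, chromosome = snp_line.strip.split(',')
  sequence = reference.fetch(contig)
  index = Integer(position) - 1 # convert to a 0-based string index
  found = index >= 0 ? sequence[index] : nil
  unless found && found.upcase == ref.upcase
    raise ArgumentError, "expected #{ref} at #{contig}:#{position}, found #{found.inspect}"
  end

  # Take up to `flank` bases on each side, clipped at the contig ends.
  left_start = [index - flank, 0].max
  left = sequence[left_start...index]
  right = sequence[index + 1, flank].to_s
  "#{contig}_#{position},#{chromosome},#{left}[#{ref}/#{alt}]#{right}"
end

# Toy reference with a made-up sequence, for illustration only.
reference = { 'toy_contig' => 'ACGTACGTCAGGTTAC' }
puts snp_to_marker_line('toy_contig,C,9,A,2A', reference, flank: 4)
```

Checking the reference allele against the sequence catches off-by-one errors in the position column before any primers are designed.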

Custom reference sequences

By default, the contigs and pseudomolecules from Ensembl are used. However, it is possible to use a custom reference. The argument arm_selection defines the chromosome to which each contig belongs. The default expects ids like IWGSC_CSS_1AL_scaff_110, where the third underscore-separated field is used. A simple way to add custom references is to rename the sequences in the fasta file to follow that convention. Another way is to use the option --arm_selection arm_selection_first_two, where only the first two characters of each contig name are used as the identifier. This is useful when pseudomolecules are named after the chromosomes (e.g. ">1A" in the fasta file).

If your contigs follow a different convention, you can define new parsers in the file ChromosomeArm.rb by adding a new lambda at the beginning, alongside the existing parsers, like:

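@@arm_selection_functions[:embl] = lambda do |contig_name|
  arr = contig_name.split('_')
  ret = "U" # default when no rule below matches
  ret = arr[2][0,2] if arr.size >= 3 # e.g. IWGSC_CSS_1AL_scaff_110 -> "1A"
  ret = "3B" if arr.size == 2 and arr[0] == "v443"
  ret = arr[0][0,2] if arr.size == 1 # name starts with the chromosome, e.g. "1A"
  return ret
end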

The function should return a 2-character string, where the first character is the chromosome number and the second the chromosome group. The symbol in the hash is the name to be used in the argument --arm_selection. If you want your parser to be added to the distribution, feel free to fork and make a pull request.
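As an illustration, a parser for a different naming scheme might look like the sketch below. The convention "chr1A_contig_42" and the symbol :chr_prefix are hypothetical, and a local hash stands in for the class variable in ChromosomeArm.rb so that the snippet runs on its own.

```ruby
# Stand-in for the parser table in ChromosomeArm.rb, so the sketch runs on its own.
arm_selection_functions = {}

# Hypothetical convention: ids such as "chr1A_contig_42".
arm_selection_functions[:chr_prefix] = lambda do |contig_name|
  match = /\Achr([1-7][ABD])/.match(contig_name)
  # Fall back to "U" when the name does not match, as the embl parser does.
  match ? match[1] : 'U'
end

puts arm_selection_functions[:chr_prefix].call('chr1A_contig_42') # prints 1A
puts arm_selection_functions[:chr_prefix].call('scaffold_9')      # prints U
```

Anchoring the pattern at the start of the name keeps the parser from picking up chromosome-like text elsewhere in the identifier.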

Using blast

To use blast instead of exonerate, use the following command:

./bin/polymarker.rb --contigs test/data/BS00068396_51_contigs.fa --marker_list test/data/BS00068396_51_for_polymarker.fa  --aligner blast  -a arm_selection_first_two

Release Notes

0.9.7

There was some strange issue with the numbering, so bumped to 0.9.7

0.8.7

0.8.6

0.8.5

0.8.4

ruby tag_stats.rb -b HI.3206.006.Index_2.CS_125RNA_14d_Leaf8.sorted.bam -r /Users/ramirezr/Dropbox/JIC/expVIPMetadatas/RefSeq1.0/Genes/annotation/IWGSCv1.0_UTR_ALL.cdnas.fasta --tag 'NH'

0.8.3

0.8.2

0.8.1

0.8

0.7.3

0.7.2

0.7.1

0.7.0

0.6.1

Notes