darrenabbey / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License
3 stars 0 forks source link

ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.

As I am no longer a member of the Berman lab, due to completing my PhD degree there and moving on to other things, I don't have direct access to the repo I used while there. I made this fork so that I could more easily work on the project when I feel inclinced to do so. Major changes will be communicated to the berman-lab and potentially incorporated into their repo.

The berman-lab/ymap repo (https://github.com/berman-lab/ymap) supports the tool running live at http://lovelace.cs.umn.edu/Ymap/ for use by the yeast research community. Availability and time to process datasets may vary depending on server availability issues outside my control.

The pipeline can help you analyze your genomic datasets for non-yeast species, but the user interface and figures produced were designed with relatively small genome sizes in mind. (Candida albicans is about 14.7 Mbase in size.) If you want to experiment with using Ymap with larger genomes, I advise you to set up a local installation (https://github.com/darrenabbey/ymap/wiki/Setting-up-Ymap-locally). Genomes that are very small will also likely not produce very useful output.

The paper introducing the project can be found at: https://genomemedicine.biomedcentral.com/articles/10.1186/s13073-014-0100-8

If your find the tool useful in your work, please cite the above paper. Also, it is really cool to hear what people have been doing with the tool, so feel free to reach out to me in any of the several public spaces I can be found.

-- Darren Abbey, PhD

[thebiologistisn@twitter] [thebiologistisn@mastadon(redwombat.social)] [thebiologistisn@TikTok] [thebiologistisn@reddit] [thebiologistisn@instagram] [darrenabbeydesigns@instagram] [http://the-biologist-is-in.blogspot.com/]

======================================================================================

Text below copied from "YMAP_User_Manual.docx". Download and open that file for detailed desciptions of the Ymap user interface and instructions for use. Do note, the specific details of the current version are likely to sometimes change without this document being updated.

======================================================================================

Introduction Overview

YMAP is a web-based tool for the analysis of genome alterations in single cell Eukaryotes. The purpose of this tool is to facilitate the analysis of three major genome changes: copy number variations (CNV), single nucleotide polymorphisms (SNPs) and loss of heterozygosity (LOH).

======================================================================================

The YMAP implementation

The genome analysis pipeline is composed of three main components: a module that performs raw sequence alignment and processing, a module that performs custom copy number variation (CNV) and single nucleotide polymorphism (SNP)/loss of heterozygosity (LOH) analyses, and a module that constructs figures summarizing all completed analyses and then displays them on the webpage. The implementation details for each of these components are described in more detail in the following sections.

The first component of the central computation engine takes the user-input data and attempts to correct some typical file errors before outputting corrected FASTQ file(s) for use by later steps in the pipeline. Typical sequence data is input as one or two (for paired-end reads) FASTQ format files, either raw or compressed in the ZIP or GZ format. Uploading is facilitated by using compressed file format. Uploads of compressed files typically takes a few hours, depending on the file size. The large size of FASTQ files leaves them prone to file transfer errors that result in corruption because the file format doesn’t have an internal error correction/identification system. This corruption often results in the final read entry being incomplete, which can cause analysis programs to crash, and normally has to be dealt with on a case-by-case basis. Trimming the FASTQ file to remove incomplete read entries can then result in paired-end files having imbalanced numbers of reads. Raw paired-end read files also often contain a minority of single-end reads that are generated by some sequencing technologies. Both causes of imbalanced reads can interfere with the processing by some tools designed for paired-end reads exclusively. Trimming the longer of the paired-end FASTQ files to the length of the shorter file resolves this issue and completes the implemented automatic FASTQ file correction algorithm.

The second step in the central computation pipeline is to processes the corrected FASTQ file into a final Binary sequence Alignment/Mapping (BAM) file. The single- or paired-end reads are aligned to one of the installed reference genomes using Bowtie2 [27], resulting in a Sequence Alignment/Mapping (SAM) file. SAMtools [28] is used to compress this into a BAM file. PicardTools [29] is used to standardize the read-group headers in the BAM files, to resolve some formatting irregularities to the BAM file. SAMtools is then used to sort the BAM file, which is required for efficient later processing steps. FASTQC [30] is used to identify the quality coding system used in the input FASTQ files, as a prelude to defining the input parameters for processing by the Genome Analysis ToolKit (GATK [31]), which performs indel-realignment of the BAM files, removing spurious apparent SNPs around true indels in the primary alignment.

The third step in the sequence data processing component of the pipeline is to convert the BAM file into a simpler text file containing limited data for each coordinate across the genome, which simplifies later processing. The SAMtools function mpileup first processes the BAM file into a ‘pileup’ file, which contains information about all of the mapped reads at each chromosome coordinate in a simple format that facilitates subsequent processing by custom Python scripts. The Python scripts extract base call counts for each coordinate, discarding indel and read start/end information. The raw read-depth data per coordinate is saved to a text file that is input into the CNV analysis section of the pipeline. Any coordinates with more than one base call have that information saved to a separate text file that is input into the SNP and LOH analysis section of the pipeline.

======================================================================================

CNV Analysis

CNV analysis of NGS data by the pipeline is based upon read depth across the genome. Several biases can impact read depth and thereby interfere with CNV analysis. Two separate biases, a chromosome-end bias and a GC-content bias, appear sporadically in all types of data examined (including microarray and whole genome sequence (WGseq) data). The mechanism that results in the chromosome end artifact is unclear, but the smooth change in the apparent copy number increase towards the chromosome ends (Fig. 2A) suggests that some DNA preparations may release more genomic DNA as a function of telomere proximity (Jane Usher, personal comm.). A GC-content bias is due to strong positional variations in GC content in the C. albicans genome. This, combined with the PCR amplification bias introduced during sequence library or array preparation, results in a strong positional effect in local copy number estimates (Fig. 3A). In datasets produced from the ddRADseq protocol, a third bias is associated with the length of restriction fragments. A fourth bias, seen consistently in all ddRADseq data sets, appears as a high frequency of short-range increases and decreases in read depth at specific genome positions across all strains analyzed, and thus can be removed by normalization to a control dataset from the reference genome. The YMAP pipeline includes filters, which can be deselected by the user, for each of these biases to correct the data before final presentation and to facilitate detection of bona fide CNVs. The final presentation of the corrected copy number data is in the form of a histogram drawn vertically from the figure centerline (Figs. 2-4, A-B).

The chromosome-end bias is normalized using locally weighted scatterplot smoothing (LOWESS) normalization [32] of average read depth vs. distance to the nearest chromosome end, for 5000 bp windows tiled along each chromosome (Fig. 2C). The LOWESS fitting is performed with a smoothing window size determined for each dataset as that which produces the least error between the fit and the raw data, using 10-fold cross-validation [33]. Dividing the raw data by the fit curve normalizes the bias (Fig. 2D), allowing an unimpeded view of the mapped genome (Fig. 2B, a diploid with no significant CNVs). Because this bias is sporadically present, the correction is optional and is not performed by default.

The GC-content bias is normalized using LOWESS normalization of average read depth vs. GC content, for 5000 bp windows tiled along each chromosome] (Fig. 3C). The LOWESS fitting is performed with a smoothing window size determined for each dataset as that which produces the least error between the fit and the raw data using 10-fold cross-validation. Dividing the raw data by the fit curve normalizes this bias (Fig. 3D) allowing an unimpeded visual examination of CNVs across the genome. For example, it can distinguish chromosome number for a near-tetraploid strain with a small segmental duplication near the centromere of ChrR, three copies of chromosomes 4, 5R and 6, and with 7 copies of the left arm of chromosome 5R (due to the presence of 3 copies of whole Chr5 and 2 copies of an isochromosome (5L) with 2 copies of Chr5L per isochromosome) (Fig. 3B). Because this bias is always present to some degree in all data types examined, the correction is performed by default unless deselected by the user. The ddRADseq protocol generates high read depths at a sub-sampling of genomic loci, resulting in a much-reduced total cost per strain sequenced. The protocol produces a library of restriction fragments digested with two different restriction enzymes (in this case MfoI and MpeI). A strong bias exists in the read depth vs. the length of each valid restriction fragment (obtained via a simulated digest of the reference genome, followed by selecting fragments that have the two restriction fragment ends), (Fig. 4C). The fragment-length-bias is filtered using LOWESS normalization of an average read depth vs. the simulated fragment frequency. The LOWESS fitting is performed with a smoothing window size determined for each dataset as that which produces the least error between the fit and the raw data. Restriction fragments less than 50 bp or greater than 1000 bp show average read depths that exhibit a too much noise and are considered unreliable. Where the LOWESS fit line drops below one read, the fragments are considered unreliable due to the reduced dynamic range in the data. These unreliable data are noted (red points in Fig. 4D) and not used in later steps of the analysis.

For ddRADseq analyses, first the chromosome-end and GC-content bias corrections are applied using data per valid restriction fragment instead of the standard-sized 5000 bp windows used in WGseq analysis. After these corrections are performed, the remaining strong position-effect bias in read depth that remains uncharacterized. This final bias is corrected by normalizing the corrected read depths for each usable restriction fragment by the corrected read depths from a euploid reference dataset. Because the earlier biases differ from dataset to dataset, the reference-normalization is performed as the final normalization step. The result of these corrections is a pronounced reduction in noise in the CNV data as seen by comparing the raw read depth (Fig. 4A) to the corrected read depth (Fig. 4B) for an example dataset.

After these corrections are applied to the raw sequence read data, the corrected copy number estimates are locally smoothed to reduce the impact of high-frequency noise. The estimates are then multiplied by the whole genome ploidy estimate that was determined by flow cytometry of DNA content and entered during setup of the project. The corrected estimates are plotted as a histogram along each chromosome, with the lines drawn vertically from the baseline ploidy entered during project setup. Copy number variations are then evident as regions with prominent black bars.

======================================================================================

SNP/LOH analysis

SNPs are regions of a genome that have two different alleles at the same locus on different homologs. The allelic ratio (0 or 1 for homozygous regions and 0.5 for heterozygous regions in a diploid genome) is used to determine whether a region that had SNPs in the parent/reference strain has undergone LOH to become homozygous. An allelic ratio is calculated for each coordinate by dividing the number of reads with the more abundant base base call by the total number of reads at each coordinate (resulting in values ranging from 0.5 to 1.0).

Three styles of analysis are performed, depending on user input during the project setup. The first style is the default option, which is used when no reference strain or hapmap is available. In this case, the SNP distribution for the strain of interest is displayed as vertical grey bars in the background of each chromosome. Once analysis has completed, this strain can be used as the ‘parent’ for other related strains. In the second style of analysis, a parent strain is chosen and the SNPs in common between that parent and the test strain being analyzed are displayed as grey bars (as in the first style), while any SNPs in the parent that have different allelic ratios in the test strain are displayed in red, if allelic ratios approach 0 or 1, or in green, if ratios suggest unusual allele numbers (often due to CNVs or aneuploidy). The third style of analysis can be chosen if a hapmap for the parent strain background is available. SNPs that remain heterozygous are again displayed in grey, while those that have become homozygous are displayed in the color assigned to the homolog that is retained (e.g., cyan for the ‘a’ allele and magenta for the ‘b’ allele).

For the default option, any coordinates with an allelic ratio near 0.5 (0.50 to 0.75) are considered heterozygous. More extreme allelic ratios are considered to be homozygous, appearing in the dataset due to sequencing errors. The density of heterozygous SNPS is presented as vertical lines spanning the height of each chromosome cartoon, with the intensity of grey color representing the number of SNPs in each 5000 bp bin. If there are fewer than 100 SNPs in a bin, it is drawn with a lighter shade corresponding to the number of SNPs relative to the 100 SNPs threshold. This results in white backgrounds for homozygous regions and increasingly dark shades of grey for regions with higher numbers of SNPs (Fig. 5A).

When a parental type strain of unknown genotype (e.g., a clinical isolate) is selected for a project, the pipeline first calculates the distribution of SNPs across the parental genome in the manner described above. For comparison of the parental genotype to another related strain (e.g., another sample from the same patient), every heterozygous SNP locus in the parent is examined in the second dataset. If the allelic ratio changes from the 0.5 value observed in the reference strain, the SNP is assigned a red color and the final color of each 5000 bp display bin is calculated as the weighted average of all the SNPs within the bin (Fig. 5B). An alternate presentation assigns red color only to coordinates that have transitioned from heterozygous to homozygous (allelic ratio of 1.0) and assigns the green color to coordinates that have unusual allelic ratios (allelic ratios between 0.75 and 1.0, only excluding those with allelic ratios precisely at 1.0) (Fig. 5C). Low SNP counts are factored into the presented colors, as described above for the first style of analysis.

When a known hapmap is selected for a project, the pipeline loads SNP coordinates from the map and examines the allelic ratios of the dataset at those coordinates. For disomic regions of the genome, any SNP locus with an allelic ratio near 0.5 (0.50 to 0.75) is considered heterozygous and assigned the color grey. Any SNP locus with a more extreme allelic ratio is considered homozygous and assigned the color corresponding to the homolog with the matching allele in the map. For regions that are monosomic, trisomic, or larger, colors are assigned to SNPs based on the apparent ratio of homologs present. SNPs within each 5000 bp bin are gathered and the final presented color is determined as the weighted average of the colors assigned to the individual SNPs (Fig. 5D). Low SNP counts are factored into the presented colors as in the cases previously described.

The sparse datasets produced from the ddRADseq protocol introduce a high sampling error to allelic ratio calls, increasing the uncertainty of SNP calls and an increased incidence of coordinates that appear as a SNP in one dataset, but not another. This sampling error in allelic ratio calls interferes with the direct comparison of SNP loci between a dataset and a parental type dataset. If one dataset is examined without comparison to a reference – producing a very noisy CNV map – the allelic ratios are plotted as grey lines emanating from the top and bottom of each chromosome cartoon inwards to the ratio calculated for each coordinate (where the Y-axis ranges from 0.0 to 1.0 for the lines) (Fig. 6A). When a dataset is examined in comparison with a reference, the pipeline produces a figure with allelic ratios for the reference strain drawn as grey lines emanating from the bottom of the cartoon and allelic ratios for the test dataset plotted as red lines drawn from the top of each chromosome (Fig. 6B). Loci with a read-depth lower than 20 are ignored, because the corresponding high sampling error produces a high likelihood of spurious midrange allelic ratios that can appear as heterozygous.

If the user selects a hapmap while setting up an analysis, the higher resolution data of the hapmap allows every SNP locus that appears in the dataset to be examined. The allelic ratios, coupled with the SNP homolog identity information from the hapmap [24, 25], allows coordinates to be assigned colors by how consistent they are with either homolog or with the heterozygous state. Lines are then drawn from the top to the bottom of each chromosome for coordinates with allelic ratios less than 1.0, in the color previously assigned (Fig. 6C). Allelic ratios of exactly 1.0 are not drawn because they often represent the sampling error found in low read depth areas of the sparse dataset. Visual comparison between the allelic ratio plots for related strains facilitates the identification of large regions of LOH (Fig. 6D: magenta at end of left arms of Chr1).

======================================================================================