Generate reference genome masks

mathbionerd commented 8 years ago

Based on self-self chromosome lastZ alignments, identify regions to mask out when assessing depth coverage (not to be used for masking during the alignment step - we want reads to be able to align anywhere). We want to mask out regions in the reference genome that are expected to affect estimates of chromosome-wide depth.

We need this for X, Y, and all autosomes (chr19 will be the default reference autosome), for hg38/GRCh38 reference genome.

To extend this, we will generate these masks for hg19/GRCh37.

A nice pipeline for running lastZ and inferring these will also be useful and make the program more generalizable to any new reference genome build that comes out.

tanyaphung commented 8 years ago

Using the --exact flag (exact match) on lastZ seems to allow for more specificity:

A. Setting --exact=20:

lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=20 --format=rdotplot

ucscgrch38chry_ucscgrch38chry_exact20

B. Setting --exact=35:

ucscgrch38chry_ucscgrch38chry_exact35

C. Setting --exact=50:

ucscgrch38chry_ucscgrch38chry_exact50

I also tried: --seed=14of22 (Seeds require a 22-bp word with matches in 14 specific positions (1110101100110010101111))

lastz chrY.fa --self --notransition --ambiguous=iupac --step=20 --nogapped --nomirror --seed=14of22 --format=rdotplot

ucscgrch38chry_ucscgrch38chry_seed14of22

What are your thoughts in terms of what settings to use for lastZ?

tanyaphung commented 8 years ago

To do next: plot "masked coordinates" on the same plot of self-self lastZ plot. Goal is to visually check that the regions that are masked fall into the multi-mapped regions.

thw17 commented 8 years ago

Based on the identity line, it looks to me like either the first or fourth option looks best. Any significant differences in runtime?

tanyaphung commented 8 years ago

I didn't do any exact measurement of run time but both finished under a minute with the Y chr, so there is not a huge difference in run time that I can see.

mathbionerd commented 8 years ago

There is also an option to mask out the identity line in the middle - which we may want to do to assist with generating the reference mask file. From http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html:

Specify the ‑‑notrivial option. This performs the full computation on both copies, but doesn't report the trivial self-alignment block along the main diagonal (Figure 3(b)).

mathbionerd commented 8 years ago

My sense is this is close to what we want:

C. Setting --exact=50:

tanyaphung commented 8 years ago

Hi Melissa,

I just tried out the --notrivial flag that you suggested, my command is:

lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=50 --notrivial --format=rdotplot

However, I found that the result is the same with and without the --notrivial flag (the output files are the same). Not sure if the --notrivial flag works.

However, I will focus on verifying whether the masked coordinates I obtain from the script visually match next.

Thanks, Tanya

tanyaphung commented 8 years ago

Dot plot of running lastZ on chrY-chrY:

lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=50 --format=rdotplot

ucscgrch38chry_ucscgrch38chry_exact50

Dot plot showing the regions that are multimapped

ucscgrch38chry_ucscgrch38chry_exact50_targetqueryoutbufferregion100kb

For comparison, this plot shows chrY-chrY self alignment (output from lastZ) in black, and the multimapped regions in red:

ucscgrch38chry_ucscgrch38chry_exact50_overlay_targetqueryoutbufferregion100kb

My take is that the regions that are identified as "multimapped" visually match.

I have all the scripts on my local server. I will write a pipeline of how I obtain these coordinate next, and will put it up on github.

mathbionerd commented 8 years ago

This looks awesome! Yes, I think this is exactly what we're looking for.

Having a pipeline will be great, to allow users to use any reference genome they want. Great work!

tanyaphung commented 8 years ago

Hi Melissa,

Thanks for confirming that this is what we want. I will probably be able to get the pipeline up and the masked regions by the end of today.

mathbionerd commented 8 years ago

Thank you, Tanya!

tanyaphung commented 8 years ago

All of the lastZ analyses can be found at Files/lastZ

The main bash script is called generate_reference_genome_masks_pipeline. sh, and it calls other scripts that can be found in the scripts folder.

The folder generate _chrY_masks contains the output when running generate_reference_genome_masks_pipeline. sh.

The main output is window_query_out_bufferRegion50000.txt. The columns with the labels "target_start" and "target_end" indicates the regions that fall within the 10kb region ("win_start" and "win_end"). The columns with the labels "query_start" and "query_end" indicates that these regions are multi-mapped.

I think that the coordinates for "target_start" and "target_end" is what we want for the masked regions.

TODO:

intersect with gapped regions
merge the coordinates
Generate masks for autosomes and chrX

tanyaphung commented 8 years ago

I have several updates/comments.

First, in terms of getting the multimapped coordinates out of lastZ output, Melissa and I last week decided on the strategy of scanning along the genome in Xkb non-overlapping windows (I chose 10kb for now), then obtaining the coordinates where the target coordinates fall into that window but the query coordinates do not fall into that window (which is indicative of multimapping).

For example, let's say my window is

chrY 0 10000

Target coordinates

500 600 8000 8100 9000 9500

Query coordinates:

550 650 10200 10300 30000 30500

Then, we would call the regions (8000, 8100) and (9000, 9500) to be multimapped.

I thought that I should also allow for some more flexibility. What I mean is I would allow for a flanking region of a certain size. For example, if I allow my flanking region to be 5000, then, any query coordinates that fall within (0, 15000) would not be considered multimapped. Thus, only the region (9000, 9500) would be called multimapped because it also mapped to (30000, 30500) which is outside of (0, 15000).

Using this approach on chrY, with 10kb non-overlapping window, and allowing for 50kb flanking, I calculated the total number of bp that are called "multimapped" and found that there are 1926926 bp that are multimapped as compared to 451488 bp that is not. Does the number of bp that is not "multimapped" seem low?

I can run the script with different parameters, and see how it changes.

All the output can be found XYalign/Files/lastZ/generate_chrY_masks/

If anyone wants to test out masking these regions with XYalign, the file to use is toMaskRegions_sorted_merged.bed.

mathbionerd commented 8 years ago

Thank you for this, @tnphung!

There are about 10Mb of ampliconic regions on the human Y chromosome: http://www.nature.com/nature/journal/v423/n6942/fig_tab/nature01722_F4.html

Rough estimates from Skaletsky et al (2003) are:

Region	Mb	Number of Y-linked coding genes
X-transposed	3.4	2
X-degenerate	8.6	16
X-degenerate	10.2	60 (9 families)

What I'm confused about right now, is why the total nucleotide length is so small. The total should be about 20Mb (see above, which doesn't include the 2.7Mb PAR1 and 320kb PAR2), but the sum of what you report here seems to be around 2Mb. Am I missing something?

mathbionerd commented 8 years ago

Also for reference, the pdf of Skaletsky et al (2003):

http://pagelab.wi.mit.edu/pdf/2003%20-%20The%20male-specific%20region%20of%20the%20human%20Y%20chromosome%20is%20a%20mosaic%20of%20discrete%20sequence%20classes.pdf

We can approximate the palindromic/ampliconic regions from the last figure, as a sanity check. I haven't been able to find BED/coordinates of these regions.

mathbionerd commented 8 years ago

Is this issue ready to be closed, @tnphung?

tanyaphung commented 8 years ago

Hi Melissa,

Yes, you can close this issue. I still need to upload the masks for the other chromosomes (right now in the masks folder, there are chr19, X, and Y for hg38). It's running right now so I will upload them as soon as they finish running.

Thanks!

mathbionerd commented 8 years ago

Super!

WilsonSayresLab / XYalign

Generate reference genome masks #5