mathbionerd closed this issue 8 years ago
Using the --exact flag (exact match) on lastZ seems to allow for more specificity:
A. Setting --exact=20:
lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=20 --format=rdotplot
B. Setting --exact=35:
C. Setting --exact=50:
I also tried: --seed=14of22 (Seeds require a 22-bp word with matches in 14 specific positions (1110101100110010101111))
lastz chrY.fa --self --notransition --ambiguous=iupac --step=20 --nogapped --nomirror --seed=14of22 --format=rdotplot
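For anyone who wants to rerun the comparison, here is a minimal sketch that builds the four command lines from the flags quoted above and writes each result to its own file. The option labels, the output file names, and the helper functions are hypothetical, and `--step=10` for the `--exact=35` run is an assumption, since only the `--exact=20` and `--exact=50` commands are quoted in full. It relies on lastz writing its output to stdout.

```python
import subprocess

# Flags shared by every run quoted in this thread.
COMMON = ["--self", "--notransition", "--ambiguous=iupac",
          "--nogapped", "--nomirror", "--format=rdotplot"]

# The four settings compared above. Labels are hypothetical;
# --step=10 for exact35 is an assumption.
OPTIONS = {
    "exact20": ["--step=10", "--exact=20"],
    "exact35": ["--step=10", "--exact=35"],
    "exact50": ["--step=10", "--exact=50"],
    "seed14of22": ["--step=20", "--seed=14of22"],
}


def build_command(fasta, label):
    """Return the lastz argument list for one of the settings above."""
    return ["lastz", fasta] + COMMON + OPTIONS[label]


def run(fasta, label):
    """Run lastz and save its stdout to <label>.rdotplot (hypothetical name)."""
    out_path = f"{label}.rdotplot"
    with open(out_path, "w") as out:
        subprocess.run(build_command(fasta, label), stdout=out, check=True)
    return out_path
```

Calling `run("chrY.fa", "exact50")` would then produce `exact50.rdotplot`, provided `lastz` is on the PATH.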
What are your thoughts in terms of what settings to use for lastZ?
To do next: plot "masked coordinates" on the same plot of self-self lastZ plot. Goal is to visually check that the regions that are masked fall into the multi-mapped regions.
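Alongside the visual check, the same comparison could be done numerically. The sketch below assumes the rdotplot file has one header line with the two sequence names, followed by pairs of `target query` points, with an `NA NA` line closing each segment. The function names and the mask format (a list of `(start, end)` target intervals) are hypothetical.

```python
def parse_rdotplot(text):
    """Return segments as ((t_start, q_start), (t_end, q_end)) tuples.

    Assumes a header line, then points, with 'NA NA' ending each segment.
    """
    segments, points = [], []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header line with sequence names
        fields = line.split()
        if len(fields) != 2:
            continue
        if fields[0] == "NA":
            if len(points) >= 2:
                segments.append((points[0], points[-1]))
            points = []
        else:
            points.append((int(fields[0]), int(fields[1])))
    if len(points) >= 2:  # file did not end with a separator
        segments.append((points[0], points[-1]))
    return segments


def overlaps_mask(segment, mask_regions):
    """True if the segment's target span overlaps any masked interval."""
    (t_start, _), (t_end, _) = segment
    lo, hi = min(t_start, t_end), max(t_start, t_end)
    return any(lo < m_end and m_start < hi for m_start, m_end in mask_regions)
```

Counting how many off-diagonal segments overlap a masked interval would give a rough numeric counterpart to the plot.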
Based on the identity line, it looks to me like either the first or fourth option looks best. Any significant differences in runtime?
I didn't do any exact measurement of run time but both finished under a minute with the Y chr, so there is not a huge difference in run time that I can see.
There is also an option to mask out the identity line in the middle - which we may want to do to assist with generating the reference mask file. From http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html:
Specify the --notrivial option. This performs the full computation on both copies, but doesn't report the trivial self-alignment block along the main diagonal (Figure 3(b)).
My sense is this is close to what we want:
C. Setting --exact=50:
Hi Melissa,
I just tried out the --notrivial flag that you suggested, my command is:
lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=50 --notrivial --format=rdotplot
However, I found that the result is the same with and without the --notrivial flag (the output files are the same). Not sure if the --notrivial flag works.
For now, I will focus on verifying that the masked coordinates I obtain from the script visually match the multi-mapped regions in the plot.
Thanks, Tanya
lastz chrY.fa --self --notransition --ambiguous=iupac --nogapped --nomirror --step=10 --exact=50 --format=rdotplot
My take is that the regions that are identified as "multimapped" visually match.
I have all the scripts on my local server. I will write up a pipeline showing how I obtain these coordinates next, and will put it up on GitHub.
This looks awesome! Yes, I think this is exactly what we're looking for.
Having a pipeline will be great, to allow users to use any reference genome they want. Great work!
Hi Melissa,
Thanks for confirming that this is what we want. I will probably be able to get the pipeline up and the masked regions by the end of today.
Thank you, Tanya!
All of the lastZ analyses can be found at Files/lastZ
The main bash script is called generate_reference_genome_masks_pipeline.sh, and it calls other scripts that can be found in the scripts folder.
The folder generate_chrY_masks contains the output from running generate_reference_genome_masks_pipeline.sh.
The main output is window_query_out_bufferRegion50000.txt. The columns labeled "target_start" and "target_end" give the regions that fall within the 10kb window ("win_start" and "win_end"). The columns labeled "query_start" and "query_end" give the other locations these regions align to, which is what marks them as multi-mapped.
I think that the coordinates for "target_start" and "target_end" are what we want for the masked regions.
TODO:
I have several updates/comments.
First, in terms of getting the multimapped coordinates out of lastZ output, Melissa and I last week decided on the strategy of scanning along the genome in Xkb non-overlapping windows (I chose 10kb for now), then obtaining the coordinates where the target coordinates fall into that window but the query coordinates do not fall into that window (which is indicative of multimapping).
For example, let's say my window is
chrY 0 10000
Target coordinates:
500 600
8000 8100
9000 9500
Query coordinates:
550 650
10200 10300
30000 30500
Then, we would call the regions (8000, 8100) and (9000, 9500) to be multimapped.
I thought that I should also allow for some more flexibility. What I mean is I would allow for a flanking region of a certain size. For example, if I allow my flanking region to be 5000, then, any query coordinates that fall within (0, 15000) would not be considered multimapped. Thus, only the region (9000, 9500) would be called multimapped because it also mapped to (30000, 30500) which is outside of (0, 15000).
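The rule described above can be sketched as follows. This is only an illustration, not the pipeline script itself: the function name is hypothetical, target and query intervals are assumed to be paired in order, and an interval is assumed to count as "inside" only when it is fully contained.

```python
def multimapped_regions(win_start, win_end, alignments, flank=0):
    """Return target intervals in the window whose query falls outside it.

    alignments: list of ((t_start, t_end), (q_start, q_end)) pairs.
    flank: extra distance on each side of the window within which a
           query hit is still treated as local (not multimapped).
    """
    lo = max(0, win_start - flank)  # do not extend below coordinate 0
    hi = win_end + flank
    called = []
    for (t_start, t_end), (q_start, q_end) in alignments:
        target_in_window = win_start <= t_start and t_end <= win_end
        query_is_local = lo <= q_start and q_end <= hi
        if target_in_window and not query_is_local:
            called.append((t_start, t_end))
    return called


example = [
    ((500, 600), (550, 650)),
    ((8000, 8100), (10200, 10300)),
    ((9000, 9500), (30000, 30500)),
]
print(multimapped_regions(0, 10000, example))              # [(8000, 8100), (9000, 9500)]
print(multimapped_regions(0, 10000, example, flank=5000))  # [(9000, 9500)]
```

With no flank, both (8000, 8100) and (9000, 9500) are called; with a 5000 bp flank, only (9000, 9500) remains, matching the example in the text.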
Using this approach on chrY, with 10kb non-overlapping windows and 50kb flanking, I calculated the total number of bp called "multimapped": 1926926 bp are multimapped, compared to 451488 bp that are not. Does the number of bp that are not "multimapped" seem low?
I can run the script with different parameters, and see how it changes.
All the output can be found in XYalign/Files/lastZ/generate_chrY_masks/
If anyone wants to test out masking these regions with XYalign, the file to use is toMaskRegions_sorted_merged.bed.
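As the file name suggests, the called regions are sorted and merged before being written out as BED. A minimal sketch of that step is below; the function names are hypothetical, and merging intervals that merely touch (end equals next start) is a choice made here, not something confirmed by the pipeline.

```python
def merge_intervals(intervals):
    """Sort (start, end) intervals and merge overlapping or touching ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def to_bed(chrom, intervals):
    """Format intervals as tab-separated BED lines (chrom, start, end)."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for start, end in intervals)


regions = [(9000, 9500), (8000, 8100), (8050, 8200)]
print(to_bed("chrY", merge_intervals(regions)), end="")
```

Here the two overlapping regions starting at 8000 and 8050 collapse into a single (8000, 8200) entry.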
Thank you for this, @tnphung!
There are about 10Mb of ampliconic regions on the human Y chromosome: http://www.nature.com/nature/journal/v423/n6942/fig_tab/nature01722_F4.html
Rough estimates from Skaletsky et al (2003) are:
Region | Mb | Number of Y-linked coding genes |
---|---|---|
X-transposed | 3.4 | 2 |
X-degenerate | 8.6 | 16 |
Ampliconic | 10.2 | 60 (9 families) |
What I'm confused about right now is why the total nucleotide length is so small. The total should be about 20Mb (see the table above, which doesn't include the 2.7Mb PAR1 and 320kb PAR2), but the sum of what you report here seems to be around 2Mb. Am I missing something?
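To make the size of the gap concrete, here is the arithmetic using only the numbers quoted in this thread (the three Mb values from the table and the two bp totals reported earlier). The variable names are just for illustration.

```python
# Mb values from the three table rows above.
table_mb = [3.4, 8.6, 10.2]

# bp totals reported earlier in the thread.
multimapped_bp = 1926926
not_multimapped_bp = 451488

expected_mb = sum(table_mb)
reported_mb = (multimapped_bp + not_multimapped_bp) / 1e6

print(f"expected ~{expected_mb:.1f} Mb, reported ~{reported_mb:.2f} Mb")
# expected ~22.2 Mb, reported ~2.38 Mb
```

So the reported windows cover roughly a tenth of the sequence listed in the table.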
Also for reference, the pdf of Skaletsky et al (2003):
We can approximate the palindromic/ampliconic regions from the last figure, as a sanity check. I haven't been able to find BED/coordinates of these regions.
Is this issue ready to be closed, @tnphung?
Hi Melissa,
Yes, you can close this issue. I still need to upload the masks for the other chromosomes (right now in the masks folder, there are chr19, X, and Y for hg38). It's running right now so I will upload them as soon as they finish running.
Thanks!
Super!
Based on self-self chromosome lastZ alignments, identify regions to mask out when assessing depth coverage (not to be used for masking during the alignment step - we want reads to be able to align anywhere). We want to mask out regions in the reference genome that are expected to affect estimates of chromosome-wide depth.
We need this for X, Y, and all autosomes (chr19 will be the default reference autosome), for hg38/GRCh38 reference genome.
To extend this, we will generate these masks for hg19/GRCh37.
A nice pipeline for running lastZ and inferring these will also be useful and make the program more generalizable to any new reference genome build that comes out.