estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0

get the coordinate of synteny block #16

Closed lijing28101 closed 2 years ago

lijing28101 commented 3 years ago

Hi,

I'm working on comparing whole-genome sequences of different maize lines. The genome is large, ~2 GB per line, and Mummer took a long time for each pairwise comparison. I found that chromeister is very fast and I want to apply it to all my comparisons. I need an output format similar to the output of show-coords from Mummer. The output should include the start and end positions on the query and target. However, running gecko on the chromeister output only gave me coordinates on the plot's X-Y axes. Could you help me figure out how to get coordinates relative to each chromosome?

Best, Jing

estebanpw commented 3 years ago

Hello @lijing28101

Nice to hear that you are using Chromeister!

I have just pushed an update to the gecko repository (and updated the information in chromeister README) where coordinates are included when you extract alignments. I think this will be helpful for what you are trying to achieve.

When you run gecko from the output of chromeister, you will get a csv file which already contains the coordinates of each alignment, e.g.:

Type,xStart,yStart,xEnd,yEnd,strand(f/r),block,length,score,ident,similarity,%ident,SeqX,SeqY
Frag,10501365,169863604,10501485,169863724,f,0,121,292,97,60.33,0.80,0,0
Frag,10501365,169863600,10501485,169863720,f,0,121,324,101,66.94,0.83,0,0
Frag,10417407,169989214,10417776,169988845,r,0,370,920,300,62.16,0.81,0,0
Frag,10437686,169985195,10437886,169984995,r,0,201,564,171,70.15,0.85,0,0
Frag,10534666,169927933,10535652,169926947,r,0,987,3452,925,87.44,0.94,0,0
[...]

The second column is xStart (start coordinate on the query), the third column is yStart (start coordinate on the reference), and the fourth and fifth columns are the corresponding end coordinates (xEnd and yEnd).
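For anyone who wants to post-process this csv into something closer to a show-coords style table, here is a minimal Python sketch. It is not part of gecko or chromeister. It only assumes the 14-column layout shown in the header above, and it keeps rows whose first field is Frag so that any header or metadata lines are skipped. The `Frag` tuple and `read_frags` function are hypothetical names made up for this example.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Frag(NamedTuple):
    x_start: int
    y_start: int
    x_end: int
    y_end: int
    strand: str
    length: int
    seq_x: str
    seq_y: str


def read_frags(handle) -> Iterator[Frag]:
    """Yield one Frag per 'Frag,...' row; header and other lines are skipped."""
    for row in csv.reader(handle):
        if len(row) < 14 or row[0] != "Frag":
            continue
        yield Frag(int(row[1]), int(row[2]), int(row[3]), int(row[4]),
                   row[5], int(row[7]), row[12], row[13])


# Two rows copied from the example output above.
sample = """Type,xStart,yStart,xEnd,yEnd,strand(f/r),block,length,score,ident,similarity,%ident,SeqX,SeqY
Frag,10501365,169863604,10501485,169863724,f,0,121,292,97,60.33,0.80,0,0
Frag,10417407,169989214,10417776,169988845,r,0,370,920,300,62.16,0.81,0,0
"""

frags = list(read_frags(io.StringIO(sample)))
for f in frags:
    # On reverse-strand rows yStart is larger than yEnd, so order the pair
    # before treating it as an interval.
    y_lo, y_hi = sorted((f.y_start, f.y_end))
    sign = "+" if f.strand == "f" else "-"
    print(f.seq_x, f.x_start, f.x_end, f.seq_y, y_lo, y_hi, sign, sep="\t")
```

To read a real file, replace the `io.StringIO(sample)` handle with `open(path, newline="")`.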

If additionally you need the alignments and their coordinates, just add the keyword alignments in your gecko execution like this:

bin/guidefastas.sh query.fasta ref.fasta hits-XY-dotplot.mat.hits 1000 100 60 32 alignments

This will generate an alignments file containing the alignments and their coordinates, such as:

AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAGAAAGAAAA
||||||||||||||||||||||||||||||||||| | |||  ||  || ||| ||| |||||||||||||||||||||||||||||||||| | | | ||
AAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAAGAAAGAAGGAAGGAAGGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA
@ FORWARD STRAND x1: 10501385 y1: 169863509 x2: 10501485 y2: 169863609 Identity: 88/101 (87.1287%)
TTTTCCCATTGATTAATATTTTTCCTGTTGAGCAGATGAGAGAAAGCCAAAAAAAGCACAGCTGGGCCATTTCCCCTCACTGGGAACGTCATTTCCAGGCACTTTGTGCTTACTTGAT
|||||||||||||| ||||||||| | |||| | |||||||||||||||||||||||||||||||||||||||| |||||||  |||||||||||||| |||| |||||| |||||||
TTTTCCCATTGATTGATATTTTTCTTATTGAACTGATGAGAGAAAGCCAAAAAAAGCACAGCTGGGCCATTTCCTCTCACTGTAAACGTCATTTCCAGTCACTCTGTGCTCACTTGAT
@ REVERSE STRAND x1: 10451224 y1: 169981322 x2: 10451341 y2: 169981205 Identity: 107/118 (90.678%)

Notice that the coordinates are referred to as x1: 10501385 y1: 169863509 x2: 10501485 y2: 169863609.
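If you want to pull these coordinates out of the alignments file programmatically, a regular expression over the lines starting with @ is probably enough. The sketch below is only an illustration based on the two example lines shown above. The pattern and the `parse_coord_line` helper are my own assumptions, not something gecko provides.

```python
import re

# Pattern inferred from the example lines above, e.g.
# "@ FORWARD STRAND x1: ... y1: ... x2: ... y2: ... Identity: 88/101 (87.1287%)"
COORD_LINE = re.compile(
    r"^@ (FORWARD|REVERSE) STRAND "
    r"x1: (\d+) y1: (\d+) x2: (\d+) y2: (\d+) "
    r"Identity: (\d+)/(\d+)"
)


def parse_coord_line(line):
    """Return a dict of coordinates for an '@ ...' line, or None for other lines."""
    m = COORD_LINE.match(line)
    if m is None:
        return None
    strand, x1, y1, x2, y2, matches, length = m.groups()
    return {
        "strand": "f" if strand == "FORWARD" else "r",
        "x1": int(x1), "y1": int(y1),
        "x2": int(x2), "y2": int(y2),
        "matches": int(matches), "length": int(length),
    }


line = ("@ FORWARD STRAND x1: 10501385 y1: 169863509 "
        "x2: 10501485 y2: 169863609 Identity: 88/101 (87.1287%)")
record = parse_coord_line(line)
print(record)
```

Sequence and match lines do not start with @, so the helper returns None for them and they can be filtered out.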

Let me know if this helps you. Also if you use it, remember to run git pull origin in your gecko repository within the inmemory_guided_chrom branch.

Best regards, Esteban

lijing28101 commented 3 years ago

Hi Esteban,

I've tried the new version of gecko, but the result is still not what I want. The coordinates in the csv for a whole-genome comparison are cumulative, not the real coordinates within each chromosome. For example, if chr1 is 1-10000, then chr2 is 10000-20000, chr3 is 20000-30000, and so on. But I want the coordinates of each block to be relative to its own chromosome. Furthermore, the syntenic blocks reported for chr1 do not start at the beginning of the chromosome. I tested on two maize lines, and the first blocks are:

#chromeister output
Type,xStart,yStart,xEnd,yEnd,strand(f/r),block,length,score,ident,similarity,%ident,SeqX,SeqY
Frag,1098106,2069163,1099226,2070283,f,0,1121,4076,1070,90.90,0.95,_chr1,_chr1
Frag,1102117,2067025,1102307,2067215,f,0,191,596,170,78.01,0.89,_chr1,_chr1
Frag,1102517,2067414,1103418,2068315,f,0,902,2560,771,70.95,0.85,_chr1,_chr1

However, when I tried mummer4, it found synteny blocks starting from the beginning:

#mummer4 output
chr1    1       1867    chr1    10038   11881   95.45   -
chr1    1       1170    chr1    14911   16057   93.00   -
chr1    1       1819    chr1    273654165       273655922       85.96   +
chr1    1       3628    chr3    892015  895591  87.05   +
chr1    1       2471    chr5    199334512       199336939       91.96   -
chr1    1       1582    chr5    200550730       200552286       94.26   -
chr1    1       7776    chr5    201082035       201089629       89.17   -
chr1    1       7768    chr5    201256076       201263686       91.20   +
chr1    1       3970    chr5    201438340       201442219       93.36   -
chr1    1       13056   chr5    201538604       201551437       90.03   +

Best, Jing

estebanpw commented 3 years ago

Hello @lijing28101

Thank you for your feedback. I have added (and changed) functionality to the chromeister/gecko pipeline in order to achieve what you are asking.

First, remember to update your gecko repository within the inmemory_guided_chrom branch.

Second, to get coordinates that are relative to each chromosome and sorted, you can now run the guidefastas script like this:

bin/guidefastas.sh querySeqs.fasta refSeqs.fasta hits-XY-dotplot.mat.hits 1000 100 60 32 --local

(Of course, remember to change your dimension/length/similarity/wordsize parameters accordingly.)

This will both change the coordinates from cumulative (global) to local, relative to each sequence, and sort them first by sequence and then by coordinate, such as:

Frag,18304,910588,18370,910522,r,0,67,228,62,85.07,0.93,1,3
Frag,18376,910508,19135,909749,r,0,760,2496,692,82.11,0.91,1,3
Frag,1,475077,476,474602,r,0,476,1888,474,99.16,1.00,1,4
Frag,2485,472593,7003,468075,r,0,4519,17956,4504,99.34,1.00,1,4
Frag,6982,468128,7228,467882,r,0,247,756,218,76.52,0.88,1,4
Frag,7184,467927,7505,467606,r,0,322,1184,309,91.93,0.96,1,4
Frag,9256,465864,9326,465794,r,0,71,276,70,97.18,0.99,1,4

Notice that the third alignment starts at position 1 in the 1,4 comparison.
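To illustrate the idea behind the cumulative-to-local conversion (not the actual gecko code), here is a small Python sketch. It assumes the sequences of a multi-FASTA file are laid end to end with no padding and uses 0-based positions. Gecko's real offset convention may differ, so treat the arithmetic as an assumption. The lengths and coordinates are invented for the example, and `build_offsets` and `to_local` are hypothetical helpers.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(lengths):
    """offsets[i] is the cumulative position at which sequence i begins."""
    return [0] + list(accumulate(lengths))[:-1]


def to_local(pos, offsets):
    """Map a cumulative position to (sequence index, position within it)."""
    idx = bisect_right(offsets, pos) - 1
    return idx, pos - offsets[idx]


# Hypothetical sequence lengths: three query sequences, two reference sequences.
x_offsets = build_offsets([10000, 10000, 10000])
y_offsets = build_offsets([15000, 5000])

# (xStart, yStart) pairs in cumulative coordinates, made up for illustration.
cumulative = [(20500, 16000), (12345, 300), (10010, 15020)]

local = []
for x, y in cumulative:
    seq_x, local_x = to_local(x, x_offsets)
    seq_y, local_y = to_local(y, y_offsets)
    local.append((seq_x, seq_y, local_x, local_y))

# Sort by sequence pair first, then by local coordinates.
local.sort()
for row in local:
    print(*row, sep=",")
```

The binary search keeps each lookup cheap even when a genome has many scaffolds, and sorting the tuples gives the "by sequence, then by coordinate" order described above.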

Also, if you would rather have the sequence names in the SeqX,SeqY columns instead of numbers such as 1,4, run it like this instead:

bin/guidefastas.sh HOMSA.Chr.X.fasta MUSMU.Chr.X.fasta hits-XY-dotplot.mat.hits 1000 100 60 32 --local --names

Finally, even if you use --local, two csv files will be generated: the original csv, which still has the cumulative coordinates, and a second csv called *.localsorted.csv. The second one is the one you want. The idea behind keeping both is that you can still take your regular csv and upload it into our visualizer in order to play interactively with the alignments.

Hope this helps. Also, since these are new changes, please report any bugs you find.

Best, Esteban