bcgsc / xmatchview

🗻 Visualization of genome/gene sequence synteny
GNU General Public License v3.0
36 stars 7 forks source link

Alignment order #28

Closed asan-emirsaleh closed 1 year ago

asan-emirsaleh commented 1 year ago

Hello! Really great tool. For me, the hive plot function appears to be very useful. But I encountered with an issue when aligned sequences as you have described in your decumentation. I did it this way:

-x alignment file [1 vs. 2] (cross_match .rep or Pairwise mApping Format .paf)
-y alignment file [1 vs. 3] (cross_match .rep or Pairwise mApping Format .paf)
-z alignment file [3 vs. 2] (cross_match .rep or Pairwise mApping Format .paf)

the plot was blank. But when I have changed the places of the query and target (e.i. 2 -> 1, 3 -> 1, 2 -> 3), I did obtain the plot. But I cannot judge is it correct. So the issue is that the order you have recommended didn't succeed, but vice versa, the opposite did. I have used the lastz aligner for .paf to obtain because of its perfomance and accuracy on closely related datasets (plastome belonged to the same plantae tribe).

Best regards Asan

warrenlr commented 1 year ago

thank you for your report and your interest in our tool, @asan-emirsaleh

This is curious, perhaps you could indicate, for the file order that didn't work, the various info in 1, 2 and 3 for -q, -r, -s and the first alignment line in -x, -y , -z ? Also, which version of the tool dis you use..a copy of the log would also be useful to help debug.

If you look at the test-hive folder, you'll see an example (SARS) that works for the order prescribed; perhaps you could also inspect the files there? One thing to also keep in mind is the query vs. target order in the PAF file itself:

https://github.com/lh3/miniasm/blob/master/PAF.md

where the first column is the name of the query (e.g., 1) and the 6th is the target (e.g., 2) -- perhaps those are reversed in your lastz command/output?

Thank you, Rene

asan-emirsaleh commented 1 year ago

Thank you for your response. I used the recent version of xmatchview, with it being git-cloned this week. The fasta files used are biopython-produced fragments of plastome sequences, where each id equals the file basename and name used in config files. I have a set of several comparisons, where I performed both direct and reciprocal alignements (e.g. first_2_second and second_2_first). I tried to use both lastz and minimap2 of recent versions, the result does remain the same. I made some python script in jupyter nb to quickly assess the result. Only the reciprocal alignements succeeded with a plot, using the direct ones led to the blank plot without any alignement. As for the paf file content, you are right, in my .paf files produced both via lastz and minimap2, the first column is a target.

Here is my code that succeeded - https://github.com/asan-emirsaleh/scripts-on-Genomics/blob/main/hive-plot.py I used the target and query options in a proper order in aligner parameters. I cannot understand what a typo I might have done.

Best regards Asan

asan-emirsaleh commented 1 year ago

Hello @warrenlr Could you please write your vision on that issue? It is still not clear for me 1) does the script I used produce a correct result? 2) why the script didn't work with sequences aligned in direct order (not the reciprocal)? Is there an issue in my script, for example in the way I use aligner?

warrenlr commented 1 year ago

not easy for me to determine by looking at your script. for instance, this code block:

plot = subprocess.run(["python", app_hive, "-q", first + ".txt", "-r", second + ".txt", "-s", third + ".txt", "-x", second + "2" + first + ".paf", "-y", third + "2" + first + ".paf", "-z", second + "2" + third + ".paf",

is the order that works or the one that doesn't? If the latter, makes sense since the order should be 1vs2 1vs3 and 3vs2 for x,y,z. If the former, it is possible that the alignments provided are switched, but it is not possible for me to confirm without taking a look at your files, or at least the first few lines of each. You can send me an email at rwarren@bcgsc.ca if there are sensitivities around sharing your data for troubleshooting this.

I also re-cloned the repo and re-verified the order by re-aligning the SARS genomes (using minimap2) and re-plotting as per the test-hive example, and the order checks out.

  1. Run minimap2 between each pair of sequences. 1 vs 2 minimap2 SARS.fa SARS2.fa > SARS2-vs-SARS.paf 1 vs 3 minimap2 MERS.fa SARS2.fa > SARS2-vs-MERS.paf 3 vs 2 minimap2 SARS.fa MERS.fa > MERS-vs-SARS.paf # (base) [rwarren@hpce704 test-hive]$ cat *paf SARS-CoV-2 30473 13996 20902 + MERS-CoV 30033 13926 20805 60 6906 17 tp:A:P cm:i:4 s1:i:55 s2:i:0 dv:f:0.0267 rl:i:0 SARS-CoV-2 30473 62 29870 + SARS-CoV 29744 27 29695 3880 29837 60 tp:A:P cm:i:423 s1:i:3841 s2:i:0 dv:f:0.0003 rl:i:0 # (base) [rwarren@hpce704 test-hive]$ cat config.txt 1:SARS-CoV-2 2:SARS-CoV 3:MERS-CoV # (base) [rwarren@hpce704 test-hive]$ cat SARS2.txt SARS.txt MERS.txt SARS-CoV-2:30473 SARS-CoV:29751 MERS-CoV:30033 #
  2. Make a 3-way SVG hive synteny plot from

../xmatchview-hive.py -q SARS2.txt -r SARS.txt -s MERS.txt -x SARS2-vs-SARS.paf -y SARS2-vs-MERS.paf -z MERS-vs-SARS.paf -e SARScds.gff -i 0 -b 1 -c 50 -a 0.75

Please take a second look at your files to ensure the order is consistent. I am also happy to take a look at your files if you'd like to share them. Rene

warrenlr commented 1 year ago

in your script:

    aln = subprocess.run(["minimap2",
                          "-x", "asm20",
                          prefix + target + ".fna",
                          prefix + query + ".fna",
                          "-N200",
                          "-p0.0001",
                          "-o", target + "_2_" + query + ".paf"],

The output should instead read:

                          "-o", query + "_2_" + target + ".paf"],

such that:

                       "-x", second + "_2_" + first + ".paf",
                       "-y", third + "_2_" + first + ".paf",
                       "-z", second + "_2_" + third + ".paf",

instead reads:

                       "-x", first + "_2_" + second + ".paf",
                       "-y", first + "_2_" + third + ".paf",
                       "-z", third + "_2_" + second + ".paf",

That said, because of the way you are formatting, you likely thought that the order was backwards, but it really isn't since the respective alignments were properly done by the looks of things.