estebanpw / chromeister

A dotplot generator for large chromosomes
GNU General Public License v3.0
39 stars 4 forks source link

coordinates from the events file #22

Open pjm43 opened 2 years ago

pjm43 commented 2 years ago

Hi Esteban,

I'm interested in pulling the specific chromosomal coordinates identified in the events file after running run_and_plot_chromeister.sh. I believe they are reported globally across the genome (the genome has 21 chromosomes in the fasta) in this file, nor does the file specify the chromosome from which they are derived. What is the best way to covert global coordinate to local chromosomal coordinates for the events identified?

Thanks in advance,

Jeff

pjm43 commented 2 years ago

Sorry, I should be more specific. The events.txt file has multiple descriptions of the syntenic blocks, I'm specifically interested in the rearranged block (i.e., inversions, transposition, etc.) and not the conserved blocks. Thanks, Jeff

estebanpw commented 2 years ago

Hello Jeff!

You are correct in your guess: the coordinates are always global (internally the fasta headers > sequence ... get removed). In case you want these to be local there are two approaches:

Note: The script uses a naive approach which makes it slow if you the fasta files contain many sequences or if the events file contains many events. It is naive because it will check every event with all the lengths of the sequences until one that is bigger is found. This could be easily improved with a binary search, but it should work without further problem if you have e.g. 30 by 30 chromosomes.

Btw: I am no longer able to maintain this repository on a daily basis as I recently switched jobs. Still I will try to help when possible!

Hope this helps, Esteban

pjm43 commented 2 years ago

Esteban,

Thanks so much for you help - very much appreciated - I'll give you script a try now!

A couple more quick questions - is there a way to export the figure as a higher resolution file? I'm looking for something that I could import into illustrator to make manuscript quality figures. Lastly, I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?

I know you've switched jobs and really appreciate you still helping with chromeister.

Jeff

estebanpw commented 2 years ago

Hey there @pjm43 ,

Thanks so much for you help - very much appreciated

Thanks for using CHROMEISTER! :)

I'll give you script a try now!

Let me know if it works alright!

is there a way to export the figure as a higher resolution file?

I think this should be possible. In essence, what CHROMEISTER plots is basically a matrix of 0's and 1's after it has computed some stuff. This is the matrix called dotplot.mat.raw.txt (although the first two lines include the sequence lengths). I have tried something out that might work out for you. First, remove the two length lines from the file by doing:

tail -n +3 dotplot.mat.raw.txt > dotplot.nolen

Then copy this python code:

import matplotlib.pyplot as plt
import numpy as np
import matplotlib
matplotlib.use('Agg')

fig = plt.figure(figsize=(16, 16))
matrix = np.loadtxt('dotplot.nolen')
plt.imshow(matrix, cmap='Greys', interpolation='nearest')
plt.savefig("dotplot.svg")

Note: Remember to change the dotplot.nolen to the name of your comparison file. Note 2: You may need to set up a virtual env and install python packages with pip.

Then you can run the above python script and it will save the matrix to an svg file. If you are happy with the output, you can then add the axes names and titles (check out the matplotlib library for this).

This is the resulting svg for a normal comparison:

dotplot

(Note: you should be able to download the svg by right clicking on it, I am not sure how github will render it).

This is not a very clean solution but it might work for what you need. IMPORTANT: saving dotplots from CHROMEISTER as vector graphics file was not planned originally because the png file should provide with enough detail - especially considering that CHROMEISTER is very, very heuristic. This means that if you make the dotplot of too much resolution then some boundary pixels might show signal when there is none in reality, i.e. CHROMEISTER will give you a good coarse-grain preview, but going into the fine-grain detail might produce some artifacts (remember that every pixel in the dotplot typically corresponds to around 100,000 base pairs in sequences of length 100 MBp!). Always check that the results are correct and that they make sense!

I'm still trying to wrap my head around the level argument. It appears that it should be set to 4, except when using with large plant genome where you indicate that it could be set to 1. What exactly is this argument doing?

I would not give too much though to this parameter, but in essence it controls how many nucleotides are skipped to calculate the hash function. This means that if you input z=4 (the default) only 1 in every 4 nucleotides will be used to compute the hash of a word. If you use z=1, all of them will be used, thus leading to less collisions. Therefore, a low z value is recommended when comparing very large sequences with many repeats (such as plants) to avoid "too many" collisions. But in my experience this affects the output by very little and only in some comparisons, so I think you are probably safe leaving to 4.

Hope it helps! Esteban