conchoecia / odp

oxford dot plots
GNU General Public License v3.0

Problems in finding ancestral linkage groups #27

Closed ArthurSilver closed 1 year ago

ArthurSilver commented 1 year ago

Hi,

odp is a valuable tool for synteny analysis, but I have some questions about finding ancestral linkage groups (ALGs). Specifically, I am trying to identify ALGs in three species (lancelet, chicken, and spotted gar). I have successfully run the odp_nway_rbh script and obtained the rbh file in the step3-unwrap folder. However, I am uncertain about the subsequent steps required to determine the number of ALGs. Does a fixed combination of chromosomes between the three species represent an ALG (like Al Chr5, Gg NC_006092.5, Lo LG7)? And how can I get the Oxford dot plot and ribbon plot from this rbh file?

Looking forward to your reply!

conchoecia commented 1 year ago

Hello!

If you let me know the goals of your study I can maybe make better suggestions to help you get there.

Is your question related to this paper? Simakov, Oleg, et al. "Deeply conserved synteny resolves early events in vertebrate evolution." Nature Ecology & Evolution 4.6 (2020): 820-830. The chordate LGs have already been characterized in this manuscript and there are 17.

If you're trying to recreate these results, then look at the file odp_nway_rbh/step2-groupby/*.rbh.groupby. I am guessing that there will be around 17 LGs. The methods that odp uses to determine ALGs are not the same as the methods used in Simakov et al. 2020, so I do not know the exact number. The file odp_nway_rbh/step3-unwrap/*.filt.unwrapped.rbh contains the linkage groups that have an FDR <= 0.05, unwrapped so that each line represents one 3-way reciprocal best hit between the species (think of it like an OrthoGroup).
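To see how the unwrapped rows relate to chromosome combinations, a minimal sketch like the one below can tally how many 3-way hits fall on each combination. It assumes the unwrapped rbh file is tab-delimited with a header row, and the column names passed in (`Al_scaf`, `Gg_scaf`, `Lo_scaf`) are hypothetical; check the header of your own file. This is not odp's own code.

```python
import csv
from collections import Counter


def count_chrom_combinations(rbh_path, scaf_columns):
    """Count rows per chromosome combination in a tab-delimited rbh file.

    scaf_columns lists the per-species scaffold columns. The names are an
    assumption for illustration; read them from your file's header.
    """
    combos = Counter()
    with open(rbh_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            # One row is one 3-way reciprocal best hit.
            combos[tuple(row[col] for col in scaf_columns)] += 1
    return combos


# Example call with hypothetical column names:
# combos = count_chrom_combinations("my.filt.unwrapped.rbh",
#                                   ["Al_scaf", "Gg_scaf", "Lo_scaf"])
# for combo, n in combos.most_common(10):
#     print(n, combo)
```

Each key in the result is one chromosome combination, and the value is the number of orthologs supporting it.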

ArthurSilver commented 1 year ago

Thanks for the reply. I did try to reproduce the results of Simakov et al. 2020, albeit using a novel B. floridae assembly instead (from "Three amphioxus reference genomes reveal gene and chromosome evolution of chordates"). Does every row in the file odp_nway_rbh/step2-groupby/*.rbh.filt.groupby correspond to a linkage group? There are 86 rows in the groupby file, and if that's the case, the result is significantly greater than 17 LGs. I am also interested in creating a colored Oxford dot plot similar to the one in Simakov et al. 2020, as well as a ribbon plot like the one presented in Simakov et al. 2022. Would you be able to provide guidance on how to achieve these visualizations?

Looking forward to your reply.

conchoecia commented 1 year ago

Thanks for clarifying. Each of the CLGs identified in Simakov et al. 2020 was defined by the identity of specific B. floridae chromosomes compared to the other species you mentioned. According to Supplementary Data Table 6 of that paper, this is how the CLGs were defined:

Supplementary Table 6. Assignment of Amphioxus chromosomes to inferred ancestral chordate linkage groups (and base pair positions of the assignments)
CLGA BFL_1
CLGB BFL_10 + BFL_16 + BFL_18
CLGC BFL_2.1 + BFL_3.1
CLGD BFL_6
CLGE BFL_5
CLGF BFL_7
CLGG BFL_11
CLGH BFL_13
CLGI BFL_4.2
CLGJ BFL_2.2 + BFL_17
CLGK BFL_9
CLGL BFL_15
CLGM BFL_8
CLGN BFL_12
CLGO BFL_4.1 + BFL_19
CLGP BFL_14
CLGQ BFL_3.2
where the segments of amphioxus chromosomes 2, 3, and 4 are defined as
BFL_2.1 BFL_2(1-10,906,430) + BFL_2(15,371,497-23,880,387) + BFL_2(25,960,386-end)
BFL_2.2 BFL_2(10,906,431-15,371,496) + BFL_2(23,880,388-25,960,385)
BFL_3.1 BFL_3(1-16,935,122)
BFL_3.2 BFL_3(16,935,123-end)
BFL_4.1 BFL_4(1-11,474,496)
BFL_4.2 BFL_4(11,474,497-end)
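
The table can be turned into a small lookup that maps a B. floridae chromosome and position to its CLG. The sketch below is only an illustration: the function name `clg_for` is made up, coordinates are assumed to be 1-based and inclusive, and the `BFL_3.2` label for the BFL_3(16,935,123-end) segment is inferred from the CLGQ row. The coordinates refer to the assembly used in Simakov et al. 2020 and would not carry over to a different assembly.

```python
# Segment boundaries copied from the table above (1-based, inclusive; an
# assumption). None means "to the end of the chromosome".
SPLIT_CHROMS = {
    "BFL_2": [
        (1, 10_906_430, "BFL_2.1"),
        (10_906_431, 15_371_496, "BFL_2.2"),
        (15_371_497, 23_880_387, "BFL_2.1"),
        (23_880_388, 25_960_385, "BFL_2.2"),
        (25_960_386, None, "BFL_2.1"),
    ],
    "BFL_3": [
        (1, 16_935_122, "BFL_3.1"),
        (16_935_123, None, "BFL_3.2"),  # label inferred from the CLGQ row
    ],
    "BFL_4": [
        (1, 11_474_496, "BFL_4.1"),
        (11_474_497, None, "BFL_4.2"),
    ],
}

CLG_MEMBERS = {
    "CLGA": ["BFL_1"],
    "CLGB": ["BFL_10", "BFL_16", "BFL_18"],
    "CLGC": ["BFL_2.1", "BFL_3.1"],
    "CLGD": ["BFL_6"],
    "CLGE": ["BFL_5"],
    "CLGF": ["BFL_7"],
    "CLGG": ["BFL_11"],
    "CLGH": ["BFL_13"],
    "CLGI": ["BFL_4.2"],
    "CLGJ": ["BFL_2.2", "BFL_17"],
    "CLGK": ["BFL_9"],
    "CLGL": ["BFL_15"],
    "CLGM": ["BFL_8"],
    "CLGN": ["BFL_12"],
    "CLGO": ["BFL_4.1", "BFL_19"],
    "CLGP": ["BFL_14"],
    "CLGQ": ["BFL_3.2"],
}

# Invert the table: chromosome or segment name -> CLG name.
SEGMENT_TO_CLG = {seg: clg for clg, segs in CLG_MEMBERS.items() for seg in segs}


def clg_for(chrom, pos):
    """Return the CLG for a chromosome and position, or None if unknown."""
    segment = chrom
    for start, end, name in SPLIT_CHROMS.get(chrom, []):
        if pos >= start and (end is None or pos <= end):
            segment = name
            break
    return SEGMENT_TO_CLG.get(segment)
```

For example, `clg_for("BFL_2", 12_000_000)` falls in the BFL_2.2 segment and so resolves to CLGJ.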

The way that odp_nway_rbh currently determines linkage groups is by looking for combinations of chromosomes that appear more often than expected by random chance (given the gene quantity and chromosomes of the 3 species being compared). The count column in odp_nway_rbh/step2-groupby/*.rbh.filt.groupby is the number of 3-way orthologs shared between that specific chromosome combination, and the alpha column is the false discovery rate of finding a linkage group that large.
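
As a rough illustration of how the count and alpha columns can be used, the sketch below keeps the rows at or below an alpha cutoff and returns the largest ones by count. It assumes the groupby file is tab-delimited with a header row containing `count` and `alpha` (the two columns named above); the function name `largest_groups` and its defaults are made up for this example and are not part of odp.

```python
import csv


def largest_groups(groupby_path, n=30, max_alpha=0.05):
    """Return up to n rows with alpha <= max_alpha, largest count first.

    Assumes a tab-delimited file with 'count' and 'alpha' columns.
    """
    with open(groupby_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    kept = [row for row in rows if float(row["alpha"]) <= max_alpha]
    # float() first so counts written as "12.0" are still accepted.
    kept.sort(key=lambda row: float(row["count"]), reverse=True)
    return kept[:n]


# Example call:
# for row in largest_groups("my.rbh.filt.groupby", n=30):
#     print(row["count"], row["alpha"])
```

Since the .filt.groupby file may already be filtered by alpha, the cutoff here mostly matters if you start from an unfiltered table.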

If you compare the largest 30 or so rows in this spreadsheet you will see that they likely correspond to the CLGs described in Simakov et al 2020. The ALGs defined in the odp spreadsheet will, by definition, contain fewer orthologs than the Simakov et al 2020 set because of odp's strict method of finding linkage groups that I described above.

If you wish to make Oxford dot plots, you can follow these instructions. Right now the only curated ALG sets that will automatically color your plots are a stricter subset of the BCnS ALGs from Simakov et al. 2022 and the pre-metazoan LGs from Schultz et al. 2023. I recommend looking at the plots generated by this pipeline, which will be located in odp/step2-figures/synteny_coloredby_*, and relating the LGs there to your question.

If you would like to make ribbon diagrams, the only way supported currently with odp is with the rbh_to_subway script, which works by feeding it a list of results from the normal odp pipeline I mentioned in the paragraph above, and specifying the species order of the figure.

If you would like to define your own set of LGs as you would find in the LG_db directory, that is currently undocumented. If you would like there to be a release of the Simakov et al 2020 CLGs that can be used for plotting, let us know.

conchoecia commented 1 year ago

Hello, so in the last few pushes I have included an implementation of the chordate linkage groups (warning: it will take a long time to make and to run). I have also included instructions on how to make a ribbon diagram.

  1. First, I recommend doing a git pull to update the repo.
  2. Set up the CLGs to work with odp
  3. Run odp
  4. Make a ribbon diagram

I think this addresses the issues you brought up. I'll leave this issue open for about two weeks, then I will close it. Let me know in this thread if you run into other related problems.

ArthurSilver commented 1 year ago

Thanks a lot and I'll give it a try.

ArthurSilver commented 1 year ago

Hello!

I was able to successfully run odp with the CLGs, but running odp_rbh_to_ribbon failed.

Here's the error message:

Error in rule make_plot:
    jobid: 0
    input: /public/home/project/amphioxus/synteny/odp/odp/step2-figures/synteny_coloredby_CLG_v1.0/Al_Lo_xy_reciprocal_best_hits.coloredby_CLG_v1.0.plotted.rbh, sp_to_chr_to_size.tsv
    output: output.pdf

RuleException:
NameError in file /public/home/biosoft/odp/scripts/odp_rbh_to_ribbon, line 376:
name 'sns' is not defined
  File "/public/home/biosoft/odp/scripts/odp_rbh_to_ribbon", line 651, in __rule_make_plot
  File "/public/home/biosoft/odp/scripts/odp_rbh_to_ribbon", line 376, in ribbon_plot
  File "/public/home/miniconda3/envs/odp/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

And here is my config file:

# There are several options for how to sort the chromosomes.
# More information is available in the config file.
chr_sort_order: optimal-chr-or 

# Tells the program whether to plot the non-significant interactions.
plot_all: True

# Specifies which species will be plotted from the top-to-bottom.
species_order:
  - Al
  - Lo

rbh_directory: /public/home/project/amphioxus/synteny/odp/odp/step2-figures/synteny_coloredby_CLG_v1.0/

# Only two species are shown here for brevity,
#  but please include the species information for all the species you wish to plot.
species:
  Al:
    proteins: /public/home/project/amphioxus/synteny/odp/prot/Alref.prot.fa
    chrom: /public/home/project/amphioxus/synteny/odp/chrom/Alref.chrom
    genome: /public/home/project/amphioxus/Refer/al_canu_ref_patch_HIC.tidy.masked.rename.fasta
    minscafsize: 20000000  # Only plots scaffolds that are 20 Mbp or longer
  Lo:
    proteins: /public/home/project/amphioxus/synteny/odp/prot/Lo.prot.fa
    chrom: /public/home/project/amphioxus/synteny/odp/chrom/Lo.chrom
    genome: /public/home/project/amphioxus/synteny/odp/genome/Lepisosteus_oculatus.LepOcu1.dna_sm.toplevel.fa
    minscafsize: 1000000  # Only plots scaffolds that are 1 Mbp or larger

I don't think there's any issue with my configuration or input files, so I'm not sure how to resolve the problem.

There is also a small problem in odp's output: if the table is too long, the legend overlaps it.

Looking forward to your reply!

conchoecia commented 1 year ago

If you go to the install directory for odp and run git pull, that should update the code. I pushed a fix, so it should work now if you run odp_rbh_to_ribbon again.
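
For readers hitting a similar message: the thread does not say what the fix was, but a NameError like `name 'sns' is not defined` generally means the code used a name that was never bound in that scope, for example a module alias that was not imported. The sketch below is a generic illustration, not odp code. The stand-in function `ribbon_plot_sketch` and the helper `explain_missing_name` are hypothetical. It shows that the error only appears when the function is called, and reports whether the module is installed at all.

```python
import importlib.util


def explain_missing_name(func, module_name):
    """Call func; on NameError, report whether module_name is installed."""
    try:
        func()
    except NameError as err:
        # find_spec returns None when a top-level module cannot be found.
        installed = importlib.util.find_spec(module_name) is not None
        return f"{err}; module '{module_name}' installed: {installed}"
    return "ok"


def ribbon_plot_sketch():
    # Hypothetical stand-in, not odp code: 'sns' is used but never imported,
    # so the NameError is raised only when this line actually runs.
    return sns.color_palette()


report = explain_missing_name(ribbon_plot_sketch, "seaborn")
print(report)
```

If the module is reported as installed, the problem lies in the script itself (as it apparently did here), and updating the code is the right step rather than reinstalling the environment.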

Yes, the problem is that the PDF is a set size, but sometimes the table is too long. This is an open issue (https://github.com/conchoecia/odp/issues/14) if you have some code edits to fix this problem. I recommend opening the PDF in Adobe Illustrator or another vector image editor (Inkscape is one), to get the table if you need it for something.

ArthurSilver commented 1 year ago

Thanks again, it finally worked! I really appreciate your help.

conchoecia commented 1 year ago

Glad it worked for you and thanks for posting about the issues! I will work on making the experience more streamlined. Stay tuned for updates.