conchoecia / odp

oxford dot plots
GNU General Public License v3.0
132 stars 10 forks source link

ALG with odp_nway_rbh fails with blast #21

Closed tauanajc closed 1 year ago

tauanajc commented 1 year ago

Dear Darrin,

I contact Oleg about doing synteny analyses similar to the 2022 metazoan paper and he told me about odp. It looks like a fantastic tool and I'm excited to be trying it.

I'm trying the second use case to get ALGs from 3 NCBI genomes, the lancelet, scallop and jellyfish. My first attempt was with diamond, but there was an error pointing to line 242 of the odp_functions.py script. I fixed it by replacing illegal = set("prot_to_loc", "prot-to-loc") with illegal = set().

Steps 1 and 2 then ran fine! However, the list of orthogroups in the reciprocal_best_hits.rbh.groupby was relatively short compared to the metazoan paper (2648 groups, instead of >6000). I decided to try blast instead of diamond, and also to increase the number of permutations (there was no example in the github, so I originally used num_permutations: 1000000, and now increased to 100000000). Running with blast quickly fails. The end of the log file is below, and I attach the config and log files:

Waiting at most 5 seconds for missing files. MissingOutputException in rule diamond_blast_x_to_y in file /home/FM/tcunha/scripts/odp/scripts/odp_nway_rbh, line 165: Job 9 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait: odp_nway_rbh/step0-blastp_results/Bfloridae_against_Myessoensis.blastp Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: .snakemake/log/2023-04-26T163218.427529.snakemake.log Here is an example of the first few analyses: [['Bfloridae', 'Hvulgaris', 'Myessoensis']] There are 1 possible combinations.

step1.log config.yaml.txt

I would love to have it running on blast, but also any ideas about how to reproduce the >6000 groups would be appreciated! Thanks!

conchoecia commented 1 year ago

Hi Tauana,

Is your only goal to recreate the analyses of Simakov et al 2022? If I know what your goals are maybe I can better answer your question.

tauanajc commented 1 year ago

Thanks for the quick reply Darrin! I actually have a new genome that I would like to add to the analyses. It has only 4 chromosomes, so my ultimate goal is to have a ribbon plot (let's say my species + scallop, lancelet, hydra, perhaps Celegans), and a dotplot of my species against the BLG (like the nematodes in Figure S10 of the 2022 paper), so that I can understand how the ancestral bilaterian linkage groups have evolved to form these 4 chromosomes.

tauanajc commented 1 year ago

Hi again Darrin, I run the macrosynteny use case and that actually gave me most of what I wanted! I thought I had to infer the ALG from scratch, but using that example gave me the dotplots against the BCnS_LGs (bilaterian linkage groups correct?) as desired. One issue is that I did set up the config.yaml parameters for genus, species, and plotorder, but it doesn't seem to have had any effect: the plots still have the abbreviation names and all of the scaffolds (instead of the 4 I specified) in some other arrangement. Any ideas of how to fix that?

Command: snakemake --cores 16 --snakefile ../../scripts/odp/scripts/odp > minimal-ppacificus.log 2>&1 & config.yaml.txt

Is there another parameter for the config file that would generate ribbon plots?

Thanks!

conchoecia commented 1 year ago

I thought I had to infer the ALG from scratch, but using that example gave me the dotplots against the BCnS_LGs (bilaterian linkage groups correct?) as desired.

Yes, that's right - the way the software is implemented now there is a set of BCnS ALGs that each analysis with the odp script is compared to using HMMs. This isn't the same exact set of proteins as the Simakov et al. 2022 paper - this set is more restrictive in that I defined it using strict reciprocal best orthologs from more species.

One issue is that I did set up the config.yaml parameters for genus, species, and plotorder, but it doesn't seem to have had any effect: the plots still have the abbreviation names and all of the scaffolds (instead of the 4 I specified) in some other arrangement. Any ideas of how to fix that?

I think plotorder is broken at the moment due to a recent change in how chromosomes are sorted along the x-axis. It sounds like maybe the software is plotting a lot of small scaffolds, in addition to the 4 chromosome scale ones. Can you figure out your smallest scaffold that you want to be plotted, then add this bit to that species' entry? minscafsize: 5000000 will only plot scaffolds 5 Mbp or larger.

  AAU:
    proteins: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU-proteins.fa
    chrom: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU.chrom
    genome: /home/FM/tcunha/nematomorpha-synteny/data/AAU/AAU-assembly.fa
    minscafsize: 5000000

Command: snakemake --cores 16 --snakefile ../../scripts/odp/scripts/odp > minimal-ppacificus.log 2>&1 &

Is there another parameter for the config file that would generate ribbon plots?

It is a bit cumbersome at the moment to make these plots, but easier than other solutions out there. Working on making this easier though! Please look at this config file and it should be enough to run the script. Make sure you point to the directory with the BCnS_LGs .rbh files to see the provenance of your species' 4 chromosomes in relation to those ALGs.

cp ../../example_configs/CONFIG_rbh_to_subway.yaml ./
mv CONFIG_rbh_to_subway.yaml config.yaml
# make your edits to the config.yaml file
snakemake --cores 16 --snakefile ../../scripts/odp/scripts/rbh_to_subway

Thanks!

Sure, please let me know if you have suggestions for the documentation or anything else to improve the user experience.

tauanajc commented 1 year ago

Thank you very much Darrin! I do have some small suggestions for documentation, but I will come back for that once I'm done with the analysis.

I am running odp again on the set of species I want to use in the ribbon plot: Hydra, lancelet, drosophila, Celegans and my genome. Basically same command and config file as before, but now I am getting a new error related to colors in the plot (below). Is there a way I could bypass this?

Error in rule plot_synteny_of_ALGs_plus_species: jobid: 0 input: odp/step1-rbh/UnicellMetazoanLgs_Aaustraliensis_reciprocal_best_hits.hmm.D.FET.rbh, odp/step0-chromsize/species/Aaustraliensis.chromsi ze output: odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_xy_reciprocal_best_hits.plotted.rbh, odp/step2-figures/ALG-spec ies_plots/UnicellMetazoanLgs_Aaustraliensis_xy_synteny.pdf, odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_yx_synteny.pdf

RuleException: ValueError in file /home/FM/tcunha/scripts/odp/scripts/odp, line 1551: Invalid RGBA argument: '#8e7ec26' File "/home/FM/tcunha/scripts/odp/scripts/odp", line 2132, in rule_plot_synteny_of_ALGs_plus_species File "/home/FM/tcunha/scripts/odp/scripts/odp", line 1551, in synteny_plot_sheet File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 806, in table File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 343, in add_cell File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/table.py", line 94, in init File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 714, in init File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/_api/deprecation.py", line 454, in wrapper File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 85, in init__ File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 359, in set_facecolor File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/patches.py", line 347, in _set_facecolor File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/colors.py", line 299, in to_rgba File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/site-packages/matplotlib/colors.py", line 374, in _to_rgba_no_colorcycle File "/home/FM/tcunha/miniconda3/envs/odp/lib/python3.11/concurrent/futures/thread.py", line 58, in run Removing output files of failed job plot_synteny_of_ALGs_plus_species since they might be corrupted: odp/step2-figures/ALG-species_plots/UnicellMetazoanLgs_Aaustraliensis_xy_reciprocal_best_hits.plotted.rbh Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

beroe commented 1 year ago

The color Invalid RGBA argument: '#8e7ec26' needs to be 8 digits long. It should probably have another 6 at the end.

conchoecia commented 1 year ago

@tauanajc - I found the source of the error. I think it was caused by some software automatically incrementing the numbers at the end of the color string in one of the ALG database files.

The easiest fix right now is to delete the files called UnicellMetazoanLgs and UnicellMetazoanLgsOrthofinder from the LG_db folder. After doing this it should work.

tauanajc commented 1 year ago

Hi Darrin, sorry for the late reply. I don't have a LG_db folder, so I removed all files with UnicellMetazoanLgs* in the entire analysis, but the run still failed (perhaps I deleted files that shouldn't have). The dot plots should be sufficient for what I need at the moment, so I will try the ribbon plots again later on. Thanks again for all the help.

How should I cite opd at the moment? And you had mentioned that the set of species used to obtain the ALGs is slightly different than that in the Simakov 2022 paper right? Just for reference, could you please share which taxa odp uses? I am assuming the same phyla in the paper, Porifera, Cnidaria, and representatives of Bilateria.

Thanks!

conchoecia commented 1 year ago

Hi Tauana, the folder will be where you installed odp, so odp/LG_db/UnicellMetazoan*. I think they should be there, because I the BCnS_LGs plots would not have appeared otherwise. I am guessing that you following the install instructions and ran make.

I will work on making the interface easier to work with, but for now please let me know if you would like to get the subway plots/ribbon diagrams working properly for your species. I am also working on improving the implementation of this tool to make it easier for people to use.

The taxa used to generate the BCnS_LGs database were Branchiostoma floridae, Ephydatia muelleri, Hydra vulgaris, Pecten yessoensis, and Rhopilema esculentum.

conchoecia commented 1 year ago

Closed the issue, but please open again with new relevant info if need be.