PombertLab / SYNY

The SYNY pipeline investigates synteny between species by reconstructing protein clusters from gene pairs.
MIT License
29 stars 4 forks source link

Running error #1

Closed MoraRuizPablo closed 3 months ago

MoraRuizPablo commented 3 months ago

HI, I have downloaded couple of gbff files but when running this wonderful pipeline I have this error message:

Infering colinearity from protein-coding gene clusters

The number of annotation files (2) does not equal the number of protein files (1)

And I was able to run with the examples files you provide, can you help me?

Pombert-JF commented 3 months ago

Can you send me the link to those gbff files? Would need your command line as well to pinpoint the problem

MoraRuizPablo commented 3 months ago

sure:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/910/592/335/GCA_910592335.1_icAdaBipu1.1/

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/951/802/165/GCA_951802165.1_icAdaDece3.1/

and the command is this one:

run_syny.pl -t 8 -a /mnt/c/Users/Pablo\ Mora\ Ruiz/Desktop/new/*.gbff.gz -o Prueba

Pombert-JF commented 3 months ago

So I tested with the annotation files linked. None of these gbff contain any annotation. They contain the genomes but that's it. If you run it with the --no_clus option, it will run only the genome alignment step but not the cluster step. Looks like I need to implement a check for when genome annotations are blank.

I'm also curious about how similar (or not) these genomes are. When I ran the alignment. nothing seems to map: colinear_bases mmap heatmap

I checked and the PAF files are empty.

Pombert-JF commented 3 months ago

Ran a quick MASH analysis on these genomes: GCA_910592335.1_icAdaBipu1.1_genomic.fna GCA_910592335.1_icAdaBipu1.1_genomic.fna 0 0 1000/1000 GCA_951802165.1_icAdaDece3.1_genomic.fna GCA_910592335.1_icAdaBipu1.1_genomic.fna 0.0867434 1.0634e-247 88/1000 GCA_910592335.1_icAdaBipu1.1_genomic.fna GCA_951802165.1_icAdaDece3.1_genomic.fna 0.0867434 1.0634e-247 88/1000 GCA_951802165.1_icAdaDece3.1_genomic.fna GCA_951802165.1_icAdaDece3.1_genomic.fna 0 0 1000/1000

Looks like they are quite divergent at the sequence level. Could explain the lack of mapping with minimap2

Pombert-JF commented 3 months ago

Seems like the blank PAF files are due to minimap2 hitting the RAM ceiling on my laptop. MashMap generates non-blank PAF files from the same input (it runs within a much smaller memory footprint). Never really tested it however. Will test it and look into implementing support for it later this week.

MoraRuizPablo commented 3 months ago

Many thanks!

Pombert-JF commented 3 months ago

The annotation files (2) does not equal the number of protein files (1) issue is now fixed, should not happen again. I implemented the option to use mashmap (https://academic.oup.com/bioinformatics/article/39/9/btad512/7246743) as the aligner instead of minimap2 as well and pushed the changes on GitHub. It can be installed automatically with setup_syny.pl.

I also added the option to exclude contigs by regular expressions. Your dataset included a lot of tiny contigs that caused plotting issues. I guess you could exclude them with --minsize too, but their names were clearly different from that of the main chromosomes so I figured that having the option to exclude contigs with regexes could come in handy.

Ran the new version with mashmap on your data: run_syny.pl -a GCA_910592335.1_icAdaBipu1.1_genomic.gbff.gz GCA_951802165.1_icAdaDece3.1_genomic.gbff.gz -o SYNY_mashmap --aligner mashmap --exclude CAT CAJUZD --no_circos --threads 16 --mpid 85

If using the default percentage identity (--mpid 85), it runs in about 4-5 Gb of RAM on your dataset. Lowering that threshold does crank up RAM usage quite a bit...

For comparisons, minimap2 alignments peaked at over 100 Gb of RAM on one of my servers (explains why my laptop was struggling):

run_syny.pl -a GCA_910592335.1_icAdaBipu1.1_genomic.gbff.gz GCA_951802165.1_icAdaDece3.1_genomic.gbff.gz -o SYNY_minimap --aligner minimap --exclude CAT CAJUZD --no_circos --threads 16

[[M::main] CMD: minimap2 -t 16 -c --cs=long 
/home/jpombert/TEST_SYNY/SYNY_minimap/SEQUENCES/GENOMES/GCA_910592335.fasta 
/home/jpombert/TEST_SYNY/SYNY_minimap/SEQUENCES/GENOMES/GCA_951802165.fasta
[M::main] Real time: 8692.258 sec; CPU: 35543.055 sec; Peak RSS: 103.152 GB

If you check the dotplots, it is a mess, same for the barplots (see below). Most likely due to repeats. I could potentially parse the minimap2 PAF files to remove small and/or low identity pieces (you can glimpse the collinear blocks under that sea of dots), but the MashMap PAF files are not real alignments and each chunk is the same size. Need to read more into it. Inferring synteny from orthologous gene clusters would really help here if the annotations were available.

mashmap: GCA_910592335_vs_GCA_951802165 mmap 1e5 19 2x10 8 blue_mashmap

minimap2: GCA_910592335_vs_GCA_951802165 mmap 1e7 19 2x10 8 blue_minimap2

mashmap: GCA_910592335_vs_GCA_951802165 mmap barplot 19 2x10 8 Spectral_mashmap

minimap2: GCA_910592335_vs_GCA_951802165 mmap barplot 19 2x10 8 Spectral_minimap2

MoraRuizPablo commented 3 months ago

Many thanks for your reply and your effort in order to improve the tool (I'm happy to see that somehow I have "helped" you). I will try to run it trying also to generate the circos (the most interesting part for me now). I have done another kind of analysis using BUSCO genes and I somehow have the idea about the colinearity between genomes (at least what chromosomes are involved in karyotype differences).

Many thanks again