PombertLab / SYNY

The SYNY pipeline investigates synteny between species by reconstructing protein clusters from gene pairs.
MIT License

Is it possible to compare only particular scaffolds instead of the whole genome? #3

Closed sa-andre closed 3 months ago

sa-andre commented 3 months ago

I am working with a set of genes that are scattered across fish genomes, and I am looking for a way to compare the scaffolds carrying them to help define the syntenic/ancestral regions they share. However, most fish genomes are assembled only to the scaffold level, so I think following the pipeline by downloading each genome's .gbff file would result in >1k scaffolds being compared pairwise, which would be hard to visualize.

If I have a list of scaffolds that I know contain my genes, is it possible to compare only those? I was looking for a way to download individual scaffolds as .gbff files but couldn't figure out how. Or would a .gb file work as well?

Pombert-JF commented 3 months ago

That should be easy to implement with a --include option. Working on it...

Pombert-JF commented 3 months ago

If you have a link to the .gbff files and a text file with the contigs you want, I'll test it on your dataset...

sa-andre commented 3 months ago

I apologize in advance because I'm still really new to bioinformatics, so I may be providing the wrong files, but one of the comparisons I intend to make is between Colossoma macropomum and Pygocentrus nattereri. I have located the scaffolds carrying genes from the gene family I am studying. A few of these scaffolds should show signs of synteny, if not all.

I am pasting below the FTP links to the genomes and the scaffold IDs. I am not sure whether the .gbff files will refer to RefSeq IDs (e.g. those starting with NW) or to the submitted IDs (e.g. CAJ), so I am pasting both. I hope this is correct.

C. macropomum - GCF_904425465.1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/904/425/465/GCA_904425465.1_Colossoma_macropomum/GCA_904425465.1_Colossoma_macropomum_genomic.gbff.gz

NW_023494793.1 - CAJGBK010000010.1
NW_023494809.1 - CAJGBK010000026.1
NW_023494810.1 - CAJGBK010000027.1
NW_023494787.1 - CAJGBK010000004.1
NW_023495332.1 - CAJGBK010000549.1
NW_023494785.1 - CAJGBK010000002.1

P. nattereri - GCF_015220715.1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/015/220/715/GCA_015220715.1_fPygNat1.pri/GCA_015220715.1_fPygNat1.pri_genomic.gbff.gz

NC_051239.1 - CM026749.1
NC_051234.1 - CM026744.1
NC_051211.1 - CM026721.1
NC_051228.1 - CM026738.1

Pombert-JF commented 3 months ago

Pushed the new version to GitHub. Running SYNY on your dataset took about 3 minutes on my laptop using mashmap as the genome aligner and no circos plotting. The .gbff.gz files you linked did not contain annotations so there was no point in running gene cluster searches.
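A quick way to check whether a GenBank flat file carries annotations before running gene cluster searches is to count its feature-table entries. The sketch below is not part of SYNY; the function names `count_features` and `count_features_in_gz` are hypothetical, and it assumes the standard flat-file layout in which feature keys start at column 6 of the FEATURES block.

```python
import gzip
import io


def count_features(handle, keys=("CDS", "gene")):
    """Count feature-table entries of the given types in GenBank flat-file text."""
    counts = {key: 0 for key in keys}
    in_features = False
    for line in handle:
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith(("ORIGIN", "CONTIG", "//")):
            in_features = False
        elif in_features and line.startswith("     ") and not line.startswith("      "):
            # Feature keys are indented by exactly five spaces;
            # qualifier lines are indented further and are skipped.
            fields = line.split()
            if fields and fields[0] in counts:
                counts[fields[0]] += 1
    return counts


def count_features_in_gz(path, keys=("CDS", "gene")):
    """Same count, reading a gzip-compressed .gbff.gz file from disk."""
    with gzip.open(path, "rt") as handle:
        return count_features(handle, keys)


if __name__ == "__main__":
    # Toy record with a source feature only (no gene or CDS entries).
    demo = io.StringIO(
        "LOCUS       TOY1   10 bp    DNA     linear   VRT\n"
        "FEATURES             Location/Qualifiers\n"
        "     source          1..10\n"
        '                     /organism="toy"\n'
        "ORIGIN\n"
        "        1 acgtacgtac\n"
        "//\n"
    )
    print(count_features(demo))
```

If both counts come back as zero for a file, there is nothing for a gene cluster search to work with, and only the genome alignment steps are useful.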

Command line used: run_syny.pl -a *.gbff.gz -outdir SUBSET --aligner mashmap -no_clus --no_circos --include names.txt

The content of names.txt is listed below. You could also use separate text files instead, e.g. --include names_1.txt names_2.txt. Basically the new version creates a database of contigs to keep if you invoke the --include option.

# Content of names.txt
CAJGBK010000010.1
CAJGBK010000026.1
CAJGBK010000027.1
CAJGBK010000004.1
CAJGBK010000549.1
CAJGBK010000002.1
CM026749.1
CM026744.1
CM026721.1
CM026738.1
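As a rough illustration of the keep-list idea (this is not SYNY's actual code), the sketch below reads one or more name files and keeps only the matching records from GenBank flat-file text. The function names `load_keep_list` and `filter_records` are hypothetical, and it assumes each record states its accession on a `VERSION` line and ends with `//`. The second record in the demo uses a made-up accession.

```python
import io


def load_keep_list(handles):
    """Collect accession names from text handles, skipping blanks and '#' comments."""
    keep = set()
    for handle in handles:
        for line in handle:
            name = line.strip()
            if name and not name.startswith("#"):
                keep.add(name)
    return keep


def filter_records(handle, keep):
    """Yield the text of each GenBank record whose VERSION accession is in keep."""
    record, accession = [], None
    for line in handle:
        record.append(line)
        if line.startswith("VERSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else None
        elif line.startswith("//"):
            # End of record: emit it only if its accession was requested.
            if accession in keep:
                yield "".join(record)
            record, accession = [], None


if __name__ == "__main__":
    names = io.StringIO("# Content of names.txt\nCM026749.1\n")
    demo = io.StringIO(
        "LOCUS       CM026749  10 bp    DNA     linear   VRT\n"
        "VERSION     CM026749.1\n"
        "ORIGIN\n"
        "        1 acgtacgtac\n"
        "//\n"
        "LOCUS       TOY000001  10 bp    DNA     linear   VRT\n"
        "VERSION     TOY000001.1\n"
        "ORIGIN\n"
        "        1 acgtacgtac\n"
        "//\n"
    )
    kept = list(filter_records(demo, load_keep_list([names])))
    print(len(kept), "record(s) kept")
```

For real files, the handles could come from `open("names.txt")` and `gzip.open(path, "rt")`. Passing several name files to `load_keep_list` mirrors the multi-file form of `--include` mentioned above.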

There is a visual artefact in the dotplots with CAJGBK010000549.1 that results in spurious vertical/horizontal lines. That contig is tiny compared to the others, so the artefact probably wouldn't be visible if the contigs were drawn to scale. Thinking about adding a scaled dotplots option to the to-do list...

[Images: dotplots GCA_015220715_vs_GCA_904425465 and GCA_904425465_vs_GCA_015220715]

You can see how tiny that contig is compared to the others in the barplots (it is nearly invisible in the linemaps):

[Image: barplot GCA_904425465_vs_GCA_015220715]
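To spot such tiny contigs before plotting, the sequence lengths can be read from the LOCUS lines of the GenBank files. This is a minimal sketch, not something SYNY provides; `contig_lengths` and `relative_sizes` are hypothetical names, the records in the demo are invented, and note that the LOCUS name usually omits the accession version suffix.

```python
import io


def contig_lengths(handle):
    """Return {locus_name: length_in_bp} parsed from GenBank LOCUS lines."""
    lengths = {}
    for line in handle:
        if line.startswith("LOCUS"):
            fields = line.split()
            if "bp" in fields:
                idx = fields.index("bp")
                # The length is the token right before 'bp'.
                if idx >= 2 and fields[idx - 1].isdigit():
                    lengths[fields[1]] = int(fields[idx - 1])
    return lengths


def relative_sizes(lengths):
    """Express each contig length as a fraction of the longest contig."""
    if not lengths:
        return {}
    longest = max(lengths.values())
    return {name: size / longest for name, size in lengths.items()}


if __name__ == "__main__":
    # Invented LOCUS lines, for illustration only.
    demo = io.StringIO(
        "LOCUS       TOY_BIG   2000000 bp    DNA     linear   VRT 01-JAN-2000\n"
        "LOCUS       TOY_TINY     5000 bp    DNA     linear   VRT 01-JAN-2000\n"
    )
    for name, fraction in relative_sizes(contig_lengths(demo)).items():
        print(f"{name}\t{fraction:.4f}")
```

Contigs whose fraction is very small are the ones likely to collapse into thin lines in unscaled dotplots.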

sa-andre commented 3 months ago

Wow thanks a lot, I will try it myself!

Pombert-JF commented 3 months ago

There is less noise in the alignments produced with minimap2, but otherwise the outputs are congruent (the run took about 14 min and peaked at about 10 GB of RAM).

Command line used: run_syny.pl -a *.gz -outdir SUBSET_minimap --aligner minimap -no_clus --no_circos --include names.txt

[Image: dotplot GCA_904425465_vs_GCA_015220715 (minimap2)]

Pombert-JF commented 3 months ago

Will close this issue as resolved, but let me know if you encounter any issues.