Closed mictadlo closed 4 years ago
Hi Michal,
You can use the Orthogroups.tsv
file from the OrthoFinder output. If you have a directory with fasta files containing protein coding genes for your species, like so
$ ls ../data/tfa/*fasta
./ahy.cds.fasta ./bvu.cds.fasta ./dca.cds.fasta ./dsp.cds.fasta ./ugi.cds.fasta
./ath.cds.fasta ./cpa.cds.fasta ./dmu.cds.fasta ./ptr.cds.fasta ./vvi.cds.fasta
./ave.cds.fasta ./cqu.cds.fasta ./dre.cds.fasta ./sly.cds.fasta
you can run orthofinder using orthofinder -f . -og
(this will infer gene families). In the output directory you will find the OrthoGroups.tsv
file. If you have the right GFF files with entries for all genes in the relevant fasta files you can then use py-adhore
using
py-adhore of -f <choose relevant feature> -a <choose relevant attribute> <species names (columns in OrthoGroups.tsv to use> gff_file1 gff_file2 ...
Check the example files in the example
directory and the commands used as shown in the README for an example of how you can format the command you need. Hope this helps!
Hi Arthur, Thank you for your resoponse. Do you think that https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/4096/100/GCF_000393655.1_Nsyl/GCF_000393655.1_Nsyl_genomic.gff.gz will work with py-adhore?
Thank you in advance,
Michal
Sure, what you need is a file with the amino acid sequence data for the coding sequences (CDS) in this genome, and then you can use this gff file like for the example data.
However, if you plan to only look at synteny within a single genome, it might be easier to use the wgd syn
tool in my wgd
package. You can then use wgd dmd <your coding DNA sequences>
to get the paralogous gene families and wgd syn
to obtain the i-ADHoRe output. But you can also use OrthoFinder
+ py-adhore
in the manner described here.
Hi Arthur,
I ran py-adhore
in the following way:
$ py-adhore of Orthogroups.tsv NlabM,Nsyl NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3 GCF_000393655.1_Nsyl_genomic.gff -f gene -n 4 --run
2020-08-10 22:16:44: WARNING Output directory `py-adhore.out` already exists
2020-08-10 22:16:44: WARNING # features different from # species, will use gene for all species
2020-08-10 22:16:44: WARNING # attributes different from # species, will use ID for all species
2020-08-10 22:16:44: INFO Species and their gff specs:
2020-08-10 22:16:44: INFO .. NlabM: feat. = ['gene']; attr. = ID
2020-08-10 22:16:44: INFO .. Nsyl: feat. = ['gene']; attr. = ID
2020-08-10 22:16:44: INFO Writing families file
2020-08-10 22:17:07: INFO Writing gene lists
2020-08-10 22:17:07: INFO Loading NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3 into database ...
2020-08-10 22:17:13: WARNING Directory `py-adhore.out/NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3_lists` already exists, will possibly overwrite
2020-08-10 22:17:14: INFO 100.00% of genes not found in families
2020-08-10 22:17:14: INFO Loading GCF_000393655.1_Nsyl_genomic.gff into database ...
2020-08-10 22:17:20: WARNING Directory `py-adhore.out/GCF_000393655.1_Nsyl_genomic.gff_lists` already exists, will possibly overwrite
2020-08-10 22:17:21: INFO 99.74% of genes not found in families
2020-08-10 22:17:21: INFO Writing gene-based karyotype
2020-08-10 22:17:21: INFO Writing I-ADHoRe 3.0 configuration file
2020-08-10 22:17:21: INFO Running I-ADHoRe [`i-adhore py-adhore.out/adhore.conf`]
This is i-ADHoRe v3.0.
Copyright (c) 2002-2010, Flanders Interuniversity Institute for Biotechnology, VIB.
Algorithm designed by Klaas Vandepoele, Cedric Simillion, Jan Fostier, Dieter De Witte,
Koen Janssens, Sebastian Proost, Yvan Saeys and Yves Van de Peer.
Process 1/1 is alive on localhost.
WARNING: Maximum allowed number of gaps in the alignment not specified. Setting to cluster_gap.
WARNING: Tandem gap size not correct in settings file. Using default (gap_size / 2)
************* i-ADHoRe parameters *************
Number of genelists = 1
Blast table = /mnt/volume1/orthofinder/OrthoFinder/Results_Aug04/py-adhore/py-adhore.out/families.tsv
Output path = /mnt/volume1/orthofinder/OrthoFinder/Results_Aug04/py-adhore/py-adhore.out/i-adhore-out/
Gap size = 30
Cluster gap size = 35
Cloud gap size = 0
Cloud cluster gap size = 0
Max gaps in alignment = 35
Tandem gap = 15
Flush output = 1000
Q-value = 0.75
Anchorpoints = 3
Probability cutoff = 0.01
Cloud filtering method = Binomial
Level 2 only = false
Use family = true
Write statistics = false
Alignment method = GreedyGraphbased4
Multiple hypothesis correction = FDR
Number of threads = 4
Compare aligners = false
Collinear searches only
Visualize GHM.png = false
Visualize Alignment = false
Verbose output = true
************ END i-AdDHoRe parameters *********
Creating dataset... done. (time: 9.20296e-05s)
Mapping gene families... done. (time: 0.080822s)
Remapping tandem duplicates... done. (time: 0.00851798s)
Writing genelists file... done. (time: 0.000352144s)
Collinear Search
Level 2 multiplicon detection... done. (time: 0.000519037s)
Profile detection...
1 multiplicons to evaluate - evaluating level 2 multiplicon... 0 new multiplicons found.
Flushing output files...done.
Time for Higher Level Detection: 0.000602961s.
All Done! Bye...
2020-08-10 22:17:22: INFO These were the parameters for I-ADHoRe:
--number_of_threads '4' --gap_size '30' --cluster_gap '35' --q_value '0.75' --prob_cutoff '0.01' --anchor_points '3' --level_2_only 'false' --alignment_method 'gg2'
To visualize it, I used this command
$ py-adhore cc --js py-adhore.out/i-adhore-out/segments.txt ./py-adhore.out/genes_data.csv
As soon I open this in the html side in broweser I get only:
Based on py-adhore.out/i-adhore-out/segments.txt and ./py-adhore.out/genes_data.csv, showing all elements with blocks > 5000000.0 and of total length > 1000000.0 all chords > 1000000.0 (all = False). Only multiplicons of order > 1 are shown.
The content of circos.html
is
$ cat py-adhore.out/circos/circos.html
<!DOCTYPE html><html><head><script src='https://cdn.rawgit.com/nicgirault/circosJS/v2/dist/circos.js'></script><link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons"> <link rel="stylesheet" href="https://code.getmdl.io/1.3.0/material.indigo-pink.min.css"> <script defer src="https://code.getmdl.io/1.3.0/material.min.js"></script>
</head><body><div><p>Based on py-adhore.out/i-adhore-out/segments.txt and ./py-adhore.out/genes_data.csv, showing all elements with blocks > 5000000.0 and of total length > 1000000.0 all chords > 1000000.0 (all = False). Only multiplicons of order > 1 are shown.</p></div><div width='100%'><svg width='1500' height='1500' id='chart'></svg></div><script>
var thecolors= ['#5D8AA8', '#FE6F5E', '#177245', '#B22222', '#4F7942',
'#86608E','#FCC200', '#ffe119', '#911eb4', '#46f0f0', '#f032e6',
'#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#808000',
'#ffd8b1', '#000075', '#808080']
var myCircos = new Circos({
container: '#chart',
width: 1500,
height: 1500,
});
var configuration = {
tooltipContent: function (d) {
return d.label
},
innerRadius: 500,
outerRadius: 520,
cornerRadius: 0,
gap: 0.01, // in radian
labels: {
display: false,
position: 'center',
size: '14',
color: '#000000',
radialOffset: 80,
},
ticks: {
display: false,
color: 'grey',
spacing: 10000000,
labels: true,
labelSpacing: 10,
labelSuffix: 'Mb',
labelDenominator: 1000000,
labelDisplay0: true,
labelSize: '10px',
labelColor: '#000000',
labelFont: 'default',
majorSpacing: 5,
size: {
minor: 2,
major: 5,
}
},
events: {}
}
var layout_data = []
myCircos.layout(layout_data, configuration);
myCircos.chords('synteny', [], {'color': function(datum, index) { return thecolors[datum.value] }, opacity: 0.5});
What did I miss?
Thank you in advance,
Michal
Hi,
This part of the output should give you a hint that some setting isn't quite right:
2020-08-10 22:17:14: INFO 100.00% of genes not found in families
2020-08-10 22:17:14: INFO Loading GCF_000393655.1_Nsyl_genomic.gff into database ...
2020-08-10 22:17:20: WARNING Directory `py-adhore.out/GCF_000393655.1_Nsyl_genomic.gff_lists` already exists, will possibly overwrite
2020-08-10 22:17:21: INFO 99.74% of genes not found in families
Are you sure that the gene IDs in the Orthogroups file match the IDs in the ID
field of the gene
entries in the gff file? For instance if you have a gene in your OrthoGroups file with ID GSVIVG01026152001
and your gff file contains the following associated entries:
$ grep GSVIVG01026152001 vvi.gff
chr10 JGI 12xMarch2010 gene 13065877 13074201 . - . ID=GSVIVG01026152001;pacid=17833981;id=GSVIVG01026152001.Genoscope12X;tid=GSVIVT01026152001.Genoscope12X,GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10 JGI 12xMarch2010 mRNA 13065877 13074201 . - . ID=GSVIVT01026152001;Parent=GSVIVG01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10 JGI 12xMarch2010 exon 13065877 13065939 . - . ID=GSVIVT01026152001:exon:1;Parent=GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10 JGI 12xMarch2010 three_prime_UTR 13065877 13065939 . - . ID=GSVIVT01026152001:three_prime_UTR;Parent=GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
...
you see that the relevant line is:
chr10 JGI 12xMarch2010 gene 13065877 13074201 . - . ID=GSVIVG01026152001;pacid=17833981;id=GSVIVG01026152001.Genoscope12X;tid=GSVIVT01026152001.Genoscope12X,GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
or the mRNA entry would also work:
chr10 JGI 12xMarch2010 mRNA 13065877 13074201 . - . ID=GSVIVT01026152001;Parent=GSVIVG01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
the right settings are here then either -f gene
-a Nameor
-f mRNA -a Nameor
-f mRNA -a IDor
-f mRNA -a Parent` (each of these would work) you see?
Hi Arthur,
Thank you, I changed to mRNA
and it worked with the following commands:
$ py-adhore of Orthogroups.tsv NlabM,Nsyl Nblab.gff Nsyl.gff -f mRNA -n 4 --run
$ py-adhore cc --js py-adhore.out/i-adhore-out/segments.txt ./py-adhore.out/genes_data.csv
How is it possible for example to add chromosome names on the circos plot?
$ head Orthogroups.tsv
Orthogroup Natt NlabM NlabU Noto Nsyl NtabC NtabS Ntom QldM QldU Slyco
OG0000000 rna-XM_019368520.1, rna-XM_019369806.1, rna-XM_019369900.1, rna-XM_019370033.1, rna-XM_019370109.1, rna-XM_019370275.1, rna-XM_019372327.1, rna-XM_019372328.1, rna-XM_019373341.1, rna-XM_019373708.1, rna-XM_019375707.1, rna-XM_019375877.1, rna-XM_019376095.1, rna-XM_019376377.1, rna-XM_019377166.1, rna-XM_019377741.1, rna-XM_019381751.1, rna-XM_019381905.1, rna-XM_019384034.1, rna-XM_019384360.1, rna-XM_019384494.1, rna-XM_019384495.1, rna-XM_019384989.1, rna-XM_019387012.1, rna-XM_019388947.1, rna-XM_019388958.1, rna-XM_019390778.1, rna-XM_019391032.1, rna-XM_019391033.1, rna-XM_019392893.1, rna-XM_019393234.1, rna-XM_019395412.1, rna-XM_019395564.1, rna-XM_019395777.1, rna-XM_019395782.1, rna-XM_019397264.1, rna-XM_019399662.1, rna-XM_019400752.1, rna-XM_019400878.1, rna-XM_019402147.1, rna-XM_019402189.1, rna-XM_019402197.1, rna-XM_019402715.1, rna-XM_019405106.1, rna-XM_019406058.1, rna-XM_019407647.1, rna-XM_019408338.1, rna-XM_019408374.1, rna-XM_019408688.1, rna-XM_019408775.1, rna-XM_019408917.1, rna-XM_019408953.1, rna-XM_019409304.1, rna-XM_019409787.1, rna-XM_019409841.1, rna-XM_019410487.1, rna-XM_019410682.1, rna-XM_019411313.1, rna-XM_019411401.1 NBlab01G00190.1, NBlab01G03030.1, NBlab01G08410.2, NBlab01G09780.1, NBlab01G09780.2, NBlab01G14300.1, NBlab01G18340.1, NBlab01G20720.1, NBlab01G25630.1, NBlab01G27000.1, NBlab01G28170.2, NBlab01G28170.4, NBlab01G28170.5, NBlab01G31030.1, NBlab02G18800.1, NBlab02G19590.2, NBlab02G19590.4, NBlab02G19590.6, NBlab03G06920.1, NBlab03G14470.1, NBlab03G15220.1, NBlab03G20280.1, NBlab03G23040.1, NBlab03G26070.1, NBlab03G29850.1, NBlab04G07830.1, NBlab04G10060.1, NBlab04G10550.1, NBlab04G10830.1, NBlab04G16130.1, NBlab04G22300.1, NBlab04G24190.1, NBlab04G25750.1, NBlab05G01430.1, NBlab05G10100.1, NBlab05G17450.1, NBlab05G17450.2, NBlab05G25240.1, NBlab06G12230.1, NBlab06G14120.1,
Nsyl.gff example content:
NW_009338805.1 Gnomon gene 10570 12535 . - . ID=gene-LOC104217587;Dbxref=GeneID:104217587;Name=LOC104217587;gbkey=Gene;gene=LOC1042
17587;gene_biotype=protein_coding
NW_009338805.1 Gnomon mRNA 10570 12535 . - . ID=rna-XM_009770987.1;Parent=gene-LOC104217587;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;Name=XM_009770987.1;gbkey=mRNA;gene=LOC104217587;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1 Gnomon exon 12140 12535 . - . ID=exon-XM_009770987.1-1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1 Gnomon exon 11826 11939 . - . ID=exon-XM_009770987.1-2;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1 Gnomon exon 11521 11695 . - . ID=exon-XM_009770987.1-3;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1 Gnomon exon 10570 10889 . - . ID=exon-XM_009770987.1-4;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1 Gnomon CDS 12140 12154 . - 0 ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1 Gnomon CDS 11826 11939 . - 0 ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1 Gnomon CDS 11521 11695 . - 0 ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1 Gnomon CDS 10813 10889 . - 2 ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NLab.gff content.
NbV1Ch01 AUGUSTUS gene 1695599 1697648 0.22 - . ID=NBlab01G00190
NbV1Ch01 AUGUSTUS mRNA 1695599 1697648 0.22 - . ID=NBlab01G00190.1;Note=protein FAR1-RELATED SEQUENCE 5-like;Parent=NBlab01G00190
NbV1Ch01 AUGUSTUS TSS 1697648 1697648 . - . ID=NBlab01G00190.1.TSS1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS exon 1695599 1696090 . - . ID=NBlab01G00190.1.exon1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS exon 1696719 1697035 . - . ID=NBlab01G00190.1.exon2;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS exon 1697175 1697648 . - . ID=NBlab01G00190.1.exon3;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS CDS 1695641 1696090 1 - 0 ID=NBlab01G00190.1.CDS1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS CDS 1696719 1697035 0.84 - 2 ID=NBlab01G00190.1.CDS2;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS CDS 1697175 1697610 0.94 - 0 ID=NBlab01G00190.1.CDS3;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS five_prime_utr 1697611 1697648 0.38 - . ID=NBlab01G00190.1.5UTR1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS intron 1696091 1696718 0.85 - . ID=NBlab01G00190.1.intron1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS intron 1697036 1697174 0.94 - . ID=NBlab01G00190.1.intron2;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS start_codon 1697608 1697610 . - 0 ID=NBlab01G00190.1.SCodon1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS stop_codon 1695641 1695643 . - 0 ID=NBlab01G00190.1.ECodon1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS three_prime_utr 1695599 1695640 0.72 - . ID=NBlab01G00190.1.3UTR1;Parent=NBlab01G00190.1
NbV1Ch01 AUGUSTUS transcription_end_site 1695599 1695599 . - . ID=NBlab01G00190.1.TES1;Parent=NBlab01G00190.1
Thank you in advance,
Michal
Hi Michal, the visualization utilities are not well-developed at this moment, so this is not possible without messing around with the circos.js methods used under the hood (I mainly wrote this stuff for some quick visualizations of the I-ADHoRe output...). Note that you can generate configuration files for circos when not using the --js
, which you could edit for your purposes.
Hi, How is it possible to create
og.csv
?Thank you in advance,
Michal