arzwa / py-adhore

Tools for performing I-ADHoRe co-linearity inference and visualization
3 stars 0 forks source link

og.csv #1

Closed mictadlo closed 4 years ago

mictadlo commented 4 years ago

Hi, How is it possible to create og.csv?

Thank you in advance,

Michal

arzwa commented 4 years ago

Hi Michal,

You can use the Orthogroups.tsv file from the OrthoFinder output. If you have a directory with fasta files containing protein coding genes for your species, like so

$ ls ../data/tfa/*fasta          
./ahy.cds.fasta  ./bvu.cds.fasta  ./dca.cds.fasta  ./dsp.cds.fasta  ./ugi.cds.fasta
./ath.cds.fasta  ./cpa.cds.fasta  ./dmu.cds.fasta  ./ptr.cds.fasta  ./vvi.cds.fasta
./ave.cds.fasta  ./cqu.cds.fasta  ./dre.cds.fasta  ./sly.cds.fasta

you can run orthofinder using orthofinder -f . -og (this will infer gene families). In the output directory you will find the OrthoGroups.tsv file. If you have the right GFF files with entries for all genes in the relevant fasta files you can then use py-adhore using

py-adhore of -f <choose relevant feature> -a <choose relevant attribute> <species names (columns in OrthoGroups.tsv to use> gff_file1 gff_file2 ...  

Check the example files in the example directory and the commands used as shown in the README for an example of how you can format the command you need. Hope this helps!

mictadlo commented 4 years ago

Hi Arthur, Thank you for your resoponse. Do you think that https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/4096/100/GCF_000393655.1_Nsyl/GCF_000393655.1_Nsyl_genomic.gff.gz will work with py-adhore?

Thank you in advance,

Michal

arzwa commented 4 years ago

Sure, what you need is a file with the amino acid sequence data for the coding sequences (CDS) in this genome, and then you can use this gff file like for the example data.

However, if you plan to only look at synteny within a single genome, it might be easier to use the wgd syn tool in my wgd package. You can then use wgd dmd <your coding DNA sequences> to get the paralogous gene families and wgd syn to obtain the i-ADHoRe output. But you can also use OrthoFinder + py-adhore in the manner described here.

mictadlo commented 4 years ago

Hi Arthur, I ran py-adhore in the following way:

$ py-adhore of Orthogroups.tsv NlabM,Nsyl NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3 GCF_000393655.1_Nsyl_genomic.gff -f gene -n 4 --run
2020-08-10 22:16:44: WARNING    Output directory `py-adhore.out` already exists
2020-08-10 22:16:44: WARNING    # features different from # species, will use gene for all species
2020-08-10 22:16:44: WARNING    # attributes different from # species, will use ID for all species
2020-08-10 22:16:44: INFO   Species and their gff specs:
2020-08-10 22:16:44: INFO    .. NlabM: feat. = ['gene']; attr. = ID
2020-08-10 22:16:44: INFO    .. Nsyl: feat. = ['gene']; attr. = ID
2020-08-10 22:16:44: INFO   Writing families file
2020-08-10 22:17:07: INFO   Writing gene lists
2020-08-10 22:17:07: INFO   Loading NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3 into database ... 
2020-08-10 22:17:13: WARNING    Directory `py-adhore.out/NbRNASeqAll.sorted-proper.add-rg.bam.dedup.bam.gtf.fasta.transdecoder.genome.FixVSaugustus.hints_utrAGAT-swissprot-and-NR.no_TE.no_unchar_hypo.newID.no_remark.gff3_lists` already exists, will possibly overwrite
2020-08-10 22:17:14: INFO   100.00% of genes not found in families
2020-08-10 22:17:14: INFO   Loading GCF_000393655.1_Nsyl_genomic.gff into database ... 
2020-08-10 22:17:20: WARNING    Directory `py-adhore.out/GCF_000393655.1_Nsyl_genomic.gff_lists` already exists, will possibly overwrite
2020-08-10 22:17:21: INFO   99.74% of genes not found in families
2020-08-10 22:17:21: INFO   Writing gene-based karyotype
2020-08-10 22:17:21: INFO   Writing I-ADHoRe 3.0 configuration file
2020-08-10 22:17:21: INFO   Running I-ADHoRe [`i-adhore py-adhore.out/adhore.conf`]

This is i-ADHoRe v3.0.
Copyright (c) 2002-2010, Flanders Interuniversity Institute for Biotechnology, VIB.
Algorithm designed by Klaas Vandepoele, Cedric Simillion, Jan Fostier, Dieter De Witte,
Koen Janssens, Sebastian Proost, Yvan Saeys and Yves Van de Peer.

Process 1/1 is alive on localhost.

WARNING: Maximum allowed number of gaps in the alignment not specified.  Setting to cluster_gap.
WARNING: Tandem gap size not correct in settings file. Using default (gap_size / 2)

************* i-ADHoRe parameters *************
    Number of genelists = 1
    Blast table = /mnt/volume1/orthofinder/OrthoFinder/Results_Aug04/py-adhore/py-adhore.out/families.tsv
    Output path = /mnt/volume1/orthofinder/OrthoFinder/Results_Aug04/py-adhore/py-adhore.out/i-adhore-out/
    Gap size = 30
    Cluster gap size = 35
    Cloud gap size = 0
    Cloud cluster gap size = 0
    Max gaps in alignment = 35
    Tandem gap = 15
    Flush output = 1000
    Q-value = 0.75
    Anchorpoints = 3
    Probability cutoff = 0.01
    Cloud filtering method = Binomial
    Level 2 only = false
    Use family = true
    Write statistics = false
    Alignment method = GreedyGraphbased4
    Multiple hypothesis correction = FDR
    Number of threads = 4
    Compare aligners = false
    Collinear searches only
    Visualize GHM.png = false
    Visualize Alignment = false
    Verbose output = true
************ END i-AdDHoRe parameters *********

Creating dataset...         done. (time: 9.20296e-05s)
Mapping gene families...        done. (time: 0.080822s)
Remapping tandem duplicates...  done. (time: 0.00851798s)
Writing genelists file...       done. (time: 0.000352144s)
Collinear Search
Level 2 multiplicon detection...    done. (time: 0.000519037s)
Profile detection...
1 multiplicons to evaluate - evaluating level 2 multiplicon... 0 new multiplicons found.
Flushing output files...done.
Time for Higher Level Detection: 0.000602961s.

All Done!  Bye...

2020-08-10 22:17:22: INFO   These were the parameters for I-ADHoRe: 
--number_of_threads '4' --gap_size '30' --cluster_gap '35' --q_value '0.75' --prob_cutoff '0.01' --anchor_points '3' --level_2_only 'false' --alignment_method 'gg2' 

To visualize it, I used this command

$ py-adhore cc --js py-adhore.out/i-adhore-out/segments.txt ./py-adhore.out/genes_data.csv

As soon I open this in the html side in broweser I get only:

Based on py-adhore.out/i-adhore-out/segments.txt and ./py-adhore.out/genes_data.csv, showing all elements with blocks > 5000000.0 and of total length > 1000000.0 all chords > 1000000.0 (all = False). Only multiplicons of order > 1 are shown.

The content of circos.html is

$ cat py-adhore.out/circos/circos.html 
<!DOCTYPE html><html><head><script src='https://cdn.rawgit.com/nicgirault/circosJS/v2/dist/circos.js'></script><link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons"> <link rel="stylesheet" href="https://code.getmdl.io/1.3.0/material.indigo-pink.min.css"> <script defer src="https://code.getmdl.io/1.3.0/material.min.js"></script>
    </head><body><div><p>Based on py-adhore.out/i-adhore-out/segments.txt and ./py-adhore.out/genes_data.csv, showing all elements with blocks > 5000000.0 and of total length > 1000000.0 all chords > 1000000.0 (all = False). Only multiplicons of order > 1 are shown.</p></div><div width='100%'><svg width='1500' height='1500' id='chart'></svg></div><script>
    var thecolors= ['#5D8AA8', '#FE6F5E', '#177245', '#B22222', '#4F7942',
        '#86608E','#FCC200', '#ffe119', '#911eb4', '#46f0f0', '#f032e6',
        '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324',  '#808000',
        '#ffd8b1', '#000075', '#808080']

    var myCircos = new Circos({
        container: '#chart',
        width: 1500,
        height: 1500,
    });

    var configuration = {
        tooltipContent: function (d) {
            return d.label
        },
        innerRadius: 500,
        outerRadius: 520,
        cornerRadius: 0,
        gap: 0.01, // in radian
        labels: {
            display: false,
            position: 'center',
            size: '14',
            color: '#000000',
            radialOffset: 80,
        },
        ticks: {
            display: false,
            color: 'grey',
            spacing: 10000000,
            labels: true,
            labelSpacing: 10,
            labelSuffix: 'Mb',
            labelDenominator: 1000000,
            labelDisplay0: true,
            labelSize: '10px',
            labelColor: '#000000',
            labelFont: 'default',
            majorSpacing: 5,
            size: {
                minor: 2,
                major: 5,
            }
        },
        events: {}
    }

var layout_data = []
myCircos.layout(layout_data, configuration);
myCircos.chords('synteny', [], {'color': function(datum, index) { return thecolors[datum.value] }, opacity: 0.5});

What did I miss?

Thank you in advance,

Michal

arzwa commented 4 years ago

Hi,

This part of the output should give you a hint that some setting isn't quite right:

2020-08-10 22:17:14: INFO   100.00% of genes not found in families
2020-08-10 22:17:14: INFO   Loading GCF_000393655.1_Nsyl_genomic.gff into database ... 
2020-08-10 22:17:20: WARNING    Directory `py-adhore.out/GCF_000393655.1_Nsyl_genomic.gff_lists` already exists, will possibly overwrite
2020-08-10 22:17:21: INFO   99.74% of genes not found in families

Are you sure that the gene IDs in the Orthogroups file match the IDs in the ID field of the gene entries in the gff file? For instance if you have a gene in your OrthoGroups file with ID GSVIVG01026152001 and your gff file contains the following associated entries:

$ grep GSVIVG01026152001 vvi.gff 
chr10   JGI 12xMarch2010    gene    13065877    13074201    .   -   .   ID=GSVIVG01026152001;pacid=17833981;id=GSVIVG01026152001.Genoscope12X;tid=GSVIVT01026152001.Genoscope12X,GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10   JGI 12xMarch2010    mRNA    13065877    13074201    .   -   .   ID=GSVIVT01026152001;Parent=GSVIVG01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10   JGI 12xMarch2010    exon    13065877    13065939    .   -   .   ID=GSVIVT01026152001:exon:1;Parent=GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
chr10   JGI 12xMarch2010    three_prime_UTR 13065877    13065939    .   -   .   ID=GSVIVT01026152001:three_prime_UTR;Parent=GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001
...

you see that the relevant line is:

chr10   JGI 12xMarch2010    gene    13065877    13074201    .   -   .   ID=GSVIVG01026152001;pacid=17833981;id=GSVIVG01026152001.Genoscope12X;tid=GSVIVT01026152001.Genoscope12X,GSVIVT01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001

or the mRNA entry would also work:

chr10   JGI 12xMarch2010    mRNA    13065877    13074201    .   -   .   ID=GSVIVT01026152001;Parent=GSVIVG01026152001;Name=GSVIVG01026152001;gene_id=GSVIVG01026152001

the right settings are here then either -f gene -a Nameor-f mRNA -a Nameor-f mRNA -a IDor-f mRNA -a Parent` (each of these would work) you see?

mictadlo commented 4 years ago

Hi Arthur, Thank you, I changed to mRNA and it worked with the following commands:

$ py-adhore of Orthogroups.tsv NlabM,Nsyl Nblab.gff Nsyl.gff -f mRNA -n 4 --run
$ py-adhore cc --js py-adhore.out/i-adhore-out/segments.txt ./py-adhore.out/genes_data.csv

circos

How is it possible for example to add chromosome names on the circos plot?

$ head Orthogroups.tsv 
Orthogroup  Natt    NlabM   NlabU   Noto    Nsyl    NtabC   NtabS   Ntom    QldM    QldU    Slyco
OG0000000   rna-XM_019368520.1, rna-XM_019369806.1, rna-XM_019369900.1, rna-XM_019370033.1, rna-XM_019370109.1, rna-XM_019370275.1, rna-XM_019372327.1, rna-XM_019372328.1, rna-XM_019373341.1, rna-XM_019373708.1, rna-XM_019375707.1, rna-XM_019375877.1, rna-XM_019376095.1, rna-XM_019376377.1, rna-XM_019377166.1, rna-XM_019377741.1, rna-XM_019381751.1, rna-XM_019381905.1, rna-XM_019384034.1, rna-XM_019384360.1, rna-XM_019384494.1, rna-XM_019384495.1, rna-XM_019384989.1, rna-XM_019387012.1, rna-XM_019388947.1, rna-XM_019388958.1, rna-XM_019390778.1, rna-XM_019391032.1, rna-XM_019391033.1, rna-XM_019392893.1, rna-XM_019393234.1, rna-XM_019395412.1, rna-XM_019395564.1, rna-XM_019395777.1, rna-XM_019395782.1, rna-XM_019397264.1, rna-XM_019399662.1, rna-XM_019400752.1, rna-XM_019400878.1, rna-XM_019402147.1, rna-XM_019402189.1, rna-XM_019402197.1, rna-XM_019402715.1, rna-XM_019405106.1, rna-XM_019406058.1, rna-XM_019407647.1, rna-XM_019408338.1, rna-XM_019408374.1, rna-XM_019408688.1, rna-XM_019408775.1, rna-XM_019408917.1, rna-XM_019408953.1, rna-XM_019409304.1, rna-XM_019409787.1, rna-XM_019409841.1, rna-XM_019410487.1, rna-XM_019410682.1, rna-XM_019411313.1, rna-XM_019411401.1            NBlab01G00190.1, NBlab01G03030.1, NBlab01G08410.2, NBlab01G09780.1, NBlab01G09780.2, NBlab01G14300.1, NBlab01G18340.1, NBlab01G20720.1, NBlab01G25630.1, NBlab01G27000.1, NBlab01G28170.2, NBlab01G28170.4, NBlab01G28170.5, NBlab01G31030.1, NBlab02G18800.1, NBlab02G19590.2, NBlab02G19590.4, NBlab02G19590.6, NBlab03G06920.1, NBlab03G14470.1, NBlab03G15220.1, NBlab03G20280.1, NBlab03G23040.1, NBlab03G26070.1, NBlab03G29850.1, NBlab04G07830.1, NBlab04G10060.1, NBlab04G10550.1, NBlab04G10830.1, NBlab04G16130.1, NBlab04G22300.1, NBlab04G24190.1, NBlab04G25750.1, NBlab05G01430.1, NBlab05G10100.1, NBlab05G17450.1, NBlab05G17450.2, NBlab05G25240.1, NBlab06G12230.1, NBlab06G14120.1,

Nsyl.gff example content:

NW_009338805.1  Gnomon  gene    10570   12535   .       -       .       ID=gene-LOC104217587;Dbxref=GeneID:104217587;Name=LOC104217587;gbkey=Gene;gene=LOC1042
17587;gene_biotype=protein_coding
NW_009338805.1  Gnomon  mRNA    10570   12535   .       -       .       ID=rna-XM_009770987.1;Parent=gene-LOC104217587;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;Name=XM_009770987.1;gbkey=mRNA;gene=LOC104217587;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 2 samples with support for all annotated introns;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    12140   12535   .       -       .       ID=exon-XM_009770987.1-1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11826   11939   .       -       .       ID=exon-XM_009770987.1-2;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    11521   11695   .       -       .       ID=exon-XM_009770987.1-3;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  exon    10570   10889   .       -       .       ID=exon-XM_009770987.1-4;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XM_009770987.1;gbkey=mRNA;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;transcript_id=XM_009770987.1
NW_009338805.1  Gnomon  CDS     12140   12154   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11826   11939   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     11521   11695   .       -       0       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1
NW_009338805.1  Gnomon  CDS     10813   10889   .       -       2       ID=cds-XP_009769289.1;Parent=rna-XM_009770987.1;Dbxref=GeneID:104217587,Genbank:XP_009769289.1;Name=XP_009769289.1;gbkey=CDS;gene=LOC104217587;product=ribosome-interacting GTPase 1-like;protein_id=XP_009769289.1

NLab.gff content.

NbV1Ch01        AUGUSTUS        gene    1695599 1697648 0.22    -       .       ID=NBlab01G00190
NbV1Ch01        AUGUSTUS        mRNA    1695599 1697648 0.22    -       .       ID=NBlab01G00190.1;Note=protein FAR1-RELATED SEQUENCE 5-like;Parent=NBlab01G00190
NbV1Ch01        AUGUSTUS        TSS     1697648 1697648 .       -       .       ID=NBlab01G00190.1.TSS1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        exon    1695599 1696090 .       -       .       ID=NBlab01G00190.1.exon1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        exon    1696719 1697035 .       -       .       ID=NBlab01G00190.1.exon2;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        exon    1697175 1697648 .       -       .       ID=NBlab01G00190.1.exon3;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        CDS     1695641 1696090 1       -       0       ID=NBlab01G00190.1.CDS1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        CDS     1696719 1697035 0.84    -       2       ID=NBlab01G00190.1.CDS2;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        CDS     1697175 1697610 0.94    -       0       ID=NBlab01G00190.1.CDS3;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        five_prime_utr  1697611 1697648 0.38    -       .       ID=NBlab01G00190.1.5UTR1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        intron  1696091 1696718 0.85    -       .       ID=NBlab01G00190.1.intron1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        intron  1697036 1697174 0.94    -       .       ID=NBlab01G00190.1.intron2;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        start_codon     1697608 1697610 .       -       0       ID=NBlab01G00190.1.SCodon1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        stop_codon      1695641 1695643 .       -       0       ID=NBlab01G00190.1.ECodon1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        three_prime_utr 1695599 1695640 0.72    -       .       ID=NBlab01G00190.1.3UTR1;Parent=NBlab01G00190.1
NbV1Ch01        AUGUSTUS        transcription_end_site  1695599 1695599 .       -       .       ID=NBlab01G00190.1.TES1;Parent=NBlab01G00190.1

Thank you in advance,

Michal

arzwa commented 4 years ago

Hi Michal, the visualization utilities are not well-developed at this moment, so this is not possible without messing around with the circos.js methods used under the hood (I mainly wrote this stuff for some quick visualizations of the I-ADHoRe output...). Note that you can generate configuration files for circos when not using the --js, which you could edit for your purposes.