PoonLab / ovrf-viz

Review article on overlapping reading frames in viruses
MIT License

Summary statistics for weighted graphs #16

Closed by horaciobam 3 years ago

horaciobam commented 4 years ago

Decide whether the graphs should be treated as undirected or directed.

horaciobam commented 4 years ago

More common summary statistics used for network analysis:

Note: We should measure the number of overlapping edges relative to adjacent edges, both for the whole network and for each node. In our .dot file, adjacent and overlapping edges can be told apart by color:

digraph {
    graph [outputorder=edgesfirst]
    1 [color="#F8766D" fixedsize=true fontname="Courier-Bold" fontsize=85 height=2.7284509239574835 style=filled width=2.7284509239574835]
    1 -> 2 [arrowsize=0.01 color=grey76 len=10 penwidth=22]
    1 -> 3 [arrowsize=0.01 color=grey76 len=10 penwidth=6]
    1 -> 12 [arrowsize=0.01 color=grey76 len=10 penwidth=5]
    1 -> 10 [arrowsize=0.01 color=grey76 len=10 penwidth=4]
    1 -> 1 [arrowsize=0.01 color=grey76 len=10 penwidth=3]
    1 -> 6 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    1 -> 8 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    1 -> 5 [arrowsize=0.01 color=grey76 len=10 penwidth=12]
    1 -> 4 [arrowsize=0.01 color=grey76 len=10 penwidth=3]
    1 -> 11 [arrowsize=0.01 color=grey76 len=10 penwidth=1]
    1 -> 7 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    1 -> 13 [arrowsize=0.01 color=grey76 len=10 penwidth=1]
    1 -> 7 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
    1 -> 1 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
    1 -> 10 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
    2 [color="#E18A00" fixedsize=true fontname="Courier-Bold" fontsize=85 height=3.5433819375782165 style=filled width=3.5433819375782165]
    2 -> 1 [arrowsize=0.01 color=grey76 len=10 penwidth=13]
    2 -> 7 [arrowsize=0.01 color=grey76 len=10 penwidth=7]
    2 -> 3 [arrowsize=0.01 color=grey76 len=10 penwidth=12]
    2 -> 8 [arrowsize=0.01 color=grey76 len=10 penwidth=13]
    2 -> 10 [arrowsize=0.01 color=grey76 len=10 penwidth=10]
    2 -> 4 [arrowsize=0.01 color=grey76 len=10 penwidth=7]
    2 -> 12 [arrowsize=0.01 color=grey76 len=10 penwidth=28]
    2 -> 13 [arrowsize=0.01 color=grey76 len=10 penwidth=10]
    2 -> 11 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    2 -> 2 [arrowsize=0.01 color=grey76 len=10 penwidth=6]
    2 -> 9 [arrowsize=0.01 color=grey76 len=10 penwidth=3]
    2 -> 6 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    2 -> 1 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
    2 -> 10 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]

I can also get the statistics directly from the new_viz_ovrf.py script itself.
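As a rough sketch of the color-based count described above, the edge lines of the .dot file can be tallied per source node and per color using only the Python standard library. The mapping of colors to edge types is an assumption here (grey76 as adjacent, #143D59 as overlapping), since the thread does not state it, and the helper names `edge_color_counts` and `overlap_ratio` are hypothetical, not part of new_viz_ovrf.py.

```python
import re
from collections import Counter, defaultdict

# Assumed color coding (not stated in the thread): grey edges are adjacent,
# dark blue edges are overlapping. Swap these if the coding is the reverse.
ADJACENT_COLOR = "grey76"
OVERLAP_COLOR = "#143D59"

# Matches edge lines such as:
#   1 -> 7 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
EDGE_RE = re.compile(r'^\s*(\w+)\s*->\s*(\w+)\s*\[(.*)\]\s*$')
COLOR_RE = re.compile(r'\bcolor="?([#\w]+)"?')


def edge_color_counts(dot_text):
    """Count edges per source node and color in Graphviz DOT text."""
    per_node = defaultdict(Counter)
    for line in dot_text.splitlines():
        match = EDGE_RE.match(line)
        if not match:
            continue  # skip node declarations, graph attributes, braces
        source, _target, attrs = match.groups()
        color = COLOR_RE.search(attrs)
        if color:
            per_node[source][color.group(1)] += 1
    return per_node


def overlap_ratio(counts):
    """Overlapping edges divided by adjacent edges (None if no adjacent edges)."""
    adjacent = counts[ADJACENT_COLOR]
    return counts[OVERLAP_COLOR] / adjacent if adjacent else None


# A few edge lines copied from the excerpt above.
sample = '''
digraph {
    1 -> 2 [arrowsize=0.01 color=grey76 len=10 penwidth=22]
    1 -> 7 [arrowsize=0.01 color=grey76 len=10 penwidth=2]
    1 -> 7 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
    2 -> 1 [arrowsize=0.01 color=grey76 len=10 penwidth=13]
    2 -> 1 [arrowsize=0.01 color="#143D59" len=10 penwidth=1]
}
'''

per_node = edge_color_counts(sample)
total = sum(per_node.values(), Counter())  # whole-network tally
for node, counts in sorted(per_node.items()):
    print(node, dict(counts), overlap_ratio(counts))
print("network", dict(total), overlap_ratio(total))
```

This counts edges by their source node only; counting incoming edges as well would need the target captured by the same regular expression.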

horaciobam commented 4 years ago

Try using NetworkX
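A minimal sketch of what this could look like, assuming NetworkX is installed and that penwidth can stand in for the edge weight (the thread does not confirm this). The edges are a handful copied from the .dot excerpt. A `MultiDiGraph` is used because the plot has direction and parallel edges of different colors between the same pair of nodes.

```python
import networkx as nx

# A few edges copied from the .dot excerpt: (source, target, penwidth, color).
# Treating penwidth as the edge weight is an assumption.
edges = [
    (1, 2, 22, "grey76"),
    (1, 5, 12, "grey76"),
    (1, 1, 3, "grey76"),
    (1, 7, 2, "grey76"),
    (1, 7, 1, "#143D59"),
    (2, 1, 13, "grey76"),
    (2, 12, 28, "grey76"),
    (2, 1, 1, "#143D59"),
]

# MultiDiGraph keeps direction and allows parallel edges (same pair, two colors).
G = nx.MultiDiGraph()
for source, target, weight, color in edges:
    G.add_edge(source, target, weight=weight, color=color)

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
n_loops = nx.number_of_selfloops(G)
out_degree = dict(G.out_degree())                   # edge count per node
out_strength = dict(G.out_degree(weight="weight"))  # summed weights per node

print(n_nodes, n_edges, n_loops)
print(out_degree)
print(out_strength)
```

Weighted out-degree (often called strength) is only one of many possible summaries; which statistics are most useful here is still an open question in this thread.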

horaciobam commented 4 years ago

Maybe use graph kernels?

horaciobam commented 4 years ago

Mean kmer distance, computed in R with mean(upper.tri(km)):

Family           Mean kmer distance  Baltimore group
Adenoviridae     0.4997828           dsDNA
Coronaviridae    0.4991135           (+) ssRNA
Mononegavirales  0.4997619           (-) ssRNA
Reoviridae       0.4994664           dsRNA
Retroviridae     0.4985549           Retrovirus
Rhabdoviridae    0.4995567           (-) ssRNA
horaciobam commented 4 years ago

Double-check the results. Are the data files correct? The values may be interpreted as doubles instead of integers. Note that upper.tri returns a logical mask, not the values, so use it to index the matrix itself and take only the upper triangle.

horaciobam commented 4 years ago

Corrected mean kmer distance by using:

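# All is assumed to be a list of kmer distance matrices, one per family
for (d in All) {
  up <- upper.tri(d)  # logical mask of the strict upper triangle
  m <- mean(d[up])    # index the matrix with the mask, then average
  print(m)
}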
Family           Mean kmer distance  Baltimore group  Host                                               Number of complete ref genomes
Adenoviridae     0.2317234           dsDNA            Human, Non-human vertebrate                        72
Coronaviridae    0.2101119           (+) ssRNA        Human, Non-human vertebrate                        65
Mononegavirales  0.2559121           (-) ssRNA        Human, Non-human vertebrate                        327
Reoviridae       0.2877149           dsRNA            Human, Non-human vertebrate, invertebrate, plants  887
Retroviridae     0.26761             Retrovirus       Human, Non-human vertebrate                        93
Rhabdoviridae    0.2489143           (-) ssRNA        Human, Non-human vertebrate, invertebrate, plants  82
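The gap between the first and the corrected means probably comes down to averaging the mask instead of the values it selects. The sketch below reproduces both calculations in plain Python on a small hypothetical distance matrix (`km` here is made-up data, not one of the family matrices). For an n x n matrix the mean of the mask is always (n - 1) / (2n), which approaches 0.5 as n grows, whatever the distances are.

```python
def upper_triangle_mask(n):
    """Boolean n x n mask, True strictly above the diagonal (like R's upper.tri)."""
    return [[col > row for col in range(n)] for row in range(n)]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Hypothetical symmetric distance matrix with a zero diagonal.
km = [
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.3],
    [0.4, 0.3, 0.0],
]
n = len(km)
mask = upper_triangle_mask(n)

# First attempt: averaging the mask itself gives the fraction of True cells,
# (n - 1) / (2 * n), regardless of the distances.
mask_mean = mean(cell for row in mask for cell in row)

# Corrected: use the mask to select distances, then average those.
distance_mean = mean(
    km[i][j] for i in range(n) for j in range(n) if mask[i][j]
)

print(mask_mean, distance_mean)
```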
horaciobam commented 4 years ago

The full table with all evaluated families is in fam_analysis.csv in the data folder.