PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License

Wuhan HBCDC-HB-01 variant class is unusually large - check composition #96

Closed ArtPoon closed 4 years ago

ArtPoon commented 4 years ago

I am starting to worry that we are incorrectly assigning sequences to this variant. (Currently there are over 13,000 samples assigned to this variant!) Although it is well known that some genotypes predominate, this is ridiculous!

One potential problem is that truncating the TN93 output with -t 0.0001 at the first step (to reduce output file size) excludes any sample that is more than about 3 nt away from every other genome in the database (0.0001 × ~30,000 nt ≈ 3 substitutions), but I can't think of how that would overinflate this particular variant.

ArtPoon commented 4 years ago

Variant assignment is definitely broken.

ArtPoon commented 4 years ago

I have a suspicion about what is wrong. It comes down to how we are using TN93 distances to compile clusters of "identical" genomes. Suppose we start with a sequence AAA and the next sequence is ANA. These are linked together as "identical" because we are giving uncalled bases (Ns) a free pass. The following sequence is ATA. This gets assigned to the same variant because ANA gives it a free pass - even though it does not match the first sequence AAA.

Compiling variants was originally set up this way because it made it faster to pass through the TN93 output once - but it looks like we're going to have to read the entire contents of this output into memory and then check every new genome against every member of a putative cluster before it gets assigned.
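The problem is that "identical modulo Ns" is not a transitive relation, which a minimal sketch makes obvious (the `matches` helper below is illustrative, not CoVizu code):

```python
def matches(seq1, seq2):
    # "identical" in the free-pass sense: positions with an N match anything
    return all(a == b or 'N' in (a, b) for a, b in zip(seq1, seq2))

print(matches("AAA", "ANA"))  # True: N gets a free pass
print(matches("ANA", "ATA"))  # True: N gets a free pass again
print(matches("AAA", "ATA"))  # False: so "identical" is not transitive
```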

ArtPoon commented 4 years ago

A potential fix is to run networkx.find_cliques(G) on each connected component. A member of a clique is connected to every other member (hence, identical in the sense of TN93 below the threshold of 1e-09). Run time does not look promising, though.

ArtPoon commented 4 years ago

find_cliques is taking too long. Better to check for clique membership during processing in variants.py. Writing potential fix in dev branch now.

ArtPoon commented 4 years ago

Proposed fix is to replace the following code: https://github.com/PoonLab/covizu/blob/b2843aa7566e73d81f79383d44b5b0773f1d0436/scripts/variants.py#L44-L87 https://github.com/PoonLab/covizu/blob/b2843aa7566e73d81f79383d44b5b0773f1d0436/scripts/variants.py#L102

with this:

    # filter edges that are below threshold distance
    stars = {}
    nodes = set()
    for line in handle:
        id1, id2, dist = line.strip().split(',')
        dist = float(dist)
        for node in [id1, id2]:
            if node not in nodes:
                nodes.update({node})

        if dist < 1e-09:
            if id1 not in stars:
                stars.update({id1: set()})
            stars[id1].update({id2})

            if id2 not in stars:
                stars.update({id2: set()})
            stars[id2].update({id1})

    handle.close()

    # assemble cliques from stars
    cliques = []
    for ego, alters in stars.items():
        clique = alters.union({ego})
        for alter in alters:
            # every clique member other than alter itself must be a neighbour of alter
            if clique.difference(stars[alter]) != {alter}:
                clique.remove(alter)
        cliques.append(clique)

The lower loop is processing about 100 nodes (sequences) every 10 seconds.
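For reference, the star-based assembly can be exercised end-to-end on a toy edge list (a self-contained sketch, with `io.StringIO` standing in for the TN93 output file; node names and distances are made up):

```python
import io

# toy TN93 output: two disjoint triangles, {A,B,C} and {D,E,F}
handle = io.StringIO(
    "A,B,0\nA,C,0\nB,C,0\n"
    "D,E,0\nD,F,0\nE,F,0\n"
)

stars = {}   # node -> set of zero-distance neighbours
nodes = set()
for line in handle:
    id1, id2, dist = line.strip().split(',')
    nodes.update([id1, id2])
    if float(dist) < 1e-09:
        stars.setdefault(id1, set()).add(id2)
        stars.setdefault(id2, set()).add(id1)

# assemble cliques from stars
cliques = []
for ego, alters in stars.items():
    clique = alters.union({ego})
    for alter in alters:
        # keep alter only if every other clique member is its neighbour
        if clique.difference(stars[alter]) != {alter}:
            clique.remove(alter)
    cliques.append(clique)

uniques = set(frozenset(c) for c in cliques)
print(uniques == {frozenset("ABC"), frozenset("DEF")})  # True
```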

ArtPoon commented 4 years ago

networkx.find_cliques does not give us what we want:

>>> cliques = list(nx.find_cliques(G))
>>> len(cliques)
3237
>>> foo = [len(clique) for clique in cliques]
>>> sum(foo)
226944
ArtPoon commented 4 years ago

Moving code in progress to new branch issue96

ArtPoon commented 4 years ago

I think the problem of computing time will be helped significantly by being more aggressive about trimming (#76) and filtering genomes with excessive ambiguity.

ArtPoon commented 4 years ago

Working with the HB-01 variant cluster (13106 sequences). Trimmed the alignment to remove the first 53 nucleotides and the last 68 nt. Filtered all sequences with >2% uncalled bases (N, ? or -). Resulting alignment has 11379 sequences.
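The trim-and-filter step can be sketched like this (the `filter_alignment` helper and its argument names are hypothetical; the default thresholds match the numbers above, and the toy example below uses proportionally smaller trims on short sequences):

```python
def filter_alignment(records, trim_left=53, trim_right=68, max_ambig=0.02):
    # drop trim_left/trim_right positions from each aligned sequence, then
    # discard any sequence whose fraction of uncalled bases (N, ? or -)
    # exceeds max_ambig
    filtered = {}
    for name, seq in records.items():
        trimmed = seq[trim_left:len(seq) - trim_right]
        n_ambig = sum(trimmed.count(ch) for ch in 'N?-')
        if n_ambig / len(trimmed) <= max_ambig:
            filtered[name] = trimmed
    return filtered

records = {"clean": "ACGT" * 20, "messy": "N" * 40 + "ACGT" * 10}
kept = filter_alignment(records, trim_left=2, trim_right=2)
print(sorted(kept))  # ['clean']
```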

Even when operating on the smaller intersection graph, find_cliques is prohibitively slow:

>>> uniques = list(set([frozenset(c) for c in cliques]))
>>> len(uniques)
2857
>>> len(cliques)
10836
>>> ig = nx.Graph()
>>> for i in range(len(uniques)):
...     cliq1 = uniques[i]
...     # node is weighted by number of sequences
...     ig.add_node(i, weight=len(cliq1))
...     for j in range(i, len(uniques)):
...         cliq2 = uniques[j]
...         if len(cliq1.intersection(cliq2)) == 0:
...             ig.add_edge(i, j)
... 
>>> len(ig)
2857
>>> max_weight = 0
>>> max_sg = None
>>> for igc in nx.find_cliques(ig):
...     sg = ig.subgraph(igc)
...     weights = nx.get_node_attributes(sg, 'weight').values()
...     total_weight = sum(list(weights))
...     if total_weight > max_weight:
...         max_weight = total_weight
...         max_sg = sg
...         print(max_weight)
... 
3413
3414
3417
3419
3422
3423
3426
3428
3430
3432
3434
3455
3456
3459
3461
3464
3465
3468
3470
3472
3474
3476
3477  # continued to run overnight with no additional outputs
ArtPoon commented 4 years ago

To summarize this proposed method:

  1. The TN93 distances below 1e-9 define a graph of sequence identity, but ambiguous bases cause connected components of this graph to be much larger than they should be.
  2. For each connected component, I would like to extract all maximal cliques - this is the most time-consuming step, and I don't think the stars-based method implemented above is correct.
  3. A graph can be constructed where each node represents a maximal clique and an edge indicates that there is no intersection between the respective cliques. We annotate each node with the number of sequences contained in the corresponding clique.
  4. A maximal clique in this graph (which is actually the complement of an intersection graph) corresponds to a set of mutually disjoint cliques, each representing a variant (a clique of identical sequences). We wish to find the maximal clique that covers the largest total number of sequences (based on the node attributes). This is also a computationally expensive step.
ArtPoon commented 4 years ago

An alternative proposal for a method:

  1. Order sequences by number of ambiguous base calls, fewest first.
  2. Initialize the graph with one node representing the sequence with the fewest ambiguities (tie-breaking method to be determined).
  3. Take the next sequence. If it is "identical" to the first, join it with an edge. Otherwise, leave it unconnected.
  4. For every subsequent sequence:
    • Link it to an existing connected component only if it is "identical" to every member of that component AND it is not identical to any node in any other component.
    • If it is not "identical" to any node in any connected component, use it to create a new component.
    • If it is "identical" to a node in more than one connected component, reject it as too ambiguous.

I have no idea if this will work.
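A rough sketch of the procedure above (the `identical` helper is a hypothetical stand-in for the zero-distance check, and a partial match to a single component, a case the proposal leaves open, is treated here as a reject):

```python
def greedy_variants(seqs):
    # seqs assumed pre-sorted by number of ambiguous bases, fewest first
    def identical(a, b):
        # zero distance with free-pass Ns (stand-in for the real TN93 check)
        return all(x == y or 'N' in (x, y) for x, y in zip(a, b))

    components, rejects = [], []
    for seq in seqs:
        # components containing at least one "identical" member
        hits = [c for c in components if any(identical(seq, s) for s in c)]
        if not hits:
            components.append([seq])        # no match anywhere: new component
        elif len(hits) == 1 and all(identical(seq, s) for s in hits[0]):
            hits[0].append(seq)             # identical to every member: join
        else:
            # matches more than one component (or only part of one):
            # too ambiguous, reject
            rejects.append(seq)
    return components, rejects

components, rejects = greedy_variants(["AAA", "AAA", "TTT", "ANA", "NNN"])
print(rejects)                             # ['NNN']
print(sorted(len(c) for c in components))  # [1, 3]
```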

ArtPoon commented 4 years ago

This is a graph where each node represents a genome in the HB-01 cluster, and each edge indicates that the genomes are "identical" (zero TN93 distance). This was generated on the trimmed and filtered HB-01 sequences.

image

There are several cliques that correspond to well-defined variants. The enormous component in the middle has several nodes with very few edges - these are unrelated genomes that are bridged into this main component by genomes with large numbers of ambiguous bases.

ArtPoon commented 4 years ago

Let's see what happens if we run a community detection algorithm on that main component. I don't think there is going to be a deterministic procedure to partition this component into multiple clusters otherwise.

ArtPoon commented 4 years ago
>>> tn93_file = 'filtered.tn93.csv'
>>> handle = open(tn93_file)
>>> next(handle)
'ID1,ID2,Distance\n'
>>> import networkx as nx
>>> G = nx.Graph()
>>> for line in handle:
...   id1, id2, dist = line.strip().split(',')
...   if float(dist) < 1e-09:
...     G.add_edge(id1, id2)
>>> components = list(nx.connected_components(G))
>>> len(components)
217
>>> [len(c) for c in components]
[9596, 2, 23, 41, 37, 28, 4, 8, 3, 5, 29, 8, 2, 2, 12, 3, 2, 10, 2, 2, 4, 35, 2, 53, 2, 4, 3, 17, 13, 21, 4, 9, 3, 2, 4, 26, 24, 4, 2, 4, 16, 2, 2, 2, 6, 8, 9, 3, 6, 10, 3, 3, 6, 17, 3, 4, 2, 3, 30, 2, 9, 2, 2, 2, 2, 2, 2, 2, 2, 4, 5, 5, 16, 2, 5, 4, 3, 9, 6, 5, 5, 2, 2, 7, 2, 2, 2, 2, 4, 4, 8, 6, 2, 2, 2, 12, 2, 3, 2, 2, 2, 2, 2, 5, 2, 2, 5, 5, 4, 3, 4, 21, 11, 7, 4, 2, 3, 3, 3, 2, 2, 9, 2, 4, 2, 2, 2, 5, 2, 2, 2, 2, 6, 3, 2, 4, 2, 12, 33, 2, 5, 3, 4, 2, 5, 2, 2, 2, 2, 2, 4, 6, 3, 5, 2, 10, 2, 4, 12, 2, 2, 6, 14, 2, 2, 3, 6, 3, 2, 2, 4, 2, 3, 3, 2, 2, 2, 3, 14, 2, 20, 2, 2, 6, 3, 2, 7, 2, 2, 16, 5, 2, 2, 2, 2, 3, 2, 2, 4, 2, 2, 4, 2, 3, 3, 2, 2, 4, 3, 3, 2, 2, 2, 2, 4, 2, 2]
>>> component = components[0]
>>> sg = G.subgraph(component)
>>> from networkx.algorithms.community import greedy_modularity_communities
>>> communities = list(greedy_modularity_communities(sg))
>>> len(communities)
148
>>> [len(c) for c in communities]
[1662, 1549, 1337, 470, 378, 266, 244, 228, 211, 202, 200, 168, 133, 113, 113, 105, 98, 94, 91, 90, 87, 69, 62, 59, 55, 50, 46, 44, 42, 40, 39, 37, 35, 35, 34, 32, 31, 30, 29, 29, 27, 27, 27, 27, 27, 26, 23, 21, 20, 20, 20, 20, 20, 19, 19, 19, 18, 18, 17, 15, 14, 13, 13, 13, 12, 12, 12, 11, 11, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2]
jts commented 4 years ago

I haven't fully thought this through, but could you partition the sequences column-by-column using an MSA? Start with a single partition containing all sequences, then split it in two based on the first variant column (assigning sequences with Ns to both splits), then go to the second variant column, split again, and so on. At the end you could remove sequences assigned to multiple partitions. I think you'd end up with groups of identical sequences modulo ambiguous bases.
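This column-splitting idea can be sketched as follows (entirely hypothetical code, operating on sequence indices rather than the sequences themselves):

```python
from collections import Counter, defaultdict

def column_partition(seqs):
    # split the pool of sequence indices at each column, sending Ns to
    # every split; finally discard indices assigned to multiple partitions
    partitions = [set(range(len(seqs)))]
    for col in range(len(seqs[0])):
        new_parts = []
        for part in partitions:
            ambiguous = {i for i in part if seqs[i][col] == 'N'}
            groups = defaultdict(set)
            for i in part - ambiguous:
                groups[seqs[i][col]].add(i)
            if len(groups) <= 1:
                new_parts.append(part)   # column is not variant in this pool
                continue
            for members in groups.values():
                new_parts.append(members | ambiguous)   # Ns join every split
        partitions = new_parts
    counts = Counter(i for part in partitions for i in part)
    return [{i for i in part if counts[i] == 1} for part in partitions]

parts = column_partition(["AAA", "ANA", "ATA"])
# ANA (index 1) joined both splits at column 1, so it is discarded
print({frozenset(p) for p in parts} == {frozenset({0}), frozenset({2})})  # True
```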

ArtPoon commented 4 years ago

Thanks for the idea @jts - I'm not sure it will work cleanly in the presence of pervasive uncalled bases but if the community detection algorithm doesn't work out, then I may give this a try!

ArtPoon commented 4 years ago

Results from modularity community detection: image

ArtPoon commented 4 years ago

Testing on branch issue96

art@orolo:~/git/covizu$ python3 scripts/variants.py 
importing TN93 distances
built graph with 26914 nodes
Modularity clustering...
Partitioning component of size 14205 into 197 communities
Partitioning component of size 526 into 11 communities
Partitioning component of size 13 into 3 communities
Partitioning component of size 63 into 3 communities
Partitioning component of size 26 into 2 communities
Partitioning component of size 38 into 2 communities
Partitioning component of size 51 into 3 communities
Partitioning component of size 11 into 3 communities
Partitioning component of size 23 into 2 communities
Partitioned graph from 3066 to 3283 components

Note that the version of gisaid-filtered.fa I was using for this test contained 40,611 genomes (roughly 5,000 records removed by filtering criteria). Restricting the TN93 output by setting -t to 0.0001 excludes from this graph calculation any unique genome with no zero-distance neighbours (roughly 12,000 genomes), as well as any genome that is not within about 3 nt of any other genome in the database (about 2,000).

The number of genomes per variant is no longer skewed to one variant with over 10,000 members: image

I've run through the pipeline and evaluated the front-end visualizations, and I'm satisfied enough to merge the issue96 branch into dev: Screenshot from 2020-07-06 22-59-01