davidhwyllie / findNeighbour4

A server delivering large scale, incrementable, bacterial relatedness monitoring
MIT License

Build phylogenetic trees for groups of samples identified by findneighbour4 #109

Open davidhwyllie opened 2 years ago

davidhwyllie commented 2 years ago

Background

fn4 is a relatedness monitoring system which stores SNV distances between reference-mapped sequences.
It also has sample clustering functionality. It would be desirable to have phylogenetic trees of sample clusters stored in findneighbour4. This would allow one application (fn4) to provide SNV and phylogenetic relationships between samples.

Clustering algorithms

Various sample clustering algorithms could be implemented, but at present the implemented algorithms group samples based on SNV distances. The algorithms are implemented as classes in snvclusters/ma_linkage.py. Clustering itself is a batch process, implemented in findneighbour4_clustering.py. It is intended that this script runs continuously; it will update clusters as and when new samples become available. Multiple clustering algorithms can be run on a single server, and their contents can be examined using the REST API (see https://github.com/davidhwyllie/findNeighbour4/blob/master/doc/rest-routes.md). They can also be accessed via the persistence classes in the findn/rdbmsstore.py (for RDBMS) or findn/mongoStore.py (for MongoDB) modules. These classes have the same API, and the right class for the database backend can be instantiated via the Persistence class in findn/persistence.py.
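To make "grouping samples based on SNV distances" concrete, here is a minimal sketch of threshold-based single-linkage grouping. It is an illustration only and is not the mixture-aware code in snvclusters/ma_linkage.py; the function names, the threshold and the rule of ignoring positions with uncalled bases are all assumptions made for the example.

```python
from itertools import combinations


def snv_distance(a, b):
    """Count positions where both sequences have a called base (ACGT) and differ."""
    return sum(1 for x, y in zip(a, b) if x != y and x in "ACGT" and y in "ACGT")


def single_linkage_clusters(seqs, threshold):
    """Group samples so that any pair within `threshold` SNVs shares a cluster.

    `seqs` maps a sample name to an aligned sequence (hypothetical input format).
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if snv_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


if __name__ == "__main__":
    demo = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTTTTT"}
    print(single_linkage_clusters(demo, threshold=2))
```

Because linkage is transitive, two samples more than the threshold apart can still share a cluster if a chain of close samples connects them.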
Each cluster is identified by a key and is represented by a JSON object. There is a convention for the format of the key (see code), but this is essentially arbitrary; any alphanumeric key will do. The relevant API methods are

Multiple sequence alignments

fn4 also supports the generation and storage of multiple sequence alignments (MSAs) from arbitrary numbers of sequences, in which invariant sites are dropped. MSAs are persisted to disc, and mechanisms exist for removing MSAs which are outdated. Each MSA is identified by a key and is represented by a JSON object. There is a convention for the format of the key (see code), but this is essentially arbitrary; any alphanumeric key will do.
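As an illustration of what "invariant sites are dropped" means, the sketch below reduces a set of equal-length aligned sequences to the columns that vary. It is not fn4's implementation: the function name is hypothetical, and treating any column with more than one distinct character (including N) as variant is an assumption.

```python
def drop_invariant_sites(seqs):
    """Return (variant_positions, reduced_seqs) for equal-length aligned sequences.

    `seqs` maps a sample name to its aligned sequence (hypothetical input format).
    """
    names = list(seqs)
    if not names:
        return [], {}
    length = len(seqs[names[0]])
    if any(len(seqs[n]) != length for n in names):
        raise ValueError("all sequences must have the same length")

    # Keep a column only if more than one distinct character occurs in it.
    variant_positions = [
        i for i in range(length) if len({seqs[n][i] for n in names}) > 1
    ]
    reduced = {n: "".join(seqs[n][i] for i in variant_positions) for n in names}
    return variant_positions, reduced


if __name__ == "__main__":
    positions, reduced = drop_invariant_sites({"a": "ACGT", "b": "ACGA", "c": "ATGA"})
    print(positions, reduced)
```

Keeping the list of retained positions alongside the reduced sequences lets each column be mapped back to its reference coordinate.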

Because MSA generation can be a little slow for interactive use, at the end of each clustering operation, MSAs are built for each cluster. Existing MSAs are not rebuilt. The code doing this is here

This means that MSAs, minus any invariant bases, are stored for all clusters. This provides all the information needed to rapidly draw a phylogenetic tree from the clustered samples. The MSAs can be accessed in the form of MSAResult objects, which can be read from the database, given a key, by the MSAStore.load() method.

Phylogenetic trees

fn4 contains methods to store phylogenetic trees in a database. It is assumed that
i) each phylogenetic tree is identified by a key. The key could be the same as the key for a corresponding MSA object (in fact, this would make a lot of sense), but any alphanumeric key will do.
ii) the tree is defined as a JSON object.
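The two assumptions above can be sketched with a small in-memory stand-in for a tree store. The class and method names here are hypothetical and are not fn4's persistence API; the JSON layout (a Newick string plus a sample list) is likewise only one possible choice.

```python
import json


class InMemoryTreeStore:
    """Hypothetical key -> JSON tree store, for illustration only."""

    def __init__(self):
        self._trees = {}

    def persist(self, key, tree):
        # Serialise to a JSON string, as a database-backed store might.
        self._trees[key] = json.dumps(tree)

    def is_in_store(self, key):
        return key in self._trees

    def load(self, key):
        # Return the tree as a dict, or None if the key is unknown.
        raw = self._trees.get(key)
        return None if raw is None else json.loads(raw)


if __name__ == "__main__":
    store = InMemoryTreeStore()
    # Reusing the MSA key as the tree key links the tree to its alignment.
    msa_key = "msa|M|no_og|f6d4a345a9bf672197e969e6fd251fd8eb711185"
    store.persist(msa_key, {"newick": "(a:1,b:1);", "samples": ["a", "b"]})
    print(store.load(msa_key))
```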

The API methods for tree storage have the same names as those for MSA objects. Each tree is identified by a key and is represented by a JSON object. There is a convention for the format of the key (see code), but this is essentially arbitrary; any alphanumeric key will do. The relevant API methods are

Not implemented / properly thought through at present

Options

How a framework for generating ML trees from multiple clusters is best written merits thought. Can we just analyse clusters sequentially (yes, probably initially), or do we require a framework for parallelisation (if so, how do we test it, and what framework should be used)? Suppose we have access to (a) a compute cluster or (b) a single multi-CPU machine: how should these best be used?
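One way to keep both options open on a single multi-CPU machine is to write the per-cluster step as a plain function and choose at run time between a sequential loop and a process pool. The sketch below uses Python's standard concurrent.futures module. The build_tree function is a stand-in that returns an unresolved star tree rather than an ML tree, and all names are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def build_tree(item):
    """Stand-in for a real tree builder: returns (key, star-tree Newick string)."""
    key, seqs = item
    # A real implementation would call an ML tree program here.
    newick = "(" + ",".join(sorted(seqs)) + ");"
    return key, newick


def build_all(msas, workers=1):
    """Build one tree per MSA key, sequentially or in a pool of processes.

    `msas` maps an MSA key to a dict of sample name -> aligned sequence.
    """
    items = list(msas.items())
    if workers <= 1:
        return dict(build_tree(i) for i in items)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(build_tree, items))


if __name__ == "__main__":
    msas = {
        "k1": {"b": "AC", "a": "AT"},
        "k2": {"x": "GG", "y": "GA", "z": "TA"},
    }
    print(build_all(msas, workers=2))
```

Since the sequential and parallel paths call the same function, automated tests can exercise the sequential path and compare its output with the parallel one on a small dataset.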

Test set

There are various options suitable for automated testing/CI. The AC587 dataset is one, see https://github.com/davidhwyllie/findNeighbour4/blob/dbe8d15f474ce3642e5916d0587daf19da65abe3/demo/demo_ac587.py

davidhwyllie commented 2 years ago

For a demo:

Clustering runs on a schedule, so you may need to wait a few minutes for clusters to build.

You can check clustering progress with something like the following (see the output of the startup script for the exact file name):

tail /data/software/demos/AC587/log/nohup_fn4_clustering_fac93be11a76461136c3f956297a61a8.out


To illustrate access to stored MSAs:

pipenv run python3 findNeighbour4_treebuild.py demos/AC587/config/config.json

davidhwyllie commented 2 years ago

Example output:

msa|M|no_og|f6d4a345a9bf672197e969e6fd251fd8eb711185
<class 'findn.msa.MSAResult'>
                                          aligned_seq  aligned_seq_len   allN  alignN  ...  expected_proportion2  expected_proportion3  expected_proportion4  what_tested
015fd748-2e21-44e7-9937-fd9363e04bc1  GTAACTCGNGCCCGT               15  29072       1  ...              0.000034              0.000034                  None            M
9c0a67cf-3991-4434-bcb0-f5704156f1ee  NMAACNNACGANNGN               15  27154       6  ...              0.000069              0.000069                  None            M
29a7737f-2298-4e57-bb0e-8678c84bedd8  NMNNNNCACTANNAT               15  37217       7  ...              0.000103              0.000103                  None            M

[3 rows x 18 columns]
{'T': 668473, 'A': 666711, 'C': 1261610, 'G': 1257432}

NOTES:
findneighbour4 uses an 'M' character to represent a mixed basecall. These should be converted to 'N's before the alignment is supplied to a tree-drawing program.
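The conversion described in the note can be sketched as follows. This assumes the aligned sequences have already been extracted into a dict of sample name to sequence (for example from the aligned_seq column shown above); the function name is hypothetical.

```python
def to_fasta_masking_mixed(seqs):
    """Return FASTA text in which fn4's 'M' (mixed basecall) is replaced by 'N'."""
    lines = []
    for name, seq in seqs.items():
        lines.append(f">{name}")
        lines.append(seq.replace("M", "N"))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    aligned = {
        "015fd748-2e21-44e7-9937-fd9363e04bc1": "GTAACTCGNGCCCGT",
        "9c0a67cf-3991-4434-bcb0-f5704156f1ee": "NMAACNNACGANNGN",
    }
    print(to_fasta_masking_mixed(aligned), end="")
```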