Open GavinHuttley opened 1 year ago
sketches for the coming wiki page:
In this LO, we take a close look at data obtained from a publication. The first two sections introduce the background to the data wrangling and the problem of files whose content does not match the format implied by their extension. The last section then brings these together by building a Cogent3 app.
the nature of my problem
In my study, published data sets were required to develop my novel model-based measure of the limit of phylogenetic inference. The divergent sequences in these published data were curated to develop a measure of site-saturation (paper URL: https://pubmed.ncbi.nlm.nih.gov/34508605/). The statistical test associated with that measure was believed to detect the limit of inference. I therefore want to investigate the relationship between my measure of the limit of inference and the site-saturation measure, using the same data sets as the paper.
[screenshot of paper title and abstract]
the information I needed
The original data sets cited in the above paper are needed to compare the behaviour of the published site-saturation measure with mine. To examine the distribution of the site-saturation measure broadly, I want data sets that span a gradient of site-saturation.
[bar chart of site-saturation of data sets]
Each data set consists of alignments of different, homologous genes. To link the two measures, the alignment for a single locus and the corresponding site-saturation measure are needed, from the files below:
Compressed file of alignments -- for developing the novel model-based measure of the phylogenetic inference limit. In the next section, I will demonstrate how to read this file and handle files whose content does not match the format indicated by the file extension. The phylogenetic models can then be estimated from these alignments.
[a brief workflow of my experiment]
Statistics of site-saturation -- the existing measure of site-saturation. Retrieving these is not covered in this demonstration.
The properties of the alignment files
The difficulty when parsing the data
Some files did not follow the format definition indicated by their file extension. Take the Phylip format as an example:
strict Phylip format definition: [screenshot from the PHYLIP doc]
an example of bad Phylip [screenshot of the duchene's data] 😞😢😭
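To make the contrast concrete, here is a minimal sketch of why a strict reader can mis-parse a relaxed file. It relies only on the strict PHYLIP rule that the name occupies the first 10 columns of a data line. The function name and both example lines are hypothetical, and the long-name case is an assumption about how a file might deviate; the actual deviation in these data is the one shown in the screenshot.

```python
def split_strict_phylip_line(line: str) -> tuple[str, str]:
    """Split one data line the way strict Phylip defines it.

    The name is whatever sits in the first 10 columns (blank padded);
    everything after column 10 is sequence, with spaces ignored.
    """
    name = line[:10].strip()
    seq = line[10:].replace(" ", "")
    return name, seq


# A conforming line: the name is padded to exactly 10 columns.
good = "human     ACGTACGTAC"
good_result = split_strict_phylip_line(good)

# A hypothetical non-conforming line: the name is longer than 10 columns,
# so its tail is silently read as part of the sequence.
bad = "Homo_sapiens_mt ACGTACGTAC"
bad_result = split_strict_phylip_line(bad)

print(good_result)
print(bad_result)
```

In the second case the name is truncated and the sequence gains extra characters, which is the kind of silent corruption a customised loader has to avoid.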
To achieve the workflow in section 2, we need a customised function to load the badly formatted Phylip files. After that, the phylogenetic models can be estimated on all alignments in turn by composing Cogent3 apps.
An example data set for one study is here: phy_data.zip
[ ] Step1: Write a Python function to read the relaxed Phylip format, with key components:
[ ] Step2: Wrap the function into a Cogent3 app using the decorator @define_app(app_type=LOADER).
The script can be downloaded here: parse_bad_phylip.py.zip
[ ] Step3: Construct the whole phylogenetic analysis with the composable apps.
from cogent3 import get_app, open_data_store
# assumption: the downloaded script is importable as parse_bad_phylip and defines load_bad_phylip
from parse_bad_phylip import load_bad_phylip

in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")  # change "data/phy_data.zip" to your data path
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")  # path of the database that will hold the output
# build a starting tree for each locus from pairwise distances
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree
loader = load_bad_phylip()  # your customised cogent3 app
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes the estimates into the database
# compose the whole phylogenetic analysis from the apps
process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change how many alignments it is applied to
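For Step 1, a minimal sketch of a relaxed Phylip reader might look like the following. It is not the content of parse_bad_phylip.py. It assumes a sequential (non-interleaved) layout with a header of two integers, followed by one line per sequence in which the name is the first whitespace-delimited token, of any length. The function name parse_relaxed_phylip is hypothetical, and the code uses only the standard library so it can be tried without Cogent3.

```python
def parse_relaxed_phylip(text: str) -> dict[str, str]:
    """Parse sequential relaxed Phylip text into {name: sequence}.

    Assumes: first non-blank line is "<num_seqs> <seq_length>", then one
    line per sequence, name and sequence separated by whitespace.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no content to parse")

    try:
        num_seqs, seq_len = (int(value) for value in lines[0].split())
    except ValueError as err:
        raise ValueError(f"bad header line: {lines[0]!r}") from err

    seqs: dict[str, str] = {}
    for line in lines[1:]:
        # the name may be longer than 10 characters, unlike strict Phylip
        name, *chunks = line.split()
        seq = "".join(chunks)  # tolerate spaces inside the sequence
        if name in seqs:
            raise ValueError(f"duplicate sequence name: {name!r}")
        if len(seq) != seq_len:
            raise ValueError(f"{name!r} has length {len(seq)}, expected {seq_len}")
        seqs[name] = seq

    if len(seqs) != num_seqs:
        raise ValueError(f"found {len(seqs)} sequences, expected {num_seqs}")
    return seqs


example = "2 8\nspecies_with_a_long_name ACGTACGT\nsp2 ACGT ACGT\n"
parsed = parse_relaxed_phylip(example)
print(parsed)
```

For Step 2, a function like this would be decorated with @define_app(app_type=LOADER) and would return an alignment object (for example via cogent3.make_aligned_seqs) instead of a plain dict. That wrapping is left out here to keep the sketch free of dependencies.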
Great wiki structure and content @YapengLang, I think it flows really well for a workshop!
It would be good to limit repeating content across the presentation and tutorial. In the 10min presentation:
Then for the workshop/wiki:
Phylogenetic inference is restricted by the divergence of sequences. Empirical measurement of divergence (or so-called site-saturation) has been developed to track the limit of inference, which was established on the global nucleotide frequencies for a given alignment and hence model-free. In my study, I want to investigate the link between the measure of site-saturation and my novel model-based phylogenetic measure of inference limit.
The focus of this paragraph can be shifted from (limits due to) saturation to the data requirements for good phylogenetic inference.
The reason is that, to cater to the participants (phylogeneticists), the limit of saturation would need to be explained in more detail, which goes beyond the scope of the workshop. Focusing on the issues that arise from divergent sequences would be ideal.
Things that could be mentioned include why phylogenetic inference is restricted by divergent sequences, e.g. they are hard to align and, if highly saturated and unaccounted for, might bias inference.
alignments of homologies for different genes (including the priori trees constructed by the distance method) - for my novel measure of inference limit
- Needs clarity - were these alignments of different, homologous genes?
Each data set is a study for a species family, which consists of various alignments per locus.
- "species" not needed, perhaps change to "family-level data sets"
The difficulty when parsing the data: some Phylip formats did not follow the strict Phylip definition.
Demo for parsing bad Phylip by Cogent3 App
- "by creating a Cogent3 app"
Write content for a wiki page that sketches the motivation, providing links to:
Do all the above as a single comment on this issue. Also make a presentation (10') on this.
@YapengLang write text in a comment on this issue, including screen grabs of the relevant papers so