cogent3 / Cogent3Workshop

Materials for the Phylomania workshop
BSD 3-Clause "New" or "Revised" License

LO - 4 - getting data related to a publication #37

Open GavinHuttley opened 1 year ago

GavinHuttley commented 1 year ago

Write content for a wiki page that sketches the motivation, providing links to:

Do all the above as a single comment on this issue. Also make a presentation (10') on this.

@YapengLang write text in a comment on this issue, including screen grabs of the relevant papers so

YapengLang commented 1 year ago

A sketch of the upcoming wiki page:

In this LO, we take a close look at data obtained from a publication. The first two sections introduce the background to the data sampling and the issue of unexpected file formats. The last section consolidates these points by working through the problem with cogent3 apps.

Aims of the data sampling

Parsing the alignment files

Demo for parsing bad Phylip by creating a Cogent3 App

To achieve the workflow in section 2, we need a customised function to load the non-standard Phylip files. Once that is in place, phylogenetic models can be fitted to all alignments sequentially by composing cogent3 apps.

An example data set from one study is here: phy_data.zip

from cogent3 import open_data_store, get_app

# open the input data store; change "data/phy_data.zip" to your data path
in_dstore = open_data_store("data/phy_data.zip", suffix="phy", mode="r")

# open the sqlitedb data store where the output estimates will be written
out_dstore = open_data_store("data/outdb.sqlitedb", mode="w")

# construct the prior tree for each locus from pairwise distances
dist_cal = get_app("fast_slow_dist", fast_calc="paralinear", moltype="DNA")
est_tree = get_app("quick_tree", drop_invalid=False)
calc_tree = dist_cal + est_tree

loader = load_bad_phylip()  # your customised cogent3 app (sketched below)
model = get_app("model", "GTR", tree_func=calc_tree)  # the phylogenetic modeller
writer = get_app("write_db", data_store=out_dstore)  # writes the estimates into the data base

# compose the whole phylogenetic analysis into a single process
process = loader + model + writer
process.apply_to(in_dstore[:5])  # fine to change how many records it is applied to

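For completeness, here is a minimal sketch of what load_bad_phylip could look like. The @define_app pattern is the cogent3 way of creating a composable app; the assumed file layout (a sequential Phylip variant with a header line followed by "name sequence" lines) is purely illustrative, and the parsing logic needs to be adapted to however the real files deviate from the Phylip standard.

from pathlib import Path

from cogent3 import make_aligned_seqs
from cogent3.app.composable import define_app
from cogent3.app.typing import AlignedSeqsType, IdentifierType


@define_app
def load_bad_phylip(path: IdentifierType) -> AlignedSeqsType:
    """parse a non-standard (sequential) Phylip file into a cogent3 alignment"""
    # data store members support .read(); plain paths are read from disk
    text = path.read() if hasattr(path, "read") else Path(path).read_text()
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # assume the first line is the "<num seqs> <seq length>" header and each
    # following line is "<name> <sequence>"; adjust to the actual deviation
    seqs = {}
    for line in lines[1:]:
        name, seq = line.split(maxsplit=1)
        seqs[name] = seq.replace(" ", "")
    return make_aligned_seqs(seqs, moltype="dna")

Because the decorated function is a composable app, load_bad_phylip() slots directly into the loader + model + writer composition above.
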
fredjaya commented 1 year ago

Main comments

Great wiki structure and content @YapengLang, I think it flows really well for a workshop!

It would be good to limit repeating content across the presentation and tutorial. In the 10min presentation:

Then for the workshop/wiki:

Phylogenetic inference is restricted by the divergence of sequences. Empirical measures of divergence (so-called site saturation) have been developed to track the limits of inference; these are based on the global nucleotide frequencies of a given alignment and are hence model-free. In my study, I want to investigate the link between this measure of site saturation and my novel model-based measure of the limits of phylogenetic inference.

The focus of this paragraph could be shifted from (limits due to) saturation to the data requirements for good phylogenetic inference.

The reason for this is that it better caters to the participants (phylogeneticists): the limits arising from saturation would need to be explained in more detail, which goes beyond the scope of the workshop. Focusing on the issues caused by divergent sequences would be ideal.

Things that could be mentioned include why phylogenetic inference is restricted by divergent sequences, e.g. they are hard to align, and can bias inference if highly saturated and unaccounted for.

Other comments

alignments of homologies for different genes (including the priori trees constructed by the distance method) - for my novel measure of inference limit

  • Needs clarity - were these alignments of different, homologous genes?

Each data set is a study for a species family, which consists of various alignments per locus.

  • "species" not needed, perhaps change to "family-level data sets"

The difficulty when parsing the data: some Phylip formats did not follow the strict Phylip definition.

Demo for parsing bad Phylip by Cogent3 App

  • "by creating a Cogent3 app"