GavinHuttley commented 11 months ago

Know

Technology

- computer setup page(s) (participants to do it before attending)
- installing / updating cogent3 (PyPI or GitHub). (Advice from Peter M was to do it via conda, but this won't work on macOS if users don't have Homebrew + Xcode tools.)
- @ pre-meeting
- intro to PyPI and GitHub
- intro to pip
- intro to conda
- explanation of virtual environments (what and why)
- how to ask for help (GitHub Discussions)
- raise issues, contribute (Issues, c3dev)
- backup: get jupyterhub working
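A setup page could include a small self-check that participants run before attending. The following is only a sketch using the Python standard library; the function names are hypothetical and not part of cogent3, and the virtual environment test only recognises `venv`-style environments (a conda environment is not detected this way).

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def in_virtual_env() -> bool:
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base interpreter.
    # Conda environments are NOT detected by this comparison.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    version = installed_version("cogent3")
    if version is None:
        print("cogent3 is not installed in this environment")
    else:
        print(f"cogent3 {version} is installed")
    print(f"venv-style virtual environment: {in_virtual_env()}")
```

Participants who see "not installed" would then follow the pip or conda instructions on the setup page.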

LO - Understanding experimental design issues

What they need to consider in choosing sequences for study.

- Reproducible computation -- `scitrack`
- different sequence types and relevance for experimental design
- different sequence relationship types

LO - Getting data

- Published GenBank IDs (e.g. REFSOIL)
- Published (already aligned) data set (Duchene et al example)
- Ensembl downloading and installing

LO - Sampling ensembl

- Downloading
- Installing
- Data summaries

LO - Identifying and dealing with data issues

- inconsistent meta-data (data wrangling REFSOIL GenBank files)
- demonstrate using `annotation_db`
- explore using dotplots
- File format issues
- Duchene phylip formats, solving using `bad_phylip` app
- extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)
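As an illustration of the long-label problem, here is a minimal sketch in plain Python (not the cogent3 API) that shortens FASTA labels to their first whitespace-delimited token and fails loudly if two records would end up with the same label. The function names and the 30-character limit are assumptions chosen for the example.

```python
from typing import Iterable, Iterator


def shorten_label(label: str, max_length: int = 30) -> str:
    """Keep the first whitespace-delimited token of a label, truncated to max_length."""
    parts = label.split()
    first = parts[0] if parts else ""
    return first[:max_length]


def relabel_fasta(lines: Iterable[str], max_length: int = 30) -> Iterator[str]:
    """Yield FASTA lines with shortened labels, raising if labels collide."""
    seen = {}  # short label -> original label
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            original = line[1:]
            short = shorten_label(original, max_length)
            if short in seen:
                raise ValueError(
                    f"label {short!r} is ambiguous: {seen[short]!r} vs {original!r}"
                )
            seen[short] = original
            yield ">" + short
        else:
            # Sequence lines pass through unchanged.
            yield line


# Made-up records, for illustration only.
records = [
    ">seqA some very long free-text description of the first record",
    "ACGT",
    ">seqB another very long free-text description",
    "GGCC",
]
print("\n".join(relabel_fasta(records)))
```

Raising on collisions rather than silently renaming keeps the mapping between short and original labels recoverable, which matters when collating genomes from one species.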

LO - sampling sequence classes Ensembl

- sampling homologous sequences
- sampling alignments

LO - Alignments

- using cogent3
- quantifying alignment quality
- visualisation

LO - Sampling alignments

- selecting by length
- codon positions
- consistent species presence
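The three sampling criteria above can be sketched as follows. This is a plain-Python illustration in which a dict of name to aligned sequence stands in for a cogent3 alignment object; the function names are hypothetical, and the codon slicing assumes the alignment is in frame starting at the first column.

```python
from typing import Dict, Iterable

Alignment = Dict[str, str]  # species name -> aligned sequence (stand-in type)


def long_enough(aln: Alignment, min_length: int) -> bool:
    """True if the alignment is non-empty and every row has at least min_length columns."""
    return bool(aln) and all(len(seq) >= min_length for seq in aln.values())


def has_species(aln: Alignment, required: Iterable[str]) -> bool:
    """True if every required species is present in the alignment."""
    return set(required) <= set(aln)


def codon_position(aln: Alignment, position: int) -> Alignment:
    """Return only the columns at codon position 1, 2 or 3 (assumes in-frame data)."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    return {name: seq[position - 1 :: 3] for name, seq in aln.items()}


# Made-up toy alignment, for illustration only.
aln = {"human": "ATGGCCTTT", "mouse": "ATGGCATTC", "dog": "ATG---TTT"}

if long_enough(aln, 9) and has_species(aln, ["human", "mouse", "dog"]):
    third = codon_position(aln, 3)
    print(third)
```

In a workshop setting the same filters would be applied across a directory of alignments, keeping only those that pass all criteria.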

LO - Unsolved / Important problems

- alignment quality scores!
- pair and multiple

GavinHuttley commented 11 months ago

Notes

We don't want students competing with each other for bandwidth on a wifi network to download large volumes of data. So we will need example "download" configs that allow them to download a small amount of data. We will also need already downloaded larger data sets, and already "installed" larger data sets, which they can grab. (Noting here that the "installed" data sets are much smaller than the original downloads.)

GavinHuttley commented 11 months ago

We could reframe this as:

- experimental design considerations for methods developers
- getting related sequences with particular properties, e.g. homology type, level of divergence, minimum length
- handling inconsistencies in biological data resources
- multiple sequence alignment and QC

cogent3 / Cogent3Workshop

Learning outcomes #3
