LO - 3 - Identifying and dealing with data issues

KatherineCaley commented 1 year ago

inconsistent meta-data (data wrangling REFSOIL GenBank files)
- demonstrate using annotation_db
- explore using dotplots (e.g., 2 sequences that claim to be the same, dotplot to see if this is the case, ungapped smith waterman score -> p-value)
File formats issues
- Duchene phylip formats, solving using bad_phylip app (Gavin to upload this app)
- extremely long fasta sequence labels (e.g. making sure you can collate genomes from one species)

GavinHuttley commented 1 year ago

the bad phylip loader app

import re
from collections import defaultdict
from cogent3.app.composable import define_app, LOADER
from cogent3.app import typing as c3types
from cogent3 import make_aligned_seqs, open_data_store

_wspace = re.compile(r"\s+")

@define_app(app_type=LOADER)
def load_bad_phylip(path: c3types.IdentifierType, moltype="dna") -> c3types.AlignedSeqsType:
    data = path.read()  # interim
    data = data.splitlines()
    num_seqs, seq_len = [int(e) for e in data[0].split()]
    labels = []
    seqs = defaultdict(list)
    seq_num = 0
    getting_labels = True
    for line in data[1:]:
        line = line.strip()
        if not line:
            continue

        if getting_labels:
            label, line = _wspace.split(line, maxsplit=1)
            labels.append(label)
            if len(labels) == num_seqs:
                getting_labels = False
        else:
            label = labels[seq_num]
            seq_num += 1
            if seq_num == num_seqs:
                seq_num = 0

        seq = _wspace.sub("", line)
        seqs[label].append(seq)

    seqs = {l: "".join(s) for l, s in seqs.items()}
    lengths = {len(s) for s in seqs.values()}
    assert lengths == {seq_len}
    return make_aligned_seqs(data=seqs, moltype=moltype)

if __name__ == "__main__":    
    # works on Gavin's machine!
    dstore = open_data_store("~/Desktop/Inbox", suffix="phy", mode="r")
    loader = load_bad_phylip()
    # just doing one file
    print(repr(loader(dstore[0])))

and some sample data locus-0.phy.zip

GavinHuttley commented 1 year ago

@fredjaya can you find examples of fasta files with overloaded sequence labels and show how to use the label_to_name argument to make_<un>aligned_seqs() or load_<un>aligned_seqs() functions docs are here.

Worth noting it's more difficult for a sequence from GenBank since the data is embedded within the features and doing a sequence rename can break the relationship to features in the annotation_db. (Discuss this with Kath.)

fredjaya commented 1 year ago

Will update this message as I come across new examples:

fredjaya commented 1 year ago

Example 1

## Data - Using sequence [EU244602.1](https://www.ncbi.nlm.nih.gov/nuccore/EU244602.1?report=fasta) - Downloaded to file `data/EU244602.1.fa` ## Usage Loading in an unaligned sequence, with and without removing all label characters after the first whitespace: ```python from cogent3 import load_unaligned_seqs fasta_file = "data/EU244602.1.fa" unstripped_seq = load_unaligned_seqs(fasta_file, moltype="dna") stripped_seq = load_unaligned_seqs( fasta_file, moltype="dna", label_to_name=lambda x: x.split()[0]) ) print(unstripped_seq) print(stripped_seq) ``` >\>EU244602.1 Cornechiniscus lobatus from Egypt cytochrome oxidase subunit I-like (COI) gene, partial sequence; mitochondrial GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA > >\>EU244602.1 GGTCAACAAATCATAAAGATATTGGCACTTTGTATTTTATTTTTGGGTTCTGATCTGCCT CTGTCGGCTCTAGATTTAGACTTATTATTCGAACAGAATTGTCTCAACCTGGTATTTTTC TTGGGGATGAGCATTTATATAATGTTTTGGTTACTTCCCACGCTTTAGTTATAATTTTCT TTATAGTTATGCCAATTTTAATTGGAGGTTTTGGTAATTGATTAGTGCCCCTTATAATTG GGGCTCCGGATATGGCGTTCCCTCGGATGAATAATTTGAGTTTTTGACTTTTACCACCTG CTCTCATGCTACTTTTAATATCTTCTAATACTTATGCGGGGGCCGGAACCGGCTGGACTC TTTATCCACCTCTATCAAGCTACTCAGGCCATTCTAATGTGACTGTTGATATAGCTATTT TTTCTTTGCATATTGCCGGAGCTTCTTCTATTTTGGGTGCTATTAATTTTATTACTACTA TTTTTAACATGCGCTCTTTATCTCTGAGTCTTGAAAATATAAGTTTATTTGTTTGATCTG TTTTGATTACTGCCTTTTTACTTCTTTTTTCTCTACCTGTTTTAGCAGGTGGTATTACCA TACTTTTAATAGATCGTAATTTTAATAGCTCATTTTTTGATCCTTCCGGGGGTGGTGATG CCAATTCTTTTTCAGCANGAATTTTGATTTTTTGGTCACCCTGAAGTTTA

Example 2

## Data - [Ensembl compara orthologues of BRCA2 in primates](https://asia.ensembl.org/Homo_sapiens/Gene/Compara_Ortholog?db=core;g=ENSG00000139618;r=13:32315086-32400268) - Selected "Primates (23 species) Humans and other primates" - "Download orthologues" - Download as .fasta ## Usage This time using `load_aligned_seqs`, and labels are delimited by a character instead of whitespace. ```python from cogent3 import load_aligned_seqs fasta_file = "data/Human_BRCA2_orthologues.fa" unstripped_seq = load_aligned_seqs(fasta_file, moltype="dna") stripped_seq = load_aligned_seqs( fasta_file, moltype="dna", label_to_name=lamba x: x.split("_")[0] ) print(unstripped_seq.names) # names only; sequences are long print(stripped_seq.names) ``` >['ENSCSAP00000013938_Csab/1-10260', 'ENSOGAP00000009477_Ogar/1-10218', 'ENSCATP00000015406_Caty/1-10257', 'ENSNLEP00000001277_Nleu/1-10257', 'ENSMNEP00000025034_Mnem/1-10278', 'ENSPCOP00000008219_Pcoq/1-10257', 'ENSPSMP00000026900_Psim/1-10317', 'ENSANAP00000015411_Anan/1-10212' , 'ENSGGOP00000027226_Ggor/1-10248', 'ENSPTRP00000009812_Ptro/1-10254', 'ENSRROP00000030947_Rrox/1-10254', 'ENSMICP00000044704_Mmur/1-8847', 'ENSRBIP00000000297_Rbie/1-1158', 'ENSPPAP00000000836_Ppan/1-10251', 'ENSMLEP00000025266_Mleu/1-11025', 'ENSTSYP00000000441_Csyr/1-10173', 'ENSMMUP00000009432_Mmul/1-10131', 'ENSMFAP00000029418_Mfas/1-10290', 'ENSCJAP00000093240_Cjac/1-9963', 'ENSPPYP00000005997_Pabe/1-10245', 'ENSPANP00000022490_Panu/1-10260', 'ENSP00000369497_Hsap/1-10254', 'ENSCCAP00000031676_Cimi/1-10191'] > >['ENSCSAP00000013938', 'ENSOGAP00000009477', 'ENSCATP00000015406', 'ENSNLEP00000001277', 'ENSMNEP00000025034', 'ENSPCOP00000008219', 'ENSPSMP00000026900', 'ENSANAP00000015411', 'ENSGGOP00000027226', 'ENSPTRP00000009812', 'ENSRROP00000030947', 'ENSMICP00000044704', 'ENSRBIP00000000297', 'ENSPPAP00000000836', 'ENSMLEP00000025266', 'ENSTSYP00000000441', 'ENSMMUP00000009432', 'ENSMFAP00000029418', 'ENSCJAP00000093240', 'ENSPPYP00000005997', 'ENSPANP00000022490', 'ENSP00000369497', 'ENSCCAP00000031676']

KatherineCaley commented 1 year ago

File formats issues (bad phylip) is ported to issue #37

KatherineCaley commented 1 year ago

data wrangling is in #28

KatherineCaley commented 1 year ago

collapsing this issue into #28

KatherineCaley commented 1 year ago

Closing as contented related to this issue will be described in #40

cogent3 / Cogent3Workshop

LO - 3 - Identifying and dealing with data issues #27