arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

KeyError #77

Closed FFerraro99 closed 2 years ago

FFerraro99 commented 2 years ago

I have been trying to use the wgd package to analyze CDS files from mollusc genomes. I keep getting a KeyError and am not sure how to solve it. The file was from NCBI at https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/397/895/GCA_905397895.1_MEDL1/GCA_905397895.1_MEDL1_cds_from_genomic.fna.gz. I removed the colons and obtained the mcl file, but when I moved on to the ksd command, no ks.tsv file was produced.

/usr/local/lib/python3.8/dist-packages/wgd/phy.py in <dictcomp>(.0=<list_iterator object>)
    137         if node.is_leaf():
    138             node.name = id_map[node.name]
    139             id_map[node.name] = node.name  # add identity map for renamed nodes
    140             # to id_map for line below
    141             pairwise_distances[node.name] = {
--> 142                 id_map[x.name]: node.get_distance(x) for x in t.get_leaves()
        x.name = '155173'
        x = Tree node '155173' (0x7f4c44fcfc4)
    143             }
    144         else:
    145             node.name = n
    146             n += 1

KeyError: '155173'

I have also been getting a warning suggesting that I raise the max_pairwise parameter and saying that the 13 largest gene families were filtered out. Are there any possible solutions for this problem?

arzwa commented 2 years ago

I don't have much time to debug this in detail right now, but it may have something to do with your gene names. Try processing your data to get simpler gene labels, for example with:

sed "s/lcl|//" data | sed "s/ .*//g" > data_relabeled.fasta

If the problems persist, let me know, then I'll look in more detail.

FFerraro99 commented 2 years ago

Thank you for your rapid response. For clarification, which data are you suggesting I relabel? The CDS FASTA file?

FFerraro99 commented 2 years ago

Thank you, I ran this last night and it worked. Do you by any chance know a way to do this in Python?

arzwa commented 2 years ago

You can write a script to do this, or use Biopython, for example:

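import Bio.SeqIO

seqs = []
# Read every record from the NCBI CDS FASTA file.
for rec in Bio.SeqIO.parse("./GCF_902806645.1_cgigas_uk_roslin_v1_cds_from_genomic.fna", "fasta"):
    gene_id = rec.id
    # IDs look like "lcl|<name>"; keep the part after the first "|".
    # (This assumes every ID contains a "|"; otherwise it raises IndexError.)
    new_id = gene_id.split("|")[1]
    rec.id = new_id
    # Clear the description so only the short ID ends up in the header.
    rec.description = ""
    seqs.append(rec)
# Write the relabeled records to a new FASTA file.
Bio.SeqIO.write(seqs, "seqs.renamed.fasta", "fasta")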
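
If you would rather not depend on Biopython, here is a rough standard-library sketch that follows the sed command above more literally: it drops the first "lcl|" and everything from the first space onward in each header line. The helper names (relabel_header, relabel_fasta, duplicated_ids) and the example records are made up for illustration and are not part of wgd. The duplicate check is an extra assumption on my side: shortened labels are only useful if they stay unique.

```python
from collections import Counter


def relabel_header(header):
    """Drop the first 'lcl|' and everything from the first space onward."""
    return header.replace("lcl|", "", 1).split(" ", 1)[0]


def relabel_fasta(lines):
    """Yield FASTA lines with simplified headers; sequence lines are unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield relabel_header(line) if line.startswith(">") else line


def duplicated_ids(lines):
    """Return the sorted list of header IDs that occur more than once."""
    counts = Counter(line[1:] for line in lines if line.startswith(">"))
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical records in the style of an NCBI cds_from_genomic file.
example = [
    ">lcl|SEQ1.1_cds_1 [gene=a] [protein=x]",
    "ATGAAATAA",
    ">lcl|SEQ1.1_cds_2 [gene=b]",
    "ATGCCCTAA",
]
relabeled = list(relabel_fasta(example))
print("\n".join(relabeled))
print("duplicated IDs:", duplicated_ids(relabeled))
```

For a real file you could pass an open file handle to relabel_fasta and write the yielded lines, each followed by a newline, to a new file.
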

FFerraro99 commented 2 years ago

Thank you so much, this is of infinite help