Closed FFerraro99 closed 2 years ago
I don't have time to debug this in detail right now, but it may have something to do with your gene names. Try processing your data to get simpler gene labels; here is a suggested command for doing so:
sed "s/lcl|//" data | sed "s/ .*//g" > data_relabeled.fasta
If the problems persist, let me know, then I'll look in more detail.
Thank you for your rapid response. For clarification, which data are you suggesting I relabel? The cds fasta file?
Thank you, I ran this last night and it worked. Do you by any chance know a way to do this in Python?
You can write a script to do this, or use BioPython, for example:
import Bio.SeqIO

seqs = []
for rec in Bio.SeqIO.parse("./GCF_902806645.1_cgigas_uk_roslin_v1_cds_from_genomic.fna", "fasta"):
    gene_id = rec.id
    # keep only the part after the "lcl|" prefix
    new_id = gene_id.split("|")[1]
    rec.id = new_id
    # clear the description so only the ID is written to the header
    rec.description = ""
    seqs.append(rec)
Bio.SeqIO.write(seqs, "seqs.renamed.fasta", "fasta")
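If BioPython is not available, the same relabeling can be sketched with the standard library alone. This is a minimal sketch that roughly mirrors the sed command above: it assumes headers start with ">", that the "lcl|" prefix comes before the first space, and that sequence lines should pass through unchanged. The function names simplify_header and relabel_fasta are made up for illustration.

```python
def simplify_header(line: str) -> str:
    """Reduce a FASTA header line to its bare identifier.

    Lines that are not headers (no leading ">") are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    # Keep only the text before the first space (like sed "s/ .*//")
    label = line[1:].split(" ", 1)[0]
    # Drop the first "lcl|" occurrence (like sed "s/lcl|//")
    label = label.replace("lcl|", "", 1)
    return ">" + label


def relabel_fasta(in_path: str, out_path: str) -> None:
    """Copy a FASTA file, simplifying every header line along the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(simplify_header(line.rstrip("\n")) + "\n")
```

Calling relabel_fasta("data", "data_relabeled.fasta") would then correspond to the sed pipeline, with the file names taken from that command.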
Thank you so much, this is of infinite help
I have been trying to use the wgd package to analyse cds files from mollusc genomes. I have been getting a KeyError and am not sure how to solve it. The file was from NCBI at https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/905/397/895/GCA_905397895.1_MEDL1/GCA_905397895.1_MEDL1_cds_from_genomic.fna.gz. I removed the colons and obtained the mcl file, but when I moved on to the ksd command, no ks.tsv file was produced.
I have also been getting a warning suggesting that I raise the max_pairwise parameter, and saying that the 13 largest gene families were filtered out. Are there any possible solutions for this problem?
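To judge how far max_pairwise would need to be raised, it can help to look at the family sizes in the mcl file first. The sketch below rests on two assumptions that should be checked against the wgd documentation: that the mcl file lists one gene family per line with whitespace-separated gene IDs, and that max_pairwise limits the number of gene pairs within a family, i.e. n*(n-1)/2 for a family of n genes. The function name family_pair_counts is made up for illustration.

```python
def family_pair_counts(mcl_lines):
    """Return (family_size, pair_count) tuples, largest families first.

    Assumes one family per line with whitespace-separated gene IDs;
    blank lines are skipped.
    """
    counts = []
    for line in mcl_lines:
        genes = line.split()
        if not genes:
            continue
        n = len(genes)
        # number of unordered gene pairs within a family of n genes
        counts.append((n, n * (n - 1) // 2))
    return sorted(counts, reverse=True)
```

Passing an open file handle for the mcl file to family_pair_counts and inspecting the first entries shows how many pairs the largest families would contribute under these assumptions.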