Open mr-eyes opened 3 years ago
When the second column holds unusual characters...
Reproduce:

```bash
proteome=drosophila.fa
wget --quiet https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000803/UP000000803_7227.fasta.gz -O ${proteome}.gz
gunzip *gz
```

```python
import kProcessor as kp

# Generate names file
with open("drosophila.fa", 'r') as FASTA, open("drosophila.fa.names", 'w') as NAMES:
    for line in FASTA:
        if line.startswith('>'):
            line = line[1:]  # Remove >
            geneID = line[line.find("GN"):line.find("PE")-1]
            NAMES.write(f"{line.strip()}\t{geneID}\n")

kSize = 7
fasta_file = "drosophila.fa"
chunkSize = 100
names_file = fasta_file + ".names"

KF = kp.kDataFramePHMAP(kp.PROTEIN, kp.protein_hasher, {"kSize":kSize})
cKF = kp.index(KF, fasta_file, chunkSize, names_file)
```

Sample from the names file (columns are tab-separated):

```
tr|A0A0B4K776|A0A0B4K776_DROME Uncharacterized protein OS=Drosophila melanogaster OX=7227 GN=Dmel\CG43371 PE=4 SV=1	GN=Dmel\CG43371
tr|A0A0B4KF40|A0A0B4KF40_DROME Sulfotransferase 3, isoform C OS=Drosophila melanogaster OX=7227 GN=St3 PE=4 SV=1	GN=St3
tr|A0A0B4KF61|A0A0B4KF61_DROME Synaptotagmin 14, isoform D OS=Drosophila melanogaster OX=7227 GN=Syt14 PE=4 SV=1	GN=Syt14
tr|A1A6Q5|A1A6Q5_DROME Acylphosphatase OS=Drosophila melanogaster OX=7227 GN=Dmel\CG34161 PE=2 SV=1	GN=Dmel\CG34161
```
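To see which rows of a names file carry unusual characters in the second column, a small check along these lines may help. This is only a diagnostic sketch, not part of kProcessor: `find_unusual_groups` and `SAFE_GROUP` are hypothetical names, and the set of "safe" characters (letters, digits, `_`, `.`, `=`, `-`) is an assumption chosen so that the backslash in values such as `GN=Dmel\CG43371` is flagged while `GN=St3` is not.

```python
import re

# Assumption: a "usual" second-column value uses only these characters.
SAFE_GROUP = re.compile(r"[A-Za-z0-9_.=-]+")


def find_unusual_groups(lines):
    """Return (line_number, group) pairs whose second column falls outside SAFE_GROUP."""
    hits = []
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue  # skip rows without a second column
        group = columns[1]
        if not SAFE_GROUP.fullmatch(group):
            hits.append((number, group))
    return hits


# Two rows shaped like the names file above (first column shortened for brevity).
sample = [
    "tr|A0A0B4K776|A0A0B4K776_DROME\tGN=Dmel\\CG43371\n",
    "tr|A0A0B4KF40|A0A0B4KF40_DROME\tGN=St3\n",
]

for number, group in find_unusual_groups(sample):
    print(f"line {number}: {group}")
```

To scan the real file, pass an open file handle instead of `sample`, for example `find_unusual_groups(open("drosophila.fa.names"))`.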