dib-lab / kProcessor

kProcessor: kmers processing framework.
https://kprocessor.readthedocs.io
BSD 3-Clause "New" or "Revised" License

names file parsing error #81

Open mr-eyes opened 3 years ago

mr-eyes commented 3 years ago

When the second column holds unusual characters...

Reproduce:

proteome=drosophila.fa
wget --quiet https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000000803/UP000000803_7227.fasta.gz -O ${proteome}.gz
gunzip ${proteome}.gz

import kProcessor as kp

# Generate names file
with open("drosophila.fa", 'r') as FASTA, open("drosophila.fa.names", 'w') as NAMES:
    for line in FASTA:
        if line.startswith('>'):
            line = line[1:]  # Remove the leading '>'
            geneID = line[line.find("GN"):line.find("PE") - 1]  # "GN=..." up to the space before "PE="
            NAMES.write(f"{line.strip()}\t{geneID}\n")

kSize = 7
fasta_file = "drosophila.fa"
chunkSize = 100
names_file = fasta_file + ".names"

KF = kp.kDataFramePHMAP(kp.PROTEIN, kp.protein_hasher, {"kSize":kSize})

cKF = kp.index(KF, fasta_file, chunkSize, names_file)
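
Side note on the names-file step: `str.find` returns -1 when the substring is missing, so a header without a `GN` or `PE` field would produce an arbitrary slice. A minimal sketch of a regex-based alternative follows. The helper name `gene_field` is hypothetical, and it assumes the gene name contains no whitespace. It keeps the backslash in the value, so it should not hide the parsing error reported here.

```python
import re

# Matches a UniProt-style "GN=<value>" token; the value is assumed to run
# up to the next whitespace character.
GN_FIELD = re.compile(r"\bGN=(\S+)")


def gene_field(header: str) -> str:
    """Return the 'GN=...' token from a FASTA header, or '' if there is none."""
    match = GN_FIELD.search(header)
    return match.group(0) if match else ""
```

In the loop above, `geneID = gene_field(line)` would be the drop-in use; headers that return an empty string could be skipped or reported instead of written to the names file.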

Sample from the names file (the two columns are tab-separated)

tr|A0A0B4K776|A0A0B4K776_DROME Uncharacterized protein OS=Drosophila melanogaster OX=7227 GN=Dmel\CG43371 PE=4 SV=1 GN=Dmel\CG43371
tr|A0A0B4KF40|A0A0B4KF40_DROME Sulfotransferase 3, isoform C OS=Drosophila melanogaster OX=7227 GN=St3 PE=4 SV=1    GN=St3
tr|A0A0B4KF61|A0A0B4KF61_DROME Synaptotagmin 14, isoform D OS=Drosophila melanogaster OX=7227 GN=Syt14 PE=4 SV=1    GN=Syt14
tr|A1A6Q5|A1A6Q5_DROME Acylphosphatase OS=Drosophila melanogaster OX=7227 GN=Dmel\CG34161 PE=2 SV=1 GN=Dmel\CG34161
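
To narrow down which rows are affected, here is a rough sketch that scans the second column for characters outside a conservative set. The allowed set and the helper name `unusual_groups` are assumptions made for illustration, not something kProcessor defines; the backslash in `Dmel\CG43371` is simply the most visibly unusual character in the sample above.

```python
import re

# Assumption: letters, digits and a few separators are "safe";
# anything else in the second column is reported.
UNUSUAL = re.compile(r"[^A-Za-z0-9=_.:\-]")


def unusual_groups(lines):
    """Yield (line_number, second_column, offending_chars) for suspicious rows."""
    for number, line in enumerate(lines, start=1):
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 2:
            # Rows that do not split into exactly two tab-separated columns.
            yield number, None, "expected exactly 2 tab-separated columns"
            continue
        bad = sorted(set(UNUSUAL.findall(columns[1])))
        if bad:
            yield number, columns[1], "".join(bad)
```

It can be run over the generated file with `with open("drosophila.fa.names") as names: print(list(unusual_groups(names)))`.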