brentp / methylcode

Alignment and Tabulation of BiSulfite Treated Reads

pyfasta extract KeyError #10

Closed: chc284 closed this issue 8 years ago

chc284 commented 8 years ago

Hi there,

I am trying to run pyfasta as follows:

pyfasta extract --header --fasta 081315KT27F_pr.fasta 081315KT27F_pr_Chlorella_BG11_ids_test.txt > no_space.txt

My header lines looked like this:

1::M01522:27:000000000-AHFRY:1:2107:16064:6322 1:N:0:7

I received the following error:

Traceback (most recent call last):
  File "/usr/bin/pyfasta", line 9, in <module>
    load_entry_point('pyfasta==0.5.2', 'console_scripts', 'pyfasta')()
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 38, in main
    globals()[action](sys.argv[2:])
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 134, in extract
    seq = f[seqname]
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '>3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7'

I then tried just looking for one of the ids in the ids list:

pyfasta extract --header --fasta ~/Documents/Data/081315KT27F_pr.fasta 20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7 subset.fa

Traceback (most recent call last):
  File "/usr/bin/pyfasta", line 9, in <module>
    load_entry_point('pyfasta==0.5.2', 'console_scripts', 'pyfasta')()
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 38, in main
    globals()[action](sys.argv[2:])
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 134, in extract
    seq = f[seqname]
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '20::M01522:27:000000000-AHFRY:1:2106:17777:11432'

I removed whitespace from header lines in both the ids list and fasta file, so they look like this:

4::M01522:27:000000000-AHFRY:1:2103:12609:208751:N:0:7

... but I'm still getting:

[chc@pmpc1451 test]$ pyfasta extract --header --fasta 081315KT27F_pr.fasta 081315KT27F_pr_Chlorella_BG11_ids_test.txt > no_space.txt
Traceback (most recent call last):
  File "/usr/bin/pyfasta", line 9, in <module>
    load_entry_point('pyfasta==0.5.2', 'console_scripts', 'pyfasta')()
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 38, in main
    globals()[action](sys.argv[2:])
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 134, in extract
    seq = f[seqname]
  File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '081315KT27F_pr_Chlorella_BG11_ids_test.txt'

Thanks for any help in advance!

brentp commented 8 years ago

please use pyfaidx: https://github.com/mdshw5/pyfaidx

chc284 commented 8 years ago

Apologies, but I have limited experience with Python and wanted to run this from the command line. I can't see an option to supply a text file of sequence identifiers as a query against a FASTA file. There are 266,799 sequences that I want to extract.

brentp commented 8 years ago

Instead of 3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7 as the key, use 3::M01522:27:000000000-AHFRY:1:2112:23391:15014 (that is, the header text up to the first space).
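A minimal sketch of that key clean-up, for anyone who needs to fix a whole ID list first. It assumes the lookup key is the header text up to the first whitespace, as the example above suggests. `normalize_id`, `read_fasta`, and `extract_records` are hypothetical helper names, not part of pyfasta or pyfaidx, the toy parser only stands in for a real indexed reader, and the sequences in the usage lines are made up.

```python
def normalize_id(line):
    """Reduce a FASTA header or ID-list line to its lookup key.

    Drops a leading '>' and everything after the first whitespace.
    """
    fields = line.strip().lstrip(">").split(None, 1)
    return fields[0] if fields else ""


def read_fasta(lines):
    """Parse FASTA text lines into {key: sequence}, keyed by normalize_id."""
    records = {}
    key = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            key = normalize_id(line)
            records[key] = []
        elif key is not None and line:
            records[key].append(line)
    return {k: "".join(parts) for k, parts in records.items()}


def extract_records(fasta_lines, id_lines):
    """Return (key, sequence) pairs for the requested IDs, in request order."""
    records = read_fasta(fasta_lines)
    wanted = [normalize_id(l) for l in id_lines if l.strip()]
    return [(k, records[k]) for k in wanted if k in records]


# Usage with headers from this thread; the sequences are placeholders.
fasta = [
    ">3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7",
    "ACGT",
    ">20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7",
    "TTGA",
]
ids = [">3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7"]
print(extract_records(fasta, ids))
```

Normalizing both the FASTA headers and the ID list with the same function keeps the two sides consistent, which is the mismatch the KeyError above points to.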

brentp commented 8 years ago

@mdshw5 does pyfaidx have a command-line utility to extract by name?

chc284 commented 8 years ago

I've just successfully extracted my list of sequences by using the --regex option on the command line. Thank you for your advice. My command was as follows:

faidx 081315KT27F_pr.fasta --regex "\b[123]\b::" -o 081315KT27F_pr_ChlorellaBG11.fasta

Where 081315KT27F_pr.fasta is the file containing all my sequencing reads, the pattern in quotation marks is my regex search term, and 081315KT27F_pr_ChlorellaBG11.fasta is my output file, containing just the sequences that matched the regex.
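To see which names that pattern picks up, here is a small sketch using Python's `re` module. It assumes the pattern is applied as a search against each sequence name, which is an assumption about how the faidx script uses `--regex`; `matches` is a hypothetical helper, and the names are taken from earlier in this thread.

```python
import re

# The pattern from the faidx command above.
PATTERN = re.compile(r"\b[123]\b::")


def matches(name):
    """True if the pattern is found anywhere in the sequence name."""
    return PATTERN.search(name) is not None


names = [
    "1::M01522:27:000000000-AHFRY:1:2107:16064:6322",
    "3::M01522:27:000000000-AHFRY:1:2112:23391:15014",
    "4::M01522:27:000000000-AHFRY:1:2103:12609:208751:N:0:7",
    "20::M01522:27:000000000-AHFRY:1:2106:17777:11432",
]

# Keep only names whose prefix before "::" is exactly 1, 2, or 3.
selected = [n for n in names if matches(n)]
print(selected)
```

The `\b` word boundaries are what keep prefixes such as `13::` or `20::` from matching: the digit before `::` has to stand alone.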

mdshw5 commented 8 years ago

> @mdshw5 does pyfaidx have a command-line utility to extract by name?

@brentp yes, it does: https://github.com/mdshw5/pyfaidx#cli-script-faidx I should probably add a link at the top of the README since it's gotten sort of buried at the end.

@chc284: I'm glad you found a solution using the faidx script. Nice use of the --regex option as well. I'm glad to see it was worthwhile to implement.