chc284 closed this issue 8 years ago
please use pyfaidx: https://github.com/mdshw5/pyfaidx
Apologies, but I have limited experience with Python and wanted to run this from the command line. I can't see an option to give a text file of sequence identifiers to run as a query against a fasta file. There are 266,799 sequences that I want to extract.
instead of 3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7
as the key, use 3::M01522:27:000000000-AHFRY:1:2112:23391:15014
@mdshw5 does pyfaidx have a command-line utility to extract by name?
I've just successfully extracted my list of sequences by using the --regex option on the command line. Thank you for your advice. My command was as follows:
faidx 081315KT27F_pr.fasta --regex "\b[123]\b::" -o 081315KT27F_pr_ChlorellaBG11.fasta
Here 081315KT27F_pr.fasta is the file containing all my sequencing reads, the quoted string after --regex is my regex search term, and 081315KT27F_pr_ChlorellaBG11.fasta is my output file, which contains just the sequences that matched the regex search term.
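To see which read names a pattern like this selects, it can be tried with Python's standard `re` module. This is only a sketch: it assumes the pattern is searched against each sequence name, and the third name in the list is a made-up variant of the headers in this thread.

```python
import re

# Same pattern as passed to --regex in the command above.
pattern = re.compile(r"\b[123]\b::")

# Read names modeled on the headers in this thread.
# The "13::" entry is hypothetical, added only to test the word boundaries.
names = [
    "3::M01522:27:000000000-AHFRY:1:2112:23391:15014",
    "20::M01522:27:000000000-AHFRY:1:2106:17777:11432",
    "13::M01522:27:000000000-AHFRY:1:2106:17777:11432",
]

# \b on both sides of [123] requires the digit to stand alone, so a
# prefix of "3::" matches while "20::" and "13::" do not.
matched = [name for name in names if pattern.search(name)]
print(matched)
```

Only the name with the `3::` prefix is kept, because the word boundaries stop the single digit from matching inside a longer number.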
@brentp yes, it does: https://github.com/mdshw5/pyfaidx#cli-script-faidx
I should probably add a link at the top of the README since it's gotten sort of buried at the end. @chc284: I'm glad you found a solution using the faidx script. Nice use of the --regex option as well - I'm glad to see it was worthwhile to implement.
Hi there,
I am trying to run pyfasta as follows:
I then tried just looking for one of the ids in the ids list:
pyfasta extract --header --fasta ~/Documents/Data/081315KT27F_pr.fasta 20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7 subset.fa
Traceback (most recent call last):
File "/usr/bin/pyfasta", line 9, in
load_entry_point('pyfasta==0.5.2', 'console_scripts', 'pyfasta')()
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 38, in main
globals()action
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 134, in extract
seq = f[seqname]
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/fasta.py", line 128, in __getitem__
c = self.index[i]
KeyError: '20::M01522:27:000000000-AHFRY:1:2106:17777:11432'
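The KeyError above names only the part of the header before the space, which is consistent with the shell splitting the unquoted header into two arguments. The sketch below uses Python's `shlex` module to mimic POSIX shell word splitting. The file name `reads.fasta` is a placeholder, and the sketch says nothing about how pyfasta keys its index.

```python
import shlex

# Header exactly as it appears in the command above, including the space.
header = "20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7"

# "reads.fasta" is a placeholder file name used only for this illustration.
base = "pyfasta extract --header --fasta reads.fasta "

# Unquoted: the shell breaks the header at the space into two arguments.
unquoted_args = shlex.split(base + header)

# Quoted: the whole header reaches the program as a single argument.
quoted_args = shlex.split(base + shlex.quote(header))

print(unquoted_args[-2:])
print(quoted_args[-1])
```

Quoting only changes what the shell passes to the program. Whether that key is found still depends on how the headers were indexed.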
I removed whitespace from header lines in both the ids list and fasta file, so they look like this:
... but I'm still getting:
[chc@pmpc1451 test]$ pyfasta extract --header --fasta 081315KT27F_pr.fasta 081315KT27F_pr_Chlorella_BG11_ids_test.txt > no_space.txt
Traceback (most recent call last):
File "/usr/bin/pyfasta", line 9, in
load_entry_point('pyfasta==0.5.2', 'console_scripts', 'pyfasta')()
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 38, in main
globals()action
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/__init__.py", line 134, in extract
seq = f[seqname]
File "/usr/lib/python2.7/site-packages/pyfasta-0.5.2-py2.7.egg/pyfasta/fasta.py", line 128, in __getitem__
c = self.index[i]
KeyError: '081315KT27F_pr_Chlorella_BG11_ids_test.txt'
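The KeyError here names the ID file itself, which suggests the file name was looked up as if it were a sequence name. As a rough illustration of extracting records from a list of IDs, here is a standard-library sketch that matches on the first whitespace-delimited token of each header. It is not how pyfasta or pyfaidx work internally. The function names and the short sequences are hypothetical.

```python
def first_token(text):
    """Return the text before the first whitespace, or '' if there is none."""
    parts = text.split()
    return parts[0] if parts else ""


def read_ids(lines):
    """Collect one ID per non-empty line, keeping only the first token."""
    return {first_token(line) for line in lines if line.strip()}


def extract_by_ids(fasta_lines, wanted):
    """Yield (header, sequence) for records whose first header token is wanted."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it was requested.
            if header is not None and first_token(header) in wanted:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record, which has no following header to trigger it.
    if header is not None and first_token(header) in wanted:
        yield header, "".join(chunks)


# Headers are taken from this thread; the sequences are made up.
fasta = [
    ">3::M01522:27:000000000-AHFRY:1:2112:23391:15014 1:N:0:7",
    "ACGT",
    "TTGA",
    ">20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7",
    "GGCC",
]
ids = ["20::M01522:27:000000000-AHFRY:1:2106:17777:11432 1:N:0:7"]

result = list(extract_by_ids(fasta, read_ids(ids)))
print(result)
```

With real files, the two lists would be replaced by open file handles, so the FASTA file is streamed line by line and never loaded whole.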
Thanks for any help in advance!