dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/
Other
742 stars 295 forks source link

[From_CTB] fasta descriptions confound extract-partitions.py #175

Closed RamRS closed 9 years ago

RamRS commented 10 years ago

Transferring ctb#11 opened on 08-May-2012

I got this error trying to follow the partitioning instructions:

python /Users/jmeppley/tmp/tools/diginorm/khmer/scripts/extract-partitions.py all_pe.p32 all_pe.fasta.part

---
reading partitioned files: ['all_pe.fasta.part']
outputting to files named "all_pe.p32.groupN.fa"
min reads to keep a partition: 5
max size of a group file: 1000000
partition size distribution will go to all_pe.p32.dist

---
Traceback (most recent call last):
  File "/Users/jmeppley/tmp/tools/diginorm/khmer/scripts/extract-partitions.py", line 177, in <module>
    main()
  File "/Users/jmeppley/tmp/tools/diginorm/khmer/scripts/extract-partitions.py", line 81, in main
    for n, name, pid, seq in read_partition_file(filename):
  File "/Users/jmeppley/tmp/tools/diginorm/khmer/scripts/extract-partitions.py", line 26, in read_partition_file
    name, partition_id = name.rsplit('\t', 1)
ValueError: need more than 1 value to unpack

It looks like the code that created the partition-labelled fasta did not account for descriptions in the fasta header. Because the first line in my partition file looks like this:

>M00180:26:000000000-A0JVV:1:1:14959:1511.1 1:N:0:0 2

But the extract-partitions script is trying to parse the partition number out of the name.

I made this change to fix the problem:

diff --git a/scripts/extract-partitions.py b/scripts/extract-partitions.py
index 3c00538..9107791 100755
--- a/scripts/extract-partitions.py
+++ b/scripts/extract-partitions.py
@@ -23,7 +23,10 @@ DEFAULT_THRESHOLD=5
 def read_partition_file(filename):
     for n, record in enumerate(screed.open(filename)):
         name = record.name
-        name, partition_id = name.rsplit('\t', 1)
+        if '\t' in name:
+            name, partition_id = name.rsplit('\t', 1)
+        else:
+            description, partition_id = record.description.rsplit('\t',1)
         yield n, name, int(partition_id), record.sequence

 ###
RamRS commented 10 years ago

@mr-c @ctb Does this have anything to do with #68 or #46 ?

mr-c commented 10 years ago

@RamRS nope

devbioinfoguy commented 9 years ago

I am going to try to resolve this issue

lexnederbragt commented 9 years ago

I am going to write a test for this case - it seems to be fixed but I'll confirm that and add the test. I'm being coached by @mr-c who is visiting our lab :-)

mr-c commented 9 years ago

Working backwards @lexnederbragt and I are unable to reproduce this issue with any version of khmer released during my employment (0.5...1.3)

mr-c commented 9 years ago

And I can't find a download for an older version of khmer than v0.5. Seeing as that was released a year and a half after this issue was originally reported I am going to mark this issue as fixed once we have merged Lex's test.

ctb commented 9 years ago

Well, its probably in version control somewhere. And it might (likely) depend on screed. But if it�s fixed it�s fixed :)

mr-c commented 9 years ago

@ctb Right-o, I didn't try versions of screed older than v0.7.1

mr-c commented 9 years ago

Tested in #979

devbioinfoguy commented 9 years ago

(thumbs-up)

Cheers, B

On Apr 23, 2015, at 8:40 AM, Michael R. Crusoe notifications@github.com wrote:

Tested in #979 https://github.com/ged-lab/khmer/pull/979 — Reply to this email directly or view it on GitHub https://github.com/ged-lab/khmer/issues/175#issuecomment-95573032.


Bob Freeman, Ph.D. Research Computing Facilitator XSEDE Campus Champion FAS Research Computing Harvard University 38 Oxford Street, Rm 117 Cambridge, MA 02138

617/495.8824, vox

@DevBioInfoGuy About: http://bit.ly/1m8n0se