Open G-Brennan opened 7 years ago
Hi G-Brennan. I recently ran into the same problem and decided to work around it by modifying fasta_to_features.py so that it simply skips (i.e. ignores) kmers which contain "N"s.
The problem seems to be that the script creates a dictionary of all possible kmers based on the standard nucleotides "A", "G", "C" and "T", and then tries to look up each kmer it finds on the contig in this dictionary. Since the dictionary by definition contains no keys with "N"s, every kmer containing an "N" automatically leads to a "KeyError".
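To illustrate the failure mode described above, here is a minimal sketch. It assumes the dictionary is built roughly like this; the names `kmer_dict` and `lookup` and the exact construction are illustrative assumptions, not copied from fasta_to_features.py.

```python
from itertools import product

kmer_len = 4

# Hypothetical reconstruction: one column index per kmer over the four
# standard nucleotides only. The real script may build its dictionary
# differently; this only shows why "N" cannot be found.
kmer_dict = {
    "".join(bases): idx
    for idx, bases in enumerate(product("ACGT", repeat=kmer_len))
}

def lookup(kmer):
    """Return the column index for a kmer, or None if the key is missing."""
    try:
        return kmer_dict[kmer]
    except KeyError:
        # Any kmer with a non-ACGT character (such as "N") ends up here.
        return None

print(len(kmer_dict))   # number of keys: 4 ** kmer_len
print(lookup("ACGN"))   # no such key, so the lookup fails
```

With only four letters in the alphabet there are 4**kmer_len keys, and none of them can match a kmer that contains an "N".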
I think ignoring all "N"-containing kmers is probably the best way to go (otherwise long stretches of "N"s would create artificial similarities between contigs), but others might disagree, so I am not opening a pull request to the developers right now.
If you want to include my modification as well, you just have to add two lines before line 40 (which is part of the function "generate_features_from_fasta"):
old code:
37 for i,seq in enumerate(seqs):
38     contigs_id.append(seq.id)
39     for kmer_tuple in window(seq.seq.tostring().upper(),kmer_len):
40         contigs[i,kmer_dict["".join(kmer_tuple)]] += 1 # <--add lines before THIS line
new code (indentation is important):
37 for i,seq in enumerate(seqs):
38     contigs_id.append(seq.id)
39     for kmer_tuple in window(seq.seq.tostring().upper(),kmer_len):
40         if "N" in kmer_tuple: #added as a workaround to skip kmers containing "N"s
41             continue #added as a workaround to skip kmers containing "N"s
42         contigs[i,kmer_dict["".join(kmer_tuple)]] += 1
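For anyone who wants to try the skip logic outside the script, here is a small self-contained sketch. The `window` helper and `count_kmers_skipping_n` are stand-ins written for this example (assumptions, not the script's own functions), and a `Counter` replaces the script's count matrix.

```python
from collections import Counter
from itertools import islice

def window(seq, n):
    """Yield successive overlapping tuples of length n from seq."""
    it = iter(seq)
    win = tuple(islice(it, n))
    if len(win) == n:
        yield win
    for ch in it:
        win = win[1:] + (ch,)
        yield win

def count_kmers_skipping_n(sequence, kmer_len):
    """Count kmers in sequence, ignoring every kmer that contains an "N"."""
    counts = Counter()
    for kmer_tuple in window(sequence.upper(), kmer_len):
        if "N" in kmer_tuple:
            continue  # same workaround as above: skip "N"-containing kmers
        counts["".join(kmer_tuple)] += 1
    return counts

counts = count_kmers_skipping_n("ACGTNACGT", 4)
print(counts)
```

In this toy sequence, the four windows that overlap the "N" are dropped and only the two "ACGT" windows are counted, so an "N" never reaches the dictionary lookup.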
I agree with @jvollme that this is a good workaround if you have contigs with Ns.
Can't get fasta_to_features.py to run on contigs containing N's. Is there a workaround for this?