contigs with Ns - Githubissues

G-Brennan commented 7 years ago

Can't get fasta_to_features.py to run on contigs containing N's. Is there a workaround for this?

jvollme commented 7 years ago

Hi G-Brennan. I recently ran into the same problem and decided to work around it by modifying fasta_to_features.py so that it simply skips (i.e. ignores) kmers which contain "N"s.

The problem seems to be that the script creates a dictionary of all possible kmers based on the standard nucleotides "A", "G", "C" and "T", and then looks up each kmer it finds on the contig in this dictionary. Since the dictionary by construction contains no keys with "N"s, any kmer containing an "N" raises a "KeyError".
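To illustrate the failure mode, here is a minimal sketch. It assumes the lookup table is built roughly like this; `build_kmer_index` is a hypothetical helper, not the script's actual function, and the real script may build its dictionary differently.

```python
from itertools import product

def build_kmer_index(kmer_len):
    """Hypothetical helper: map every kmer over A, C, G, T to a column index."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=kmer_len))}

kmer_index = build_kmer_index(4)

print(len(kmer_index))       # number of keys: 4**4 combinations of A, C, G, T
print("ACGT" in kmer_index)  # a plain kmer is a valid key
print("ACNT" in kmer_index)  # a kmer with "N" is not, so indexing with it fails

try:
    kmer_index["ACNT"]
except KeyError as err:
    print("KeyError:", err)
```

Any character outside "ACGT" (not just "N") is missing from such a table for the same reason.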

I think ignoring all "N"-containing kmers is probably the best way to go (otherwise long stretches of "N"s would create artificial similarities between contigs), but I'm not sure whether others would agree, so I am not issuing a pull request to the developers right now.

If you want to include my modifications as well, you just have to add two lines before line 40 (which is part of the function "generate_features_from_fasta"):

old code:

37    for i,seq in enumerate(seqs):
38        contigs_id.append(seq.id)
39        for kmer_tuple in window(seq.seq.tostring().upper(),kmer_len):
40            contigs[i,kmer_dict["".join(kmer_tuple)]] += 1 # <--add lines before THIS line

new code (indentation is important):

37    for i,seq in enumerate(seqs):
38        contigs_id.append(seq.id)
39        for kmer_tuple in window(seq.seq.tostring().upper(),kmer_len):
40            if "N" in kmer_tuple: #added as a workaround to skip kmers containing "N"s
41                continue #added as a workaround to skip kmers containing "N"s
42            contigs[i,kmer_dict["".join(kmer_tuple)]] += 1
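For anyone who wants to try the skip logic outside the script, here is a self-contained sketch. The names `window` and `count_kmers` below are stand-ins written for this example (the script has its own `window` helper, which may differ), and a plain list replaces the script's count matrix.

```python
from itertools import product

def window(seq, n):
    """Stand-in sliding window: yield successive length-n tuples of characters."""
    for start in range(len(seq) - n + 1):
        yield tuple(seq[start:start + n])

def count_kmers(sequence, kmer_len):
    """Count kmers over A, C, G, T, skipping any kmer that contains "N"."""
    kmer_index = {"".join(p): i
                  for i, p in enumerate(product("ACGT", repeat=kmer_len))}
    counts = [0] * len(kmer_index)
    for kmer_tuple in window(sequence.upper(), kmer_len):
        if "N" in kmer_tuple:  # same check as the two added lines above
            continue
        counts[kmer_index["".join(kmer_tuple)]] += 1
    return counts

# Two copies of ACGT separated by a short run of Ns.
counts = count_kmers("ACGTNNACGT", 4)
print(sum(counts))  # only the N-free windows are counted
```

Note that this check only covers "N". Other IUPAC ambiguity codes (for example "R" or "Y") would still raise a KeyError; testing `set(kmer_tuple) <= set("ACGT")` instead would skip those as well.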

alneberg commented 7 years ago

I agree with @jvollme that this is a good workaround if you have contigs with Ns.

BinPro / CONCOCT

contigs with Ns #171