Over-counting of unitigs

apredeus commented 4 years ago

Hello John,

When I was reporting another issue (fixed rev-comp problem with unitig-caller), I've realized unitig-counter seems to make erroneous calls if a contig in particular isolate starts with a sequence matching the unitig (but does not in fact contain the entire unitig). This leads to errors in AF estimation, as well as erroneous inclusion of isolates as carrying the variant (and thus possibly adding noise to associations).

For example, I have a 49 bp unitig that is present (according to the unitig-counter) in 2936 isolates. However, when I run

blastn -query unitigs.fa -db all_genomes.fa -outfmt 6 -dust no -soft_masking false -max_target_seqs 5000000

, I get only 2911 100% 49 bp matches. For the rest of the samples, hits look like this:

n3 66 100.00 31 0 0 1 31 31 1 4e-10 58.4

bwa fastmap also confirms this. It seems that if the contig starts with a perfect match (of smaller size), unitig-counter considers the unitig to be present in this isolate.

Unitig-caller after update produces the results identical to those of blastn on every occasion.

Overall, becomes a potentially much bigger problem for longer unitigs; the biggest difference I've seen in my dataset is fourfold (11 real vs 42 reported by unitig-counter).

johnlees commented 4 years ago

Thanks for the detailed investigation and bug report. Just so I'm sure I understand:

Is this problem still happening in v1.0.1, where the rev-comp problem is fixed?
In your example, is it the case that the first 31 bases of the 49bp unitig match the start of a contig sequence, but the final 18 do not? It is then called as present by the counter?

I'm not sure why this would happen, but sounds like it's worth looking into

apredeus commented 4 years ago

Hello John,

yes, the problem with unitig-caller is fully fixed now - the number of isolates it finds perfectly matches that of blast for every unitig out of ~ 500 in my test set. The problem I'm describing is with unitig-counter
yes, exactly - first 31 bp match, the rest don't, but unitig-counter still considers the unitig to be present in this isolate. I think it might have something to do with how compressed DBG is constructed, but I'm guessing here, since I don't know the algorithm.

johnlees commented 4 years ago

Ah, why did I name this packages such similar names!! Got confused between which one I was on here.

I'm not an expert on the cDBG from gatb, but @rchikhi might have an answer if he has time

We are also trying an alternative cDBG option which should be released soon, hopefully that might fix this.

apredeus commented 4 years ago

For now, I'm thinking, one can use the output of unitig-caller to do the associations - I think this would be more accurate, esp. for longer/low AF unitigs.

rchikhi commented 4 years ago

I'm not quite following the issue here. What do you mean by the first 31 bases of the 49bp unitig match the start of a contig, especially, what is a contig in this context? EDIT: the contig of an assembled isolate, I guess.

Then it could be that, since unitigs typically overlap by exactly (k-1) nucleotides, the match you are observing could be the overlapping portion between two (or more) unitigs.

apredeus commented 4 years ago

@rchikhi short-read sequencing assemblies are usually fragmented - so each sample is represented by a group of assembled contigs. When you map the unitig in question to the sample's assembly, it does not map exactly - only 31 bp of it does.

Then it could be that, since unitigs typically overlap by exactly (k-1) nucleotides, the match you are observing could be the overlapping portion between two (or more) unitigs.

I'm not sure what the reason is - but I'm pretty sure the isolate should not count as containing this unitig.

rchikhi commented 4 years ago

@apredeus, under the assumption that you ran unitig-counter with the default k-mer size of 31, then what I said above regarding (k-1)-matches would imply seeing a 30-base match. But since you're seeing 31 bases perfect matches, it clearly means that the flagged unitig for some reason starts with the first kmer of a contig but doesn't continue further. This indeed coud look like a bug in unitig construction.

So I decided to investigate it. I took the 616 genomes from pyseer's tutorial and built a cDBG using unitig-counter, k=31.

Ran the following commands:

ls -1 /mnt/d/work/pyseer_tutorial/*.fa > refs.txt
awk '{printf "ref%d\t%s\n", NR, $0}' < refs.txt > refs_pyseer.txt
./unitig-counter  -strains refs_pyseer.txt
python to_fasta.py

cat /mnt/d/work/pyseer_tutorial/*.fa > all_genomes.fa
python rename_genomes.py
makeblastdb -in all_genomes.renamed.fa  -input_type fasta -dbtype nucl

blastn -query unitigs.fa -db all_genomes.renamed.fa -outfmt 6 -dust no \
       -soft_masking false -max_target_seqs 5000000 > blast_output

python analyze_results.py

with to_fasta.py containing:

g = open("unitigs.fa","w")
i = 0
for line in open("../output/unitigs.txt"):
    utg = line.split()[0]
    g.write(">utg_%d_len_%d\n%s\n" % (i,len(utg),utg))
    i = i+1
g.close()

and rename_genomes.py containing:

d = dict()
for line in open("refs_pyseer.txt"):
    refname, filename = line.split()
    true_refname = '.'.join(open(filename).readline().strip()[1:].split('.')[:-1])
    print(refname,filename,true_refname)
    d[true_refname] = refname

g = open("all_genomes.renamed.fa","w")
for line in open("all_genomes.fa"):
    if line.startswith(">"):
        true_refname = '.'.join(line.strip()[1:].split('.')[:-1])
        suffix = line.strip()[1:].split('.')[-1]
        print(true_refname)
        g.write(">%s.%s\n" % (d[true_refname],suffix))
    else:
        g.write(line)
g.close()

and analyze_results.py containing:

# detects unitigs in the blast output such that a portion of the unitig maps perfectly over 31 bp, and another portion maps perfectly over > 31 bp
prev = None
cur_utg_alen = set()

def analyze(utg_name):
    if 31 in cur_utg_alen and len(cur_utg_alen) > 1:
        print("interesting unitig",utg_name,cur_utg_alen)

for line in open('blast_output'):
    ls = line.split()
    utg_name = ls[0]
    if utg_name != prev:
        if prev is not None:
            analyze(prev)
        prev = utg_name
        cur_utg_alen = set()
    if "len_31" in utg_name: continue
    pident = float(ls[2])
    if pident != 100.0: continue
    length = int(ls[3])
    cur_utg_alen.add(length)

This gave me a list of unitigs that look like the ones you've spotted, @apredeus: 31-bp exact matches but also full-length matches to contigs.

I looked specificallt at one of those unitigs: utg_4996_len_37 (sequence: AAAATCTTGCGCAATAAAGCTCATCTCCATCTCCCGA) has 37 base pairs and BLAST reports the following hits:

utg_4996_len_37 ref38.14        100.000 37      0       0       1       37      54611   54575   2.04e-11        69.4
utg_4996_len_37 ref33.11        100.000 37      0       0       1       37      51596   51560   2.04e-11        69.4
|..]
utg_4996_len_37 ref175.21       100.000 31      0       0       7       37      1       31      4.42e-08        58.4
utg_4996_len_37 ref175.14       100.000 31      0       0       7       37      53530   53500   4.42e-08        58.4

So, full-lengths hits as well as 31-bp hits at the extremities, same as the problem you report. I constructed unitigs using a different method (BCALM2) to check whether this unitig was incorrectly constructed, using these commands:

 ~/bcalm/build/bcalm -in all_genomes.fa -kmer-size 31 -abundance-min 1
  grep utg_4996_len_37 unitigs.fa -A 1
  grep AAAATCTTGCGCAATAAAGCTCATCTCCATCTCCCGA all_genomes.unitigs.fa  -B 1

and it turns out that both unitigs are identical. Thus I'm ruling out an error when constructing the cDBG.

However, @apredeus can you confirm that the unitigs that are affected by the problem always have such problematic matches at the extremities of contigs?

My working hypothesis is that those unitigs are fine but need to be further splitted, specifically, whenever they overlap an extremity of a contig. By construction, a DBG (hence the resulting compacted DBG) is not 'aware' of extremities of reference contigs. A DBG only sees unordered bags of kmers. What can happen is that a unitig is constructed by taking k-mers from an extremity of a contig and continuing to the extremity of another contig (from possibly even another isolate). This is not common but definitely possible, as contigs would sometimes lose the genomic variation that led to them being interrupted in an assembly.

I did make a script a while ago to do exactly this unitig-cutting: https://github.com/GATB/bcalm/blob/master/scripts/pufferize.py So @johnlees, bottom line is that you might want to run that script on the output of any cDBG construction (such as unitig-counter) so that unitigs are cut at extremities of contigs.

johnlees commented 4 years ago

@rchikhi thank you for this incredibly detailed investigation! I really appreciate all the time that you have spent on this issue (and others)

Your answer certainly makes sense. The input is concatenated contigs, so it certainly seems possible that unitigs could span these. As you say it seems like these cases could either be biologically real or artificial – where there is a contig break in one poorly assembled sample may have a contiguous sequence in another assembly which effectively bridges it. This would be down to the mapping rather than cDBG construction.

Great that you already have a solution. Running unitig-caller on the output would be another (likely less efficient) option. I'll add both of these choices to the documentation now

rchikhi commented 4 years ago

Hi @johnless, you're welcome. One more thing. I've noticed that pufferize.py outputs in GFA format which isn't too convenient. And I also noticed that unitig-counter at one point constructs three chimeric unitigs (out of 750k) that contain the same kmer twice, likely due to a small DBG bug. For these reasons I've modified pufferize.py to output in FASTA format and also report these problems instead of crashing :) The new script is https://github.com/GATB/bcalm/blob/master/scripts/split_unitigs.py

bacpop / unitig-counter

Over-counting of unitigs #10