Weird behavior of CutAdapt -a/-g/-b option

GoogleCodeExporter commented 9 years ago

Hi, I tried to testing the performance of CutAdapt.  So I simulated 1M reads 
with 2 adapters randomly added to either 3' or 5' end.  I also randomized the 
length of adapter sequences to be added to the read.  

As a summary, I generated 39957 contaminated reads, half of them contaminated 
with adapter 1 and the other half contaminated with adapter 2.  The size of 
adapter 1 is 25 bp, the size of adapter 2 is 33 bp. 

Then I ran CutAdapt 3 times using the following commands:

CutAdapt -b adapter_1 -b adapter_2 contamined.reads
CutAdapt -a adapter_1 -a adapter_2 contamined.reads
CutAdapt -g adapter_1 -g adapter_2 contamined.reads

Here is the histogram of adapter lengths. (For the sake of the issue, I only 
posted the relevant information):

For command: CutAdapt -b adapter_1 -b adapter_2 contamined.reads
===Adapter 1===
Histogram of adapter lengths (5')
length  count
24  375
25  402

Histogram of adapter lengths (3' or within)
length  count
24  400
25  394

=== Adapter 2 ===
Histogram of adapter lengths (5')
length  count
32  277
33  288

Histogram of adapter lengths (3' or within)
length  count
32  301
33  346

For command: CutAdapt -a adapter_1 -a adapter_2 contamined.reads
===Adapter 1===
Histogram of adapter lengths 
length  count
24  400
25  1558

=== Adapter 2 ===
Histogram of adapter lengths
length  count
32  301
33  1526

For command: CutAdapt -g adapter_1 -g adapter_2 contamined.reads
===Adapter 1===
Histogram of adapter lengths 
length  count
24  375
25  1604

=== Adapter 2 ===
Histogram of adapter lengths
length  count
32  277
33  1546

In my simulation, there are 402 reads contaminated with adapter 1 of size 25 bp 
at 5' end, 394 reads contaminated with adapter 1 of size 25 bp at 3' end, 288 
reads contaminated with adapter 2 of size 33 bp at 5' end, 346 reads 
contaminated with adapter 2 with size 33 bp at 3' end. 

Therefore, when I used -b option, the sensitivity and specificity of CutAdapt 
are almost 100%.  But when I used -a or -g option, the sensitivity of CutAdapt 
is still 100%, while the false positive rate of trimming increased 
significantly (from 0 to around 0.1%).

I am really confused by this result.  

Based on a previous post: http://code.google.com/p/cutadapt/issues/detail?id=8, 
I thought the decreased specificity might be caused by the error-tolerate 
mapping in longer sequence contamination.  To test this possibility, I ran 
command:  CutAdapt -a adapter_1 -a adapter_2 -e 0.05 contamined.reads

===Adapter 1===
Histogram of adapter lengths 
length  count
24  400
25  1171

=== Adapter 2 ===
Histogram of adapter lengths
length  count
32  301
33  911

So I did observe a decrease in the false positive rate when I used -e 0.05.  

But why the specificity is "perfect" when I used the -b option?

I am running the test using cutadpat 1.0 version on CentOS.

Best,
Ying

Original issue reported on code.google.com by yingzi.z...@gmail.com on 6 Jul 2012 at 5:23

GoogleCodeExporter commented 9 years ago

By the way, if you need my simulated dataset, drop me a note, and I will post 
it here.

Original comment by yingzi.z...@gmail.com on 6 Jul 2012 at 5:24

GoogleCodeExporter commented 9 years ago

I also noticed that the histogram of adapter lengths is still not accurate.

In my simulation, for command: CutAdapt -b adapter_1 -b adapter_2 
contamined.reads
===Adapter 1===
Histogram of adapter lengths (5')
length  count
9   412
10  468
11  421
12  396
13  410
14  379

Histogram of adapter lengths (3' or within)
length  count
9   409
10  472
11  416
12  392
13  412
14  403

=== Adapter 2 ===
Histogram of adapter lengths (5')
length  count
9   295
10  345
11  288
12  323
13  276
14  304

Histogram of adapter lengths (3' or within)
length  count
9   301
10  351
11  335
12  299
13  299
14  302

Thus, the sum of reads trimmed at 9-14 bp are:
length count
9   1417
10  1636
11  1460
12  1410
13  1397
14  1388

However, when I checked the trimmed reads, I found the number of reads trimmed 
at the specific length is not the same as reported by the histogram:
length count
9   1505
10  1554
11  1453
12  1412
13  1398
14  1386

Is there something I should pay attention to?

Best,
Ying

Original comment by yingzi.z...@gmail.com on 6 Jul 2012 at 5:58

GoogleCodeExporter commented 9 years ago

I’d like to reproduce your experiment. Could you send me the dataset via mail 
or attach it to this issue?

Original comment by marcel.m...@tu-dortmund.de on 9 Jul 2012 at 1:36

GoogleCodeExporter commented 9 years ago

Hi, Marcel,

Since my read file is larger than 10M, I emailed it to you.  Please play around 
with it.

Thank you.

Best,
Ying

Original comment by yingzi.z...@gmail.com on 9 Jul 2012 at 6:55

GoogleCodeExporter commented 9 years ago

Ok, thanks for the mail. I think I've figured it out. I've only looked at the 
comparison of -b vs -a, but the problem is probably the same for -b vs -g.

When you use the -a option to specify a 3' adapter, you're essentially saying 
that the adapter *must* occur *after* the sequence you are interested in. The 
adapter itself and everything following it is trimmed. This is already in the 
README and is probably clear. But, the problem is that in your data, you also 
have reads with simulated full-length 5' adapters, which are in the beginning 
of the read. If you use -a and such a read is found, the entire read is trimmed 
(leaving a read of length zero). In your example above, every time this 
happens, it will also increase the counter for reads that were found at full 
length (25 or 33 in your case).

I suggest that you update to cutadapt 1.1, in which the histogram output has 
changed slightly and should be a little less confusing (although it still is 
not perfect, I admit). For cutadapt 1.1, I get output like this when using your 
command line with the two '-a' adapters:

=== Adapter 1 ===

Adapter 'AATGATACGGCGACCACCGAGATCT', length 25, was trimmed 30712 times.

Lengths of removed sequences
length  count   expected
[...]
17      394     0.0
18      415     0.0
19      407     0.0
20      372     0.0
21      426     0.0
22      369     0.0
23      408     0.0
24      400     0.0
25      394     0.0
>=50    1164    0.0

Note how it now says "lengths of removed sequence" and not "Histogram of 
adapter lengths", which is different from cutadapt 1.0 and more what one 
expects. You'll see immediately that the 394 adapters of length 25 are 
correctly found, but that also in 1164 cases, the adapter was found in the 
beginning of the read, trimming the entire read of 50 bp. The 1164 is a bit too 
high, but this is due to the error rate. If you reduce it to zero, the result 
should be better.

(Note that the histogram is cut off at a certain point and for all lengths 
after this point, only the cumulative counts are shown. While playing around 
with your data, I realized that this is not helpful and I have now changed that 
behavior in the Git version of cutadapt.)

I hope this fully explains the discrepancies you've been seing.

Original comment by marcel.m...@tu-dortmund.de on 9 Jul 2012 at 10:18

GoogleCodeExporter commented 9 years ago

Thank you, Marcel, for this clear explanation.

It also showed me why I saw the removal of an entire read in my result.

Just a side question, do you have any recommendation on using CutAdapt to 
remove adapter sequences?

Obviously, based on my experiment, I will stick to -b option.  If so, what is 
the point to use -a/-g option?

In some sense, I think the -g/-a option is mis-leading, because it still 
searches for the presence of adapter sequences "anywhere" in the read.  
Previously, when I used -a option, I though it would search the adapter 
sequences from the 3' end and stop when reaching the full length of the 
adapter, which means, -a option would only match the expression pattern for the 
25 or 33 bp at the 3' end.  I don't expect it to extend the pattern match to 5' 
end.

On the other hand, why the -a/-g option will treat the full length adapter 
sequences differently than the part sequences?  I mean, I don't think there is 
significant difference between the 23-, 24-, and 25-bp of contamination.  
However, only the 25-bp "contamination" results in over-trimming at 5' end, 
which means reads like this "AATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNN" 
got trimmed, but reads like this 
"ATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNNN" was not trimmed.  So how do 
you define the pattern match for -a/-g option?  Why searching for the 24-bp 
contamination will not extend to 5' end?

Original comment by yingzi.z...@gmail.com on 10 Jul 2012 at 2:39

GoogleCodeExporter commented 9 years ago

> do you have any recommendation on using CutAdapt to remove adapter sequences?

It depends on the real data that you have and what you expect could happen. If 
an adapter can occur only in the 5' end, then using -a makes sense: If it's in 
the beginning of the read, then trimming the read to length zero is the right 
thing to do. That could happen when the 5' and 3' adapters are attached to each 
other without anything between them. Using -b in that setting would be 
problematic as the sequence after the adapter could just be noise or another 
adapter.

> In some sense, I think the -g/-a option is mis-leading, because it still 
searches for the presence of adapter sequences "anywhere" in the read.

I had to name the options somehow -- it's just a mnemonic. The authoritative 
description is in the README. For the -a adapter, it says explicitly that the 
read will be empty if the adapter is found in the beginning.

> So how do you define the pattern match for -a/-g option? Why searching for 
the 24-bp contamination will not extend to 5' end?

For -a, the start of the adapter must be within the read. There are two reasons.

First, the model behind this is that the sequence is "5' adapter - insert - 3' 
adapter". The extreme case is that the insert is empty. Allowing to see a 
proper suffix of the 3' adapter would be something else entirely: We then also 
allow some kind of degradation of the 3' adapter. But then we'd also need to 
allow that we see a degraded adapter when the insert is nonempty. This kind of 
degradation may perhaps happen, I don't know. It could even be allowed in 
cutadapt, but since the model is different, it would have to be implemented as 
a different kind of adapter.

Second, there is a practical reason: If suffixes of a 3' adapter were allowed 
to match in the beginning of the read, then short, random matches, in which 
only the last few bases of the adapter match would lead to the read being 
trimmed to length zero.

If you encounter real data in which the assumptions that cutadapt makes don't 
fit the physical process, I'm eager to learn about that and change or extend 
cutadapt to model that process (and I have done so in the past).

Original comment by marcel.m...@tu-dortmund.de on 10 Jul 2012 at 3:39

GoogleCodeExporter commented 9 years ago

I see.

Thanks again.

Original comment by yingzi.z...@gmail.com on 10 Jul 2012 at 3:55

GoogleCodeExporter commented 9 years ago

I hope it's ok to close the issue now. Please open a new one for other problems 
or e-mail me.

Original comment by marcel.m...@tu-dortmund.de on 10 Jul 2012 at 4:08

Changed state: Done

alexstaj / cutadapt

Weird behavior of CutAdapt -a/-g/-b option #47