Closed GoogleCodeExporter closed 9 years ago
By the way, if you need my simulated dataset, drop me a note, and I will post
it here.
Original comment by yingzi.z...@gmail.com
on 6 Jul 2012 at 5:24
I also noticed that the histogram of adapter lengths is still not accurate.
In my simulation, for command: CutAdapt -b adapter_1 -b adapter_2
contamined.reads
===Adapter 1===
Histogram of adapter lengths (5')
length count
9 412
10 468
11 421
12 396
13 410
14 379
Histogram of adapter lengths (3' or within)
length count
9 409
10 472
11 416
12 392
13 412
14 403
=== Adapter 2 ===
Histogram of adapter lengths (5')
length count
9 295
10 345
11 288
12 323
13 276
14 304
Histogram of adapter lengths (3' or within)
length count
9 301
10 351
11 335
12 299
13 299
14 302
Thus, the sum of reads trimmed at 9-14 bp are:
length count
9 1417
10 1636
11 1460
12 1410
13 1397
14 1388
However, when I checked the trimmed reads, I found the number of reads trimmed
at the specific length is not the same as reported by the histogram:
length count
9 1505
10 1554
11 1453
12 1412
13 1398
14 1386
Is there something I should pay attention to?
Best,
Ying
Original comment by yingzi.z...@gmail.com
on 6 Jul 2012 at 5:58
I’d like to reproduce your experiment. Could you send me the dataset via mail
or attach it to this issue?
Original comment by marcel.m...@tu-dortmund.de
on 9 Jul 2012 at 1:36
Hi, Marcel,
Since my read file is larger than 10M, I emailed it to you. Please play around
with it.
Thank you.
Best,
Ying
Original comment by yingzi.z...@gmail.com
on 9 Jul 2012 at 6:55
Ok, thanks for the mail. I think I've figured it out. I've only looked at the
comparison of -b vs -a, but the problem is probably the same for -b vs -g.
When you use the -a option to specify a 3' adapter, you're essentially saying
that the adapter *must* occur *after* the sequence you are interested in. The
adapter itself and everything following it is trimmed. This is already in the
README and is probably clear. But, the problem is that in your data, you also
have reads with simulated full-length 5' adapters, which are in the beginning
of the read. If you use -a and such a read is found, the entire read is trimmed
(leaving a read of length zero). In your example above, every time this
happens, it will also increase the counter for reads that were found at full
length (25 or 33 in your case).
I suggest that you update to cutadapt 1.1, in which the histogram output has
changed slightly and should be a little less confusing (although it still is
not perfect, I admit). For cutadapt 1.1, I get output like this when using your
command line with the two '-a' adapters:
=== Adapter 1 ===
Adapter 'AATGATACGGCGACCACCGAGATCT', length 25, was trimmed 30712 times.
Lengths of removed sequences
length count expected
[...]
17 394 0.0
18 415 0.0
19 407 0.0
20 372 0.0
21 426 0.0
22 369 0.0
23 408 0.0
24 400 0.0
25 394 0.0
>=50 1164 0.0
Note how it now says "lengths of removed sequence" and not "Histogram of
adapter lengths", which is different from cutadapt 1.0 and more what one
expects. You'll see immediately that the 394 adapters of length 25 are
correctly found, but that also in 1164 cases, the adapter was found in the
beginning of the read, trimming the entire read of 50 bp. The 1164 is a bit too
high, but this is due to the error rate. If you reduce it to zero, the result
should be better.
(Note that the histogram is cut off at a certain point and for all lengths
after this point, only the cumulative counts are shown. While playing around
with your data, I realized that this is not helpful and I have now changed that
behavior in the Git version of cutadapt.)
I hope this fully explains the discrepancies you've been seing.
Original comment by marcel.m...@tu-dortmund.de
on 9 Jul 2012 at 10:18
Thank you, Marcel, for this clear explanation.
It also showed me why I saw the removal of an entire read in my result.
Just a side question, do you have any recommendation on using CutAdapt to
remove adapter sequences?
Obviously, based on my experiment, I will stick to -b option. If so, what is
the point to use -a/-g option?
In some sense, I think the -g/-a option is mis-leading, because it still
searches for the presence of adapter sequences "anywhere" in the read.
Previously, when I used -a option, I though it would search the adapter
sequences from the 3' end and stop when reaching the full length of the
adapter, which means, -a option would only match the expression pattern for the
25 or 33 bp at the 3' end. I don't expect it to extend the pattern match to 5'
end.
On the other hand, why the -a/-g option will treat the full length adapter
sequences differently than the part sequences? I mean, I don't think there is
significant difference between the 23-, 24-, and 25-bp of contamination.
However, only the 25-bp "contamination" results in over-trimming at 5' end,
which means reads like this "AATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNN"
got trimmed, but reads like this
"ATGATACGGCGACCACCGAGATCTNNNNNNNNNNNNNNNNNNNNNNN" was not trimmed. So how do
you define the pattern match for -a/-g option? Why searching for the 24-bp
contamination will not extend to 5' end?
Original comment by yingzi.z...@gmail.com
on 10 Jul 2012 at 2:39
> do you have any recommendation on using CutAdapt to remove adapter sequences?
It depends on the real data that you have and what you expect could happen. If
an adapter can occur only in the 5' end, then using -a makes sense: If it's in
the beginning of the read, then trimming the read to length zero is the right
thing to do. That could happen when the 5' and 3' adapters are attached to each
other without anything between them. Using -b in that setting would be
problematic as the sequence after the adapter could just be noise or another
adapter.
> In some sense, I think the -g/-a option is mis-leading, because it still
searches for the presence of adapter sequences "anywhere" in the read.
I had to name the options somehow -- it's just a mnemonic. The authoritative
description is in the README. For the -a adapter, it says explicitly that the
read will be empty if the adapter is found in the beginning.
> So how do you define the pattern match for -a/-g option? Why searching for
the 24-bp contamination will not extend to 5' end?
For -a, the start of the adapter must be within the read. There are two reasons.
First, the model behind this is that the sequence is "5' adapter - insert - 3'
adapter". The extreme case is that the insert is empty. Allowing to see a
proper suffix of the 3' adapter would be something else entirely: We then also
allow some kind of degradation of the 3' adapter. But then we'd also need to
allow that we see a degraded adapter when the insert is nonempty. This kind of
degradation may perhaps happen, I don't know. It could even be allowed in
cutadapt, but since the model is different, it would have to be implemented as
a different kind of adapter.
Second, there is a practical reason: If suffixes of a 3' adapter were allowed
to match in the beginning of the read, then short, random matches, in which
only the last few bases of the adapter match would lead to the read being
trimmed to length zero.
If you encounter real data in which the assumptions that cutadapt makes don't
fit the physical process, I'm eager to learn about that and change or extend
cutadapt to model that process (and I have done so in the past).
Original comment by marcel.m...@tu-dortmund.de
on 10 Jul 2012 at 3:39
I see.
Thanks again.
Original comment by yingzi.z...@gmail.com
on 10 Jul 2012 at 3:55
I hope it's ok to close the issue now. Please open a new one for other problems
or e-mail me.
Original comment by marcel.m...@tu-dortmund.de
on 10 Jul 2012 at 4:08
Original issue reported on code.google.com by
yingzi.z...@gmail.com
on 6 Jul 2012 at 5:23