alexstaj / cutadapt

Automatically exported from code.google.com/p/cutadapt
0 stars 0 forks source link

Change between 1.0 and 1.2.1: Ungreedy adapter cutting #61

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
There is a "breaking" change in behaviour between version 1.0 and 1.2.1.

I have a multiplexed paired-end sequencing library with a lot of read-through. 
This causes the reverse complement of the 4-base multiplexing adapter to be 
found at the 3' end of my reads. Version 1.0 of cutadapt cut the last 4 bases, 
and the last 4 bases only, whereas version 1.2.1 trims anything beyond the 
first instance of the 4-mer, regardless of its position.

In my case, the behaviour from 1.0 is the correct one, and I haven't found any 
way to obtain it using version 1.2.1. I've reverted back to using version 1.0, 
but it would be nice to have an "ungreedy" option.

I've included the output from both versions below.

Using version 1.0, I get the following:

cutadapt version 1.0
Command line parameters: -f fastq -a TGAC -o temp_R2.fastq 
Output/Libraries/1c/a.R2.fastq
Maximum error rate: 10.00%
   Processed reads: 22408330
     Trimmed reads: 15739419 ( 70.2%)
   Too short reads: 0 (  0.0% of processed reads)
    Too long reads: 0 (  0.0% of processed reads)
        Total time:    833.72 s
     Time per read:      0.04 ms

=== Adapter 1 ===

Adapter 'TGAC', length 4, was trimmed 15739419 times.

Histogram of adapter lengths
length  count
3   243588
4   15495831

=============================================================

Using version 1.2.1, I get the following:

cutadapt version 1.2.1
Command line parameters: -f fastq -a TGAC -o temp_R2.fastq 
Output/Libraries/1c/a.R2.fastq
Maximum error rate: 10.00%
   No. of adapters: 1
   Processed reads:     22386941
   Processed bases:   1769910658 bp (1769.9 Mbp)
     Trimmed reads:     14939994 (66.7%)
     Trimmed bases:    293854472 bp (293.9 Mbp) (16.60% of total)
   Too short reads:            0 (0.0% of processed reads)
    Too long reads:            0 (0.0% of processed reads)
        Total time:    764.79 s
     Time per read:      0.03 ms

=== Adapter 1 ===

Adapter 'TGAC', length 4, was trimmed 14939994 times.

No. of allowed errors:
0-4 bp: 0

Lengths of removed sequences
length  count   expected    max. errors
3   223541  349796.0    0
4   9089680 87449.0 0
5   169840  87449.0 0
6   172085  87449.0 0
7   27329   87449.0 0
8   124616  87449.0 0
9   80709   87449.0 0
10  66715   87449.0 0
11  58734   87449.0 0
12  50558   87449.0 0
13  56900   87449.0 0
14  62381   87449.0 0
15  61055   87449.0 0
16  63647   87449.0 0
17  60773   87449.0 0
18  64370   87449.0 0
19  61723   87449.0 0
20  63820   87449.0 0
21  64054   87449.0 0
22  63128   87449.0 0
23  64856   87449.0 0
24  62749   87449.0 0
25  65185   87449.0 0
26  65380   87449.0 0
27  65299   87449.0 0
28  64298   87449.0 0
29  65294   87449.0 0
30  66493   87449.0 0
31  64696   87449.0 0
32  66269   87449.0 0
33  66538   87449.0 0
34  65602   87449.0 0
35  67718   87449.0 0
36  68825   87449.0 0
37  66012   87449.0 0
38  69049   87449.0 0
39  73718   87449.0 0
40  67470   87449.0 0
41  66956   87449.0 0
42  68519   87449.0 0
43  68307   87449.0 0
44  67932   87449.0 0
45  69064   87449.0 0
46  65439   87449.0 0
47  66787   87449.0 0
48  66954   87449.0 0
49  65251   87449.0 0
50  67843   87449.0 0
51  66406   87449.0 0
52  65572   87449.0 0
53  64265   87449.0 0
54  63502   87449.0 0
55  63246   87449.0 0
56  64259   87449.0 0
57  61615   87449.0 0
58  60848   87449.0 0
59  60938   87449.0 0
60  60028   87449.0 0
61  59425   87449.0 0
62  58737   87449.0 0
63  58269   87449.0 0
64  57147   87449.0 0
65  56649   87449.0 0
66  55557   87449.0 0
67  55109   87449.0 0
68  54932   87449.0 0
69  53553   87449.0 0
70  52342   87449.0 0
71  52250   87449.0 0
72  50597   87449.0 0
73  49586   87449.0 0
74  49801   87449.0 0
75  48760   87449.0 0
76  47626   87449.0 0
77  46702   87449.0 0
78  45693   87449.0 0
79  44814   87449.0 0
80  44731   87449.0 0
81  44426   87449.0 0
82  42597   87449.0 0
83  42455   87449.0 0
84  40958   87449.0 0
85  40377   87449.0 0
86  40518   87449.0 0
87  38920   87449.0 0
88  38541   87449.0 0
89  37855   87449.0 0
90  35991   87449.0 0
91  36020   87449.0 0
92  35596   87449.0 0
93  33262   87449.0 0
94  29560   87449.0 0
95  33383   87449.0 0
96  33456   87449.0 0
97  24911   87449.0 0
98  37085   87449.0 0
99  45855   87449.0 0
100 1108    87449.0 0

Original issue reported on code.google.com by ericfour...@gmail.com on 3 May 2013 at 3:07

GoogleCodeExporter commented 9 years ago
That’s interesting – when I changed that, I thought that would be the 
correct behavior. In your case, of course, the problem is that the sequence is 
extremely short and therefore is found due to chance quite often.

I won’t have time to look into this within the next weeks, but perhaps the 
following works in your case: You can try to append a character to your adapter 
sequence that does not occur in your reads, such as an “X”. That is, simply 
use the adapter sequence “TGACX”. Cutadapt won’t find the sequence 
anywhere within the read, only at the end, where the X is just ignored. I hope 
that gives you the same results as before.

Original comment by marcel.m...@tu-dortmund.de on 3 May 2013 at 4:00

GoogleCodeExporter commented 9 years ago
I agree that in most cases, the new behaviour will be the correct one, so no 
fault on your part there. Truth be told, my use-case is so simple that sed 
might just do the trick.

I was mostly reporting the issue because it changes existing behaviour in a way 
that I hadn't seen documented in the change-log. I'm currently rerunning an 
analysis pipeline I built a few months back, and I updated my versions of 
cutadapt/bowtie/tophat/etc before doing so. However the results were off, and 
it took me a bit of time to figure out it was because of this change in 
cutadapt. No big loss on my part (I'll just run it again), but it would have 
been nice if the new behaviour had been implemented using a new switch rather 
than -a, or if it would have been in the change-log. 

Changing the switch now would probably be a bad idea (There's certainly a 
script out there that relies on the new behaviour, so it's a catch-22), but a 
note in the change-log would be nice for future users!

Anyhow, this is a very minor gripe in an otherwise great piece of software. 
Thank you very much for your efforts in writing it and making it available!

Best regards,
-Eric Fournier

Original comment by ericfour...@gmail.com on 3 May 2013 at 4:25

GoogleCodeExporter commented 9 years ago
Ok, so this was more of a „please document your changes better“ rather than 
a „I cannot use cutadapt anymore because you broke it“ request.

Sorry about the time it took you to find the problem, I know how annoying it is 
when tools do that.

In my defense, I did document the change:
“Improved the alignment algorithm for better poly-A trimming when there are 
sequencing errors. Previously, not the longest possible poly-A tail would be 
trimmed.”
Admittedly, this does not include the information that trimming behavior in 
general has changed. I simply didn’t foresee that this would be relevant for 
non-poly-A trimming.

I try to avoid making backwards incompatible changes and I’ll refrain even 
more from doing so in the future. However, I considered the old behavior to be 
a bug and I want the default to be as useful as possible, which is why I 
changed the behavior of the -a parameter.

I’ll leave the report open for a while until I have some time to solve this 
in a more general way, perhaps by adding a few more automatic tests.

Original comment by marcel.m...@tu-dortmund.de on 3 May 2013 at 6:04

GoogleCodeExporter commented 9 years ago
I'd like to second the utility for a lazy/ungreedy option. This would be 
useful, for example, in iteratively trimming short repeats of variable number 
and length. 

Original comment by david.ko...@gmail.com on 7 Jul 2013 at 5:41

GoogleCodeExporter commented 9 years ago
I’m just going through old, unresolved issues. David, Eric, in case you’re 
still interested in this, I wonder whether the workaround I mentioned above is 
sufficient. The idea was to add some 'X' characters to the end of the adapter 
sequence. The effect is that the adapter is only found if it’s at the end of 
the read, but not within the read (where the 'X' characters cause mismatches).

Original comment by marcel.m...@tu-dortmund.de on 5 Nov 2014 at 1:39

GoogleCodeExporter commented 9 years ago
Hi! Sorry for the delay in posting a reply, as I don't check gmail very often 
and that's where my google code notifications get sent. The proposed workaround 
seems perfectly fine with me. Thanks for following up!

Original comment by ericfour...@gmail.com on 15 Dec 2014 at 6:25

GoogleCodeExporter commented 9 years ago
Thanks for getting back! I’ll close this issue then. I’ll put the advice 
about using the ‘X’ as part of the adapter sequence into the documentation.

Original comment by marcel.m...@tu-dortmund.de on 16 Dec 2014 at 1:25