FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

The report is ok but fastqc result shows trimming process hasn't been done #113

Closed xjs1996 closed 3 years ago

xjs1996 commented 3 years ago

Hi, I used trim_galore for quality control and trimming of my data, with the following command:

trim_galore --paired --length 20 --fastqc SHEARY-3.R1.fastq.gz SHEARY-3.R2.fastq.gz --output_dir ./cleanFastq

The report looks fine:

SUMMARISING RUN PARAMETERS

Input filename: SHEARY-3.R1.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.4_dev
Cutadapt version: 2.8
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 194465). Second best hit was Illumina (count: 20)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

This is cutadapt 2.8 with Python 3.6.7
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA SHEARY-3.R1.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 91.87 s (30 us/read; 1.97 M reads/minute).

=== Summary ===

Total reads processed: 3,016,011
Reads with adapters: 1,316,571 (43.7%)
Reads written (passing filters): 3,016,011 (100.0%)

Total basepairs processed: 452,401,650 bp
Quality-trimmed: 1,794,980 bp (0.4%)
Total written (filtered): 415,282,037 bp (91.8%)

=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 1316571 times; Reverse-complemented: 0 times

No. of allowed errors: 0-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters: A: 17.4% C: 32.9% G: 28.2% T: 21.5% none/other: 0.0%

Overview of removed sequences
length count expect max.err error counts
1 461895 754002.8 0 461895
2 121361 188500.7 0 121361
3 45890 47125.2 0 45890
4 15999 11781.3 0 15999
5 9551 2945.3 0 9551
6 8644 736.3 0 8644
7 7689 184.1 0 7689
8 7956 46.0 0 7956
9 7284 11.5 0 7249 35
10 7944 2.9 1 7741 203
11 7681 0.7 1 7557 124
12 8045 0.2 1 7832 213
13 7502 0.2 1 7396 106
14 7960 0.2 1 7768 192
15 7953 0.2 1 7800 153
16 8364 0.2 1 8076 288
17 7267 0.2 1 7155 112
18 7259 0.2 1 7123 136
19 7526 0.2 1 7396 130
20 7548 0.2 1 7370 178
21 7497 0.2 1 7323 174
22 7444 0.2 1 7325 119
23 7901 0.2 1 7654 247
24 7615 0.2 1 7468 147
25 7759 0.2 1 7576 183
26 7615 0.2 1 7434 181
27 7744 0.2 1 7541 203
28 7159 0.2 1 6962 197
29 7655 0.2 1 7370 285
30 7379 0.2 1 7255 124
31 7696 0.2 1 7484 212
32 7018 0.2 1 6887 131
33 7573 0.2 1 7353 220
34 7237 0.2 1 7072 165
35 7549 0.2 1 7240 309
36 6951 0.2 1 6819 132
37 7244 0.2 1 7058 186
38 7133 0.2 1 6998 135
39 7436 0.2 1 7199 237
40 7452 0.2 1 7161 291
41 7293 0.2 1 7130 163
42 6847 0.2 1 6686 161
43 7996 0.2 1 7700 296
44 7010 0.2 1 6809 201
45 12115 0.2 1 11787 328
46 3169 0.2 1 3009 160
47 7212 0.2 1 6981 231
48 12088 0.2 1 11750 338
49 7276 0.2 1 7124 152
50 5121 0.2 1 4957 164
51 12516 0.2 1 12286 230
52 5487 0.2 1 5363 124
53 4005 0.2 1 3905 100
54 8058 0.2 1 7891 167
55 11592 0.2 1 11393 199
56 7234 0.2 1 7121 113
57 7966 0.2 1 7844 122
58 4982 0.2 1 4878 104
59 11253 0.2 1 11077 176
60 2564 0.2 1 2491 73
61 3430 0.2 1 3376 54
62 9016 0.2 1 8855 161
63 6132 0.2 1 6038 94
64 2790 0.2 1 2742 48
65 6013 0.2 1 5907 106
66 10703 0.2 1 10554 149
67 2921 0.2 1 2820 101
68 6860 0.2 1 6737 123
69 8041 0.2 1 7851 190
70 12342 0.2 1 12082 260
71 588 0.2 1 534 54
72 386 0.2 1 347 39
73 2103 0.2 1 2039 64
74 3872 0.2 1 3792 80
75 4894 0.2 1 4770 124
76 5285 0.2 1 5153 132
77 5658 0.2 1 5502 156
78 5510 0.2 1 5389 121
79 5525 0.2 1 5366 159
80 5300 0.2 1 5178 122
81 5508 0.2 1 5335 173
82 5365 0.2 1 5190 175
83 5364 0.2 1 5116 248
84 5082 0.2 1 4720 362
85 5290 0.2 1 4841 449
86 5203 0.2 1 4792 411
87 5261 0.2 1 4865 396
88 4796 0.2 1 4516 280
89 5109 0.2 1 4825 284
90 4750 0.2 1 4547 203
91 4656 0.2 1 4458 198
92 4222 0.2 1 4089 133
93 4330 0.2 1 4163 167
94 3846 0.2 1 3768 78
95 4024 0.2 1 3877 147
96 3790 0.2 1 3663 127
97 3707 0.2 1 3548 159
98 3629 0.2 1 3439 190
99 3641 0.2 1 3416 225
100 3633 0.2 1 3383 250
101 3782 0.2 1 3523 259
102 3377 0.2 1 3161 216
103 3439 0.2 1 3211 228
104 3036 0.2 1 2889 147
105 2942 0.2 1 2811 131
106 2594 0.2 1 2494 100
107 2742 0.2 1 2618 124
108 2415 0.2 1 2304 111
109 2477 0.2 1 2355 122
110 2221 0.2 1 2064 157
111 2534 0.2 1 2365 169
112 2187 0.2 1 2018 169
113 1990 0.2 1 1828 162
114 1620 0.2 1 1461 159
115 1235 0.2 1 1049 186
116 1057 0.2 1 863 194
117 1074 0.2 1 801 273
118 901 0.2 1 628 273
119 994 0.2 1 649 345
120 740 0.2 1 478 262
121 574 0.2 1 362 212
122 440 0.2 1 291 149
123 296 0.2 1 199 97
124 158 0.2 1 103 55
125 162 0.2 1 102 60
126 69 0.2 1 37 32
127 53 0.2 1 46 7
128 63 0.2 1 44 19
129 31 0.2 1 28 3
130 23 0.2 1 23
131 42 0.2 1 25 17
132 98 0.2 1 98
133 55 0.2 1 52 3
134 30 0.2 1 23 7
135 39 0.2 1 12 27
136 19 0.2 1 5 14
137 5 0.2 1 3 2
138 8 0.2 1 6 2
139 13 0.2 1 4 9
140 7 0.2 1 2 5
141 22 0.2 1 17 5
142 11 0.2 1 2 9
143 13 0.2 1 11 2
144 31 0.2 1 20 11
145 20 0.2 1 20
146 27 0.2 1 14 13
147 42 0.2 1 42
148 94 0.2 1 92 2
149 24 0.2 1 10 14
150 11 0.2 1 10 1

RUN STATISTICS FOR INPUT FILE: SHEARY-3.R1.fastq.gz

3016011 sequences processed in total

SUMMARISING RUN PARAMETERS

Input filename: SHEARY-3.R2.fastq.gz
Trimming mode: paired-end
Trim Galore version: 0.6.4_dev
Cutadapt version: 2.8
Number of cores used for trimming: 1
Quality Phred score cutoff: 20
Quality encoding type selected: ASCII+33
Using Nextera adapter for trimming (count: 194465). Second best hit was Illumina (count: 20)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)
Maximum trimming error rate: 0.1 (default)
Minimum required adapter overlap (stringency): 1 bp
Minimum required sequence length for both reads before a sequence pair gets removed: 20 bp
Output file will be GZIP compressed

This is cutadapt 2.8 with Python 3.6.7
Command line parameters: -j 1 -e 0.1 -q 20 -O 1 -a CTGTCTCTTATA SHEARY-3.R2.fastq.gz
Processing reads on 1 core in single-end mode ...
Finished in 92.22 s (31 us/read; 1.96 M reads/minute).

=== Summary ===

Total reads processed: 3,016,011
Reads with adapters: 1,309,167 (43.4%)
Reads written (passing filters): 3,016,011 (100.0%)

Total basepairs processed: 452,401,650 bp
Quality-trimmed: 1,531,775 bp (0.3%)
Total written (filtered): 416,034,115 bp (92.0%)

=== Adapter 1 ===

Sequence: CTGTCTCTTATA; Type: regular 3'; Length: 12; Trimmed: 1309167 times; Reverse-complemented: 0 times

No. of allowed errors: 0-9 bp: 0; 10-12 bp: 1

Bases preceding removed adapters: A: 17.6% C: 33.0% G: 27.8% T: 21.6% none/other: 0.0%

Overview of removed sequences
length count expect max.err error counts
1 462636 754002.8 0 462636
2 121968 188500.7 0 121968
3 45102 47125.2 0 45102
4 15832 11781.3 0 15832
5 9539 2945.3 0 9539
6 8551 736.3 0 8551
7 7659 184.1 0 7659
8 7924 46.0 0 7924
9 7438 11.5 0 7382 56
10 7800 2.9 1 7570 230
11 7704 0.7 1 7554 150
12 7974 0.2 1 7779 195
13 7450 0.2 1 7325 125
14 7866 0.2 1 7660 206
15 7902 0.2 1 7776 126
16 8028 0.2 1 7799 229
17 7436 0.2 1 7318 118
18 7222 0.2 1 7100 122
19 7667 0.2 1 7510 157
20 7370 0.2 1 7246 124
21 7440 0.2 1 7266 174
22 8029 0.2 1 7872 157
23 7165 0.2 1 7050 115
24 7781 0.2 1 7598 183
25 7751 0.2 1 7579 172
26 7629 0.2 1 7496 133
27 7619 0.2 1 7498 121
28 7135 0.2 1 7007 128
29 7132 0.2 1 6991 141
30 7605 0.2 1 7450 155
31 7710 0.2 1 7534 176
32 7117 0.2 1 6967 150
33 7659 0.2 1 7483 176
34 7061 0.2 1 6930 131
35 7331 0.2 1 7195 136
36 7200 0.2 1 7024 176
37 7298 0.2 1 7194 104
38 6964 0.2 1 6831 133
39 7389 0.2 1 7254 135
40 6874 0.2 1 6762 112
41 7132 0.2 1 7016 116
42 7139 0.2 1 6996 143
43 7170 0.2 1 7069 101
44 6864 0.2 1 6745 119
45 7273 0.2 1 7107 166
46 7117 0.2 1 6969 148
47 6855 0.2 1 6740 115
48 7323 0.2 1 7168 155
49 6968 0.2 1 6805 163
50 6769 0.2 1 6679 90
51 7146 0.2 1 7007 139
52 7072 0.2 1 6890 182
53 6580 0.2 1 6470 110
54 6467 0.2 1 6367 100
55 6670 0.2 1 6552 118
56 6777 0.2 1 6657 120
57 7114 0.2 1 6911 203
58 6229 0.2 1 6133 96
59 6907 0.2 1 6791 116
60 6635 0.2 1 6483 152
61 6623 0.2 1 6465 158
62 6306 0.2 1 6228 78
63 7635 0.2 1 7471 164
64 5881 0.2 1 5806 75
65 6001 0.2 1 5906 95
66 5637 0.2 1 5558 79
67 6602 0.2 1 6511 91
68 6913 0.2 1 6772 141
69 6271 0.2 1 6172 99
70 6163 0.2 1 6068 95
71 6512 0.2 1 6381 131
72 6017 0.2 1 5913 104
73 9097 0.2 1 8920 177
74 7988 0.2 1 7844 144
75 9836 0.2 1 9654 182
76 3514 0.2 1 3408 106
77 3991 0.2 1 3909 82
78 4586 0.2 1 4500 86
79 4848 0.2 1 4762 86
80 4792 0.2 1 4680 112
81 4867 0.2 1 4745 122
82 4862 0.2 1 4742 120
83 4964 0.2 1 4807 157
84 4498 0.2 1 4319 179
85 4776 0.2 1 4464 312
86 4698 0.2 1 4429 269
87 4890 0.2 1 4591 299
88 4432 0.2 1 4223 209
89 4843 0.2 1 4654 189
90 4656 0.2 1 4529 127
91 4484 0.2 1 4314 170
92 4188 0.2 1 4071 117
93 4292 0.2 1 4169 123
94 3677 0.2 1 3604 73
95 3940 0.2 1 3836 104
96 3685 0.2 1 3609 76
97 3551 0.2 1 3438 113
98 3554 0.2 1 3449 105
99 3484 0.2 1 3352 132
100 3504 0.2 1 3372 132
101 3529 0.2 1 3386 143
102 3180 0.2 1 3053 127
103 3192 0.2 1 3052 140
104 2832 0.2 1 2732 100
105 2862 0.2 1 2772 90
106 2527 0.2 1 2462 65
107 2716 0.2 1 2644 72
108 2393 0.2 1 2354 39
109 2310 0.2 1 2234 76
110 2076 0.2 1 2027 49
111 2324 0.2 1 2234 90
112 2007 0.2 1 1930 77
113 1783 0.2 1 1728 55
114 1430 0.2 1 1340 90
115 1004 0.2 1 931 73
116 836 0.2 1 767 69
117 729 0.2 1 635 94
118 580 0.2 1 470 110
119 579 0.2 1 454 125
120 400 0.2 1 290 110
121 339 0.2 1 237 102
122 209 0.2 1 150 59
123 173 0.2 1 122 51
124 112 0.2 1 42 70
125 107 0.2 1 76 31
126 47 0.2 1 27 20
127 45 0.2 1 30 15
128 41 0.2 1 38 3
129 48 0.2 1 20 28
130 14 0.2 1 9 5
131 24 0.2 1 23 1
132 14 0.2 1 8 6
133 40 0.2 1 33 7
134 50 0.2 1 12 38
135 52 0.2 1 13 39
136 18 0.2 1 8 10
137 3 0.2 1 3
138 8 0.2 1 5 3
139 15 0.2 1 4 11
140 7 0.2 1 2 5
141 19 0.2 1 15 4
142 12 0.2 1 2 10
143 11 0.2 1 11
144 27 0.2 1 20 7
145 20 0.2 1 20
146 19 0.2 1 15 4
147 47 0.2 1 47
148 108 0.2 1 106 2
149 10 0.2 1 10
150 17 0.2 1 17

RUN STATISTICS FOR INPUT FILE: SHEARY-3.R2.fastq.gz

3016011 sequences processed in total

Total number of sequences analysed for the sequence pair length validation: 3016011

Number of sequence pairs removed because at least one read was shorter than the length cutoff (20 bp): 1116 (0.04%)

But the FastQC result for SHEARY-3.R1_val_1_fastqc.zip shows that the adapters were not removed:

[image]

Do you have any idea why I got these conflicting results?

FelixKrueger commented 3 years ago

I don't think that the results are conflicting: Trim Galore auto-detected the Nextera transposase adapter in your data (sequence is CTGTCTCTTATA), and removed it successfully from both R1 and R2. The way to check this is to look at the Adapter Content module in FastQC, where you should see a line for the Nextera adapter in both reads of the untrimmed sample; this line will be gone in the trimmed samples.

What you are looking at are over-represented sequences. While some of these sequences are present in your library and are probably not desired, they are considered 'contaminants' but are not being targeted as read-through adapter contamination. If you look at the sequences in the plot, they are all different from the Nextera adapter sequence, which also explains why they have not been removed. Not sure how you got sequences into your library that look like TruSeq adapter (contamination?), but for sequences without hits it is more likely that they are just over-represented in your sample for other reasons. The first sequence cttatacacatctccgagccc... for example appears to be from the E. coli phage Lambda.

In conclusion, I would just say that the trimming run has worked fine, and you should be able to move on (even though this depends a little on your application). If your next step is an alignment of some sort, contaminating Lambda or TruSeq sequences will simply not align, and are therefore purged from your library naturally. Because of these contaminants you will however see a somewhat reduced mapping efficiency, but looking at the percentages of the contaminations it doesn't seem to be the end of the world. Good luck!
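As a side note on reading the Cutadapt report above: the "expect" column appears to be the number of reads that would match the first few adapter bases purely by chance. The sketch below reproduces those values under the assumption of uniform base composition (probability 0.25 per base) and a cap at the adapter length; the function name is hypothetical and this is not Cutadapt's own code.

```python
def expected_random_matches(n_reads, match_length, adapter_length):
    """Expected number of chance matches of a given length.

    Assumes each base matches with probability 0.25 and that matches
    longer than the adapter are no less likely than a full-length match.
    """
    effective_length = min(match_length, adapter_length)
    return n_reads * 0.25 ** effective_length


if __name__ == "__main__":
    n_reads = 3_016_011      # "Total reads processed" in the report
    adapter_length = 12      # CTGTCTCTTATA
    # A few "expect" values copied from the report for comparison.
    reported = {1: 754002.8, 2: 188500.7, 3: 47125.2, 9: 11.5, 12: 0.2}
    for length, value in reported.items():
        estimate = expected_random_matches(n_reads, length, adapter_length)
        print(length, round(estimate, 1), value)
```

Read this way, the 461,895 one-base removals in R1 are below the roughly 754,000 expected by chance, whereas removals of 10 bp or more occur thousands of times against an expectation of fewer than 3. That is consistent with genuine Nextera read-through having been trimmed.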

xjs1996 commented 3 years ago

Thank you, Felix. Do you mean that trim_galore cannot trim the TruSeq adapter with default parameters? Before trimming, the FastQC result is this: [image] and after the process, the FastQC result is this: [image]

The overrepresented sequences are nearly the same; only the percentages changed. Should I give trim_galore some other information, such as adapter sequences?

FelixKrueger commented 3 years ago

Trim Galore only trims one adapter at a time, and you don't seem to have used TruSeq adapters, but Nextera adapters instead:

Using Nextera adapter for trimming (count: 194465). Second best hit was Illumina (count: 20)
Adapter sequence: 'CTGTCTCTTATA' (Nextera Transposase sequence; auto-detected)

During the auto-detection, Trim Galore found 194,465 sequences containing the Nextera adapter (out of 1,000,000), but only 20 (out of 1,000,000) had traces of the Illumina TruSeq adapter. Consequently, only the Nextera adapter was trimmed.
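To make the auto-detection idea concrete, here is a rough Python sketch that counts how many of the first reads contain each candidate adapter. It is an approximation for illustration, not Trim Galore's actual implementation: the function names are hypothetical, exact substring matching is an assumption, and the Illumina and small RNA sequences are the ones I recall from the Trim Galore documentation, so check them against your installed version.

```python
import gzip
from itertools import islice

# Candidate adapter prefixes. The Nextera sequence is taken from the report
# above; the other two are assumed from the Trim Galore documentation.
ADAPTERS = {
    "Illumina": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATA",
    "smallRNA": "TGGAATTCTCGG",
}


def fastq_sequences(path):
    """Yield the sequence line of every record in a (gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each 4-line record
                yield line.strip()


def count_adapter_hits(sequences, adapters=ADAPTERS, max_reads=1_000_000):
    """Count reads (among the first max_reads) that contain each adapter."""
    counts = {name: 0 for name in adapters}
    for sequence in islice(sequences, max_reads):
        sequence = sequence.upper()
        for name, adapter in adapters.items():
            if adapter in sequence:
                counts[name] += 1
    return counts


def best_adapter(counts):
    """Return the adapter name with the highest count."""
    return max(counts, key=counts.get)
```

A call such as `count_adapter_hits(fastq_sequences("SHEARY-3.R1.fastq.gz"))` would return one count per adapter, and the adapter with the highest count is the one that gets trimmed.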

Also, if we are speaking of adapter contamination, you need to look at the FastQC plot called Adapter Content, as this is the read-through adapter contamination that prevents sequences from aligning. If you compare the plots from before and after trimming, you should see a solid contamination of Nextera sequence (before) which should then be gone (after). I doubt you will see Illumina Universal adapter (the red line) in your plots, as it is not the adapter you used.
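If you prefer numbers to plots, the same comparison can be made from the fastqc_data.txt file inside each FastQC zip. The sketch below is a minimal example with hypothetical function names; it assumes the usual tab-separated layout in which the module starts with a ">>Adapter Content" line, is followed by a "#Position" header row, and ends with ">>END_MODULE".

```python
import zipfile


def read_fastqc_data(zip_path):
    """Return the text of fastqc_data.txt from a FastQC zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        name = next(n for n in archive.namelist() if n.endswith("fastqc_data.txt"))
        return archive.read(name).decode("utf-8")


def max_adapter_content(fastqc_text):
    """Return the peak percentage per adapter from the Adapter Content module."""
    in_module = False
    header = []
    peaks = {}
    for line in fastqc_text.splitlines():
        if line.startswith(">>Adapter Content"):
            in_module = True
            continue
        if not in_module:
            continue
        if line.startswith(">>END_MODULE"):
            break
        fields = line.split("\t")
        if line.startswith("#"):
            # Header row: first column is the position, the rest are adapters.
            header = fields[1:]
            peaks = {name: 0.0 for name in header}
            continue
        for name, value in zip(header, fields[1:]):
            peaks[name] = max(peaks[name], float(value))
    return peaks
```

Running `max_adapter_content(read_fastqc_data(...))` on the FastQC zip of the untrimmed file and again on the `_val_1` zip should show the Nextera value dropping to around zero after trimming.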

Looking at the Overrepresented sequences gives you an idea about over-represented sequences, or contaminants in your library, but not about read-through adapter contamination. Even though you have (few) sequences in there resembling a TruSeq adapter, it is more likely that this is an adapter dimer (rather than a read-through adapter contamination which Trim Galore is supposed to detect and remove). Similarly, you seem to have contaminations of Lambda DNA (spike-in?) and RNA PCR primers in the library, but also those are contaminants in your library, rather than adapter contaminants. Does that make sense?

All in all, I think a default Trim Galore run did a fine job in removing Nextera contamination, and you should be fine to proceed with this sample.

xjs1996 commented 3 years ago

Understood! Thank you so much~