EBI-COMMUNITY / ebi-parasite

GNU General Public License v3.0

Update quality-control.py to do quality control of the data #1

Open nimapak opened 7 years ago

nimapak commented 7 years ago

Please read that and implement it in quality-control.py. Basically, we should use trim_galore as the main tool for quality control. In this script the input is fastq files and the outputs are validated fastq files. Later on, more tools will be added to the script for quality-control checks. Things to do:

1) get_args(): First create a property file and provide the needed information in that file. The property file structure is as follows:
trim_galore:{directory of trim_galore}
workdir:{home working directory of the program}

Then update get_args() if you require any other parameters. Add another flag to specify which quality-control software needs to be used; the first example is trim_galore.

2) run_trim_galore(fastqfiles) shall get a list of fastq files in an array; it can contain one or two files.

3) initiate(): within the processing directory you need to create two directories: /quality/in/ to keep the input data, and /quality/out/ for the output data. You can uncomment the code, and you almost have everything in this section.

4) execute(): execute trim_galore

5) post_process(): The validated files need to be copied into /quality/out/. The validated files should have this naming structure: fqout1=fq1+"_val_1.fq" and fqout2=fq2+"_val_2.fq"

6) main(): runs everything.
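The steps above can be sketched roughly as follows. This is a minimal sketch: the property-file keys and the `_val_1.fq`/`_val_2.fq` naming come from this issue, but the helper names and function bodies are illustrative assumptions, not the committed script.

```python
import os

def parse_properties(path):
    """Read 'key:value' lines, e.g. trim_galore:{dir} and workdir:{dir}."""
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and ":" in line:
                key, value = line.split(":", 1)
                props[key.strip()] = value.strip()
    return props

def initiate(workdir):
    """Create quality/in and quality/out under the working directory."""
    indir = os.path.join(workdir, "quality", "in")
    outdir = os.path.join(workdir, "quality", "out")
    for d in (indir, outdir):
        os.makedirs(d, exist_ok=True)
    return indir, outdir

def trim_galore_command(trim_galore, fastqfiles):
    """Build the trim_galore command for one or two fastq files."""
    cmd = [trim_galore]
    if len(fastqfiles) == 2:
        cmd.append("--paired")  # two files -> paired-end mode
    return cmd + list(fastqfiles)

def validated_names(fq1, fq2):
    """Naming scheme from this issue: <input>_val_1.fq / <input>_val_2.fq."""
    return fq1 + "_val_1.fq", fq2 + "_val_2.fq"
```

execute() would then run the built command (e.g. via subprocess), and post_process() would copy the files named by validated_names() into quality/out.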

xinliu005 commented 7 years ago

A new flag, qc_software (optional), was added to the arguments, and trim_galore is set as the default if no qc_software is defined. The script will also check whether the chosen qc_software's path is defined in the property file; if not, it will remind the user and exit. The script was tested and successfully fulfilled the other requirements as well. I tried to attach the python script to the issue, but cannot see it, and I do not think I can commit the script under this project, so I will email you the script.
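A minimal sketch of that default-and-check behaviour (the function name and message wording are illustrative assumptions, not the emailed script):

```python
import sys

def resolve_qc_software(props, qc_software=None):
    """Default to trim_galore; exit with a reminder if the chosen
    tool's path is not defined in the property file."""
    tool = qc_software or "trim_galore"
    if tool not in props:
        sys.exit("Path for %s is not defined in the property file" % tool)
    return props[tool]
```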

xinliu005 commented 7 years ago

Great, I have successfully committed EBI-COMMUNITY/ebi-parasite/quality-control.py

xinliu005 commented 6 years ago

1) trim_galore does not remove duplicates from fastq files. The main functions it provides are: a) quality trimming, b) adapter trimming, c) removing short sequences. Reference: https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md#paired-end-specific-options

2) Clumpify removes duplicates from fastq files. Clumpify will remove all normal optical duplicates and all tile-edge duplicates, but it will only consider reads to be tile-edge duplicates if they are in adjacent tiles and share their Y-coordinate (within dupedist). Reference: https://www.biostars.org/p/225338/

I may further investigate Clumpify to see the difference before and after it is applied to fastq files, and we may need to add some software to quality-control.py to remove duplicates from fastq files.
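If Clumpify is added, quality-control.py could build its command line the same way it builds the trim_galore one. The dedupe/optical/dist=40 flags below are the ones used later in this thread; the wrapper function itself is an illustrative assumption:

```python
def clumpify_command(clumpify_sh, fq_in, fq_out, dist=40):
    """Build a clumpify.sh call that removes optical/tile-edge duplicates."""
    return [clumpify_sh,
            "in=" + fq_in,
            "out=" + fq_out,
            "dedupe",           # remove duplicate reads
            "optical",          # restrict to optical/tile-edge duplicates
            "dist=%d" % dist]   # max pixel distance for optical duplicates
```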

xinliu005 commented 6 years ago

** Did not see trim-galore remove duplicated reads: I manually created some duplicated reads in xin_made_fq1.txt and xin_made_fq2.txt, and then ran trim-galore; the duplicates still exist in the result fastq files. Command as follows:
quality-control.py -fq1 /nfs/production/seqdb/embl/developer/xin/new_eclipse_dir/working_dir/para_geno_anal/quality/in/xin_made_fq1.txt -fq2 /nfs/production/seqdb/embl/developer/xin/new_eclipse_dir/working_dir/para_geno_anal/quality/in/xin_made_fq2.txt

xinliu005 commented 6 years ago

** clumpify.sh removed my manually created duplicate (ST-E00129:428:HL2FKCCXX:5:1101:10510:2065):
/nfs/production/seqdb/embl/developer/xin/bin/bbmap/clumpify.sh in=/nfs/production/seqdb/embl/developer/xin/new_eclipse_dir/working_dir/para_geno_anal/quality/in/xin_made_fq1.txt out=xin_made_fq1.txt.out dedupe optical dist=40

@ST-E00129:428:HL2FKCCXX:5:1101:10500:2065 1:N:0:ATGTCA
ATTACTTATTTATTGAAAATGGCCAAAACTAAGATATAGATGAGAATCGTAGGAATTGACAATAAATTGTGAAGTATTGAGAAATAGAGAAATACATACCTAACTAACTCACGACCAAGTACGCCAACCTAACTAACTTTATACATTAAAT
+
AAFFFFKKKKAKKKKKKKKKKKKKKKKKKKKKKKKKKKFAKKKKKKKKKKKKKKKKKKKKKKKKKKKKAFKKKKKAKKKKFAKKKAKKKKKFKKKKK,FF7FKK,FKKAAFKKKKKKKKKAKFKKKKKKKKKKKKKK7AAFFKKKKFKKFF
@ST-E00129:428:HL2FKCCXX:5:1101:10510:2065 1:N:0:ATGTCA (duplicate made by me)
ATTACTTATTTATTGAAAATGGCCAAAACTAAGATATAGATGAGAATCGTAGGAATTGACAATAAATTGTGAAGTATTGAGAAATAGAGAAATACATACCTAACTAACTCACGACCAAGTACGCCAACCTAACTAACTTTATACATTAAAT
+
AAFFFFKKKKAKKKKKKKKKKKKKKKKKKKKKKKKKKKFAKKKKKKKKKKKKKKKKKKKKKKKKKKKKAFKKKKKAKKKKFAKKKAKKKKKFKKKKK,FF7FKK,FKKAAFKKKKKKKKKAKFKKKKKKKKKKKKKK7AAFFKKKKFKKFF

xinliu005 commented 6 years ago

The read duplicates include:
1) PCR duplicates: have the same start position on the reference; they do not need to be exactly the same sequences, and can be removed or marked after alignment.
2) Optical duplicates and tile-edge duplicates: created by the sequencing machine software by mistake. Optical duplicates are duplicates positioned extremely close together; tile-edge duplicates are duplicates at the edge of the tiles. They do not need to be exact sequence matches, and clients can choose how many base mismatches are permitted in the software 'Clumpify'. They can be removed before or after alignment.

Reasons for NOT removing duplicates:
1) Except for the 100% identical duplicates, for those duplicates with mismatches you cannot identify which one is the biological read, and you may remove it because it has a relatively lower quality score.
2) You cannot remove (mark) all duplicates: 100% identical duplicates can be easily detected ('Clumpify'), but finding all duplicates with mismatches is not easy. The reason behind deduplication by position on the flow cell is similar to deduplication by mapping: with enough mismatches, "duplicates" may map to different places or not map at all, and then they won't be detected.
3) Some papers reported that PCR duplicate removal has minimal effect on the accuracy of subsequent variant calls (Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches, BMC Bioinformatics 2016, 17(Suppl 7):239).

Reason for removing duplicates: the duplicates will induce a false read depth and bias the SNP calls.

My suggestions for the removal of duplicates: 1) In quality-control.py, add the option of using 'clumpify' to remove only the exact duplicates, without removing other duplicates. 2) In the following steps, there is no need to remove duplicates.
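To make "remove only the exact duplicates" concrete, a toy sketch of exact-sequence deduplication (this is an illustration only, not how clumpify works internally; clumpify also handles mismatches and flow-cell coordinates):

```python
def drop_exact_duplicates(reads):
    """Keep only the first occurrence of each 100%-identical sequence.

    reads: iterable of (name, sequence, quality) tuples.
    """
    seen = set()
    kept = []
    for name, seq, qual in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq, qual))
    return kept
```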

xinliu005 commented 6 years ago

Software that can do both fastq read QC and deduplication

The following two tools can do both, but they need an adapter input file to remove adapters.

1) fastq-mcf (provided by ea-utils). Functions:
a) Removes the adapter sequences from the fastq file(s).
b) Has the option '-D' to remove duplicate reads.
c) Filtering options: --[mate-]qual-mean NUM (minimum mean quality score); --[mate-]min-len NUM (minimum remaining length, same as -l).
But you need to build the adapter file yourself: https://www.neb.com/~/media/Catalog/All-Products/6B6FC6C03B274E7FA0FDBF13015AB194/Datacards%20or%20Manuals/manualE7500.pdf

2) FASTX-Toolkit. Available tools:
a) FASTQ-Quality-Filter: removes low-quality sequences from FASTQ files.
b) FASTX-Trimmer: low-quality read ends can be trimmed using a fixed-length trimmer. For example: fastx_trimmer -f 1 -l 80 -Q 33 -i in.fastq -o out.fastq
c) FASTX-Clipper: removes (clips) adapters and also needs an adapter input file. It also has an option "-l N" to discard sequences shorter than N nucleotides.
d) FASTX-Collapser: identical sequences are collapsed into a single sequence. The sequences are renamed with two numbers: a running number followed by how many times that sequence occurred.

3) prinseq: http://prinseq.sourceforge.net/manual.html PRINSEQ is a tool that generates summary statistics of sequence and quality data and that is used to filter, reformat and trim next-generation sequence data. It is particularly designed for 454/Roche data, but can also be used for other types of sequence data. It trims reads based on quality and length, and can remove duplicates, but does not trim adapters.