NCI-CGR / IlluminaSequencingAnalysis

All Illumina sequencing related projects from Xin will be recorded in this repo

COVID Project: Multiple updates for the code based on bad report #42

Open lxwgcool opened 3 years ago

lxwgcool commented 3 years ago

After Kristie checked the three types of reports, we found some bugs.

Report folder
T:\DCEG\Projects\Exome\SequencingData\DAATeam\Xin\Project\COVID_WGS\10_05_2021\low_input_01_10_keytable

We have fixed all of these bugs and also added some new features to the code. For details, please check the comments below.

lxwgcool commented 3 years ago

Details

1: generate_coverage_report_single.sh

(1) Add the case of capture kit = "WGS".
(2) Print more info in the log file to make it easier to read.
(3) Add multi-thread support for samtools.
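As a rough illustration of item (3), the sketch below builds a samtools command line that requests extra worker threads through the `-@` option. The choice of `samtools view -c`, the helper name `samtools_view_count_cmd`, and the file name `sample.bam` are assumptions for illustration only; the actual subcommand used in generate_coverage_report_single.sh is not shown in this issue.

```python
import shlex


def samtools_view_count_cmd(bam_path, threads=8):
    """Build a `samtools view -c` command that uses extra worker threads.

    Hypothetical helper: the real script may call a different subcommand.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    # -@ sets the number of additional threads; -c counts alignments.
    return ["samtools", "view", "-@", str(threads), "-c", bam_path]


cmd = samtools_view_count_cmd("sample.bam", threads=8)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.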

2: global_config_bash.rc

(1) Load samtools 1.13 instead of 1.8 to support multiple threads.
(2) Add "WGS_BED" and "WGS_TOTAL_BASES"
    a) Homo_sapiens_assembly38.bed

3: pre_calling_wgsqc_single.sh

(1) Move the samtools call inside the GATK if section, since samtools needs to be called only when GATK needs to be called.
(2) Change the way BASES_Q_AVE is calculated (use the same method as in the WES case).
(3) Change the way "MEAN_INSERT_SIZES" is collected
    a) use column 6 instead of column 5
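The column shift in item (3) is the kind of bug that a name-based lookup can avoid. The sketch below is not the fix applied in pre_calling_wgsqc_single.sh (which switches to column 6); it is an alternative that reads a value by its header name. It assumes the metrics table is tab-delimited with a header row, and the placeholder column names `F1` to `F5` and the helper `read_metric` are hypothetical.

```python
import csv


def read_metric(lines, name):
    """Return column `name` from the first data row of a tab-delimited table.

    Blank lines and lines starting with '#' are skipped before parsing.
    """
    rows = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(rows, delimiter="\t")
    first = next(reader)
    return float(first[name])


# Hypothetical metrics text: the wanted value sits in the 6th column.
sample = (
    "## METRICS\n"
    "F1\tF2\tF3\tF4\tF5\tMEAN_INSERT_SIZE\n"
    "350\t340\t60\t2\t1000\t355.5\n"
)
print(read_metric(sample.splitlines(), "MEAN_INSERT_SIZE"))  # 355.5
```

Looking up the field by name keeps the report correct if the tool producing the table adds or reorders columns again.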

4: step5_2_generate_coverage_report_batch.sh

(1) Send keytable name as an argument
(2) Append keytable name to coverage report

5: step5_generate_coverage_report_batch.sh

(1) Get input arguments at the beginning
(2) Append keytable name to the log file dir (coverage)
(3) Correct the caption fields of the coverage report
(4) Use 8 cores to submit jobs

6: step6_2_generating_pre_calling_qc_report_batch.sh

(1) Send keytable name as an argument
(2) Append keytable name to coverage report

7: step6_generate_pre_calling_qc_report_batch.sh

(1) Get input arguments at the beginning
(2) Append keytable name to the log file dir (pre-calling)
(3) Correct the caption fields of the pre-calling QC report
(4) Use 8 cores to submit jobs

8: step7b_take_incoming_bams.sh

(1) Load samtools 1.13 instead of 1.8 to support multiple threads.

9: AutoFramework.py

(1) Print more info in the log file of "MergeBAM"
(2) Change the way to call "CoverageReport"
    1) Additional argument
(3) Change the way to call "PreCallingQCReport"
    1) Additional argument
(4) Change the way to find the pre-calling QC report
    1) Use the pattern "strKTName" (keytable name)
(5) Correct the printing of the report location

10: MergeSubject.py

(1) Add "-L" to find soft links
(2) Set strASSAYID by using "EZ_WGS_PE"
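Assuming the "-L" in item (1) refers to the option of the `find` command that follows symbolic links, a pure-Python analogue is `os.walk` with `followlinks=True`. The sketch below is illustrative only; the helper name `find_bams` and the `.bam` filter are assumptions, not code taken from MergeSubject.py.

```python
import os


def find_bams(root):
    """Yield .bam paths under root, descending into symlinked directories."""
    # followlinks=True makes os.walk enter directories reached via soft links,
    # similar in spirit to `find -L`. Beware of link cycles in real trees.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in sorted(filenames):
            if name.endswith(".bam"):
                yield os.path.join(dirpath, name)
```

Without `followlinks=True`, files that live behind a soft-linked directory would be skipped silently, which matches the symptom of missing inputs.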

lxwgcool commented 3 years ago

The bad report issues (discussed in the team, recorded here)

For coverage report

1: The coverage report seems to be utilizing the exome+UTR capture kit bed instead of a whole genome one (based on the number of capture kit bases covered).

I have changed the code to use “Homo_sapiens_assembly38.bed” for the calculation. As a result, 2 different bed files will be used for our COVID project:

CDS Reference: /data/COVID_WGS/lix33/DCEG/CGF/Bioinformatics/Production/data/CDS/v38/BedFileForRef38_CCDS.MergedOverlap.Brief.bed

Normal Capture kit: /data/COVID_WGS/lix33/DCEG/CGF/Bioinformatics/Production/data/ref38/Homo_sapiens_assembly38.bed
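Since the number of capture kit bases covered depends on which bed file is used, it can help to check how many bases a bed file spans. The sketch below sums interval lengths, relying on the BED convention that start is 0-based and end is exclusive. It is only one plausible way a value such as WGS_TOTAL_BASES could be derived; the helper name `bed_total_bases` is hypothetical, and the function assumes the intervals do not overlap.

```python
def bed_total_bases(lines):
    """Sum interval lengths (end - start) over BED records.

    Assumes non-overlapping intervals; overlaps would be counted twice.
    """
    total = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        total += end - start
    return total


example = ["chr1\t0\t100", "chr1\t200\t250"]
print(bed_total_bases(example))  # 150
```

To run it on a real file, pass an open file handle, for example `bed_total_bases(open(path))`.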

2: Also, can you explain how the %Merge Dup and % Merge Optical Dup columns are calculated?

Please ignore these columns. I just checked the code, and the original code is out of date: the matrix and caption fields are inconsistent. %Merge Dup and % Merge Optical Dup are never calculated.

I have updated the caption fields.

3: Is the “CaptureKit Average Coverage” column intentionally blank?

Same as before: the original code is out of date: the matrix and caption fields are inconsistent. I have updated the caption fields.

For pre-calling qc report

1: I don’t think column Q should be all zeros

Yes, I found that the calculation logic in the original code is incorrect. I have updated the code.

2: the values for columns V through AA don’t make sense

Same as before: the original code is out of date: the matrix and caption fields are inconsistent. I have updated the caption fields.

3: Also, the last two columns have values but are missing headers…

Same as before: the original code is out of date: the matrix and caption fields are inconsistent. I have updated the caption fields.
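Several of the answers above trace back to the same cause: the caption fields and the data matrix went out of sync. A small sanity check like the sketch below could flag that before a report is sent out. The function name `find_mismatched_rows` and the example captions are hypothetical and not part of the existing code.

```python
def find_mismatched_rows(caption, rows):
    """Return (row_index, n_fields) for rows whose width differs from the caption."""
    expected = len(caption)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


caption = ["Sample", "Mean Coverage", "Mean Insert Size"]
rows = [
    ["S1", "31.2", "355.5"],
    ["S2", "29.8", "349.1", "0.02"],  # one value has no matching header
]
print(find_mismatched_rows(caption, rows))  # [(1, 4)]
```

An empty result means every row has exactly one value per caption field; it does not prove that the values sit under the right headers.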

I also added some parallel computing features in the code.

I have submitted jobs to redo these 2 reports and will notify you once everything is all set.