Reference mapping against annotated genomes (C. parvum and C. hominis)

nimapak commented 7 years ago

Question 1: Do we want to map the contigs or reads against the reference genomes? Simone: Yes, I think that this is necessary for a number of things, including identification of core and accessory genes,

Nima: OK great, shall we map the contigs we get from the assembly stage against the reference genome? Or map the raw reads against the reference genome? I guess you want to map reads! Simone: well, we will surely map the reads, but to look for rearrangements, one can use MAUVE, and I think contigs

Question 2: What are the reference genomes? We need ENA accessions. C. parvum:? Study: PRJNA144 ASM16534v1 (IOWA strain) Great this has the full assembly, contigs and chromosomes. http://www.ebi.ac.uk/ena/data/view/PRJNA144 http://www.ebi.ac.uk/ena/data/view/GCA_000165345

            C. hominis:?    Study: PRJNA222836

This has WGS assembly: http://www.ebi.ac.uk/ena/data/view/GCA_001593465 And two sets of reads: http://www.ebi.ac.uk/ena/data/view/SRR1015747-SRR1015748 If we want to use the reads here we may need to investigate the difference between read sets. We should also check how they have assembled it. Does C. hominis also have 8 chromosome like C. parvum?

Simone: yes, C. hominis has 8 chromosomes, too. I will double check what we should retrieve for hominis.

Question 3: Is there any other parasite reference genomes that we wish to include? Please list them and also provide the accession numbers for them. Simone: No, I think we should include the two above because are the best reconstruction of the genome, and we also have fresh transcriptomics data. Other genomes are only drafted. Nima: Great thanks. Question 4: What are the standard parameters that you wish to extract from this (A part from coverage)? Simone: I think mapping is useful to estimate whether there is a bias in the sequencing experiments, i.e., if there are regions that are under-represented. Mapping data can also be used to visualize rearrangements. At least in the commercial software we have(CLC Genomics Workbench), mapping is used to call SNPs, because if one has all the sequences from one isolate mapped to a reference genome, then it is possible to exclude SNPs by selecting initial options (e.g., a minimum and/or maximum coverage,quality of the nucleotide and surrounding nucleotides, accounting for bias in the ratio forward/reverse reads, ecc)

Nima: Yes that’s correct, it is needed for SNPs analysis.

xinliu005 commented 7 years ago

Requirement: 1) Build C. hominis reference genome by using 8 chromosome assembly sequences: LN877947-LN877954 2) Build C. parvum reference genome by using 8 chromosome assembly sequences: CM000429-CM000436 3) a) Update reference-mapping.py to run bwa as the mapping software. The input data shall be either one fastq file or two fastq files, the reference genome fasta file*, and the genome name mapped to. If no reference genome provided, the above genomes 1) or 2) will be used as default. The output file will be SAM file created by bwa.

 b) The script need to get a prefix to add to all the file names, lets say if the user wish to have ERR1234 as the prefix all the files that is created by the script shall start by ERR1234_

c) Also use the name of the mapping program in all the file names, so the naming structure for above examples would be:

ERR1234bwa*

d) The genome name 'ch' or 'cp' will be added to the prefix, so the example name will be:

ERR1234_bwach*

e) The run_type 'single' or 'paired' will be added to the prefix, so the example name will be:

ERR1234_bwa_chpaired*

If the user provide the reference genome by themselves, the following genome index files (.bwt, .pac, .ann, .amb, .sa) should be also provided in the same directory of the genome fasta file. The following command can be used to create the above files: bwa index -a bwtsw .fasta

xinliu005 commented 7 years ago

reference-mapping.py, utilities.py, and properties.txt have been committed in.

xinliu005 commented 7 years ago

Considering add the following BAM correction steps:

=== remove duplicates ===== 1) Use picard-tools to remove duplicates in BAM

=== local realignment ===== 2) Use picard-tools to add group information to BAM 3) Use samtools index to index BAM 4) Run GenomeAnalysisTK-RealignerTargetCreator to create intervals of interest 5) Run GenomeAnalysisTK-IndelRealigner to realign BAM 6) Use picard-tools to Fix mate pair info in BAM

Code: === remove duplicates ===== 1) java -jar /nfs/production/seqdb/embl/developer/xin/bin/picard.jar MarkDuplicates INPUT=/nfs/production/seqdb/embl/developer/xin/tickets/snp/test/SRR6148259_paired.sorted.bam OUTPUT=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.bam METRICS_FILE=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.metrics.txt

=== local realignment =====

2) java -jar /nfs/production/seqdb/embl/developer/xin/bin/picard-tools-1.51/AddOrReplaceReadGroups.jar I=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.bam O=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.bam RGID=0 RGLB=xin_lib RGPL=xin_illunima RGPU=xin1 RGSM=0

3) /nfs/production/seqdb/embl/developer/xin/bin/samtools-1.6/samtools index /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.bam

4) java -jar /nfs/production/seqdb/embl/developer/xin/bin/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 4 -R /nfs/production/seqdb/embl/developer/xin/tickets/ref/cp.fasta -I /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.bam -o /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.realign.intervals

5) java -jar /nfs/production/seqdb/embl/developer/xin/bin/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar -T IndelRealigner -R /nfs/production/seqdb/embl/developer/xin/tickets/ref/cp.fasta -targetIntervals /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.realign.intervals -I /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.bam -o /nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.realigned.bam

6) java -jar /nfs/production/seqdb/embl/developer/xin/bin//picard.jar FixMateInformation INPUT=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.realigned.bam OUTPUT=/nfs/production/seqdb/embl/developer/xin/tickets/ref_map/correction/SRR6148259_paired.sorted.dedup.addGroup.realigned.fixmate.bam SO=coordinate CREATE_INDEX=true

xinliu005 commented 6 years ago

reference-mapping.py was committed just now with better structure and bwa alignment running on multiple threads.

xinliu005 commented 6 years ago

reference-mapping.py was committed just now with multiple commands running individually and system exits if any command exits.

xinliu005 commented 6 years ago

After mapping, the next step will be calling SNP. GATK HaplotypeCaller has a default filter "DuplicateReadFilter". This filter recognizes the SAM flag set by MarkDuplicates. It can be disabled from the command line if needed using the -drf argument. So use picard-tools to mark duplicates in BAM can be used instead of remove duplicates.

EBI-COMMUNITY / ebi-parasite

Reference mapping against annotated genomes (C. parvum and C. hominis) #4