
# Velocity2020

NERC Velocity Project analyses using Sanger genomes

Here I'll curate the variant calling pipeline and analyses undertaken using the Lepidoptera genomes generated by Sanger in 2019/2020.

Darwin Tree of Life (DToL) data - live results

## Pipeline

1. Raw to cleaned and processed data

1a. Trim adapter sequence using cutadapt

1b. Concatenate resequenced museum data (some individuals were sequenced more than once)

1c. Repair problems in museum PE data from 1b (BBrepair)

1d. Merge overlapping PE reads in museum data (BBmerge)

2. Map and process

2a. Map museum and modern data to Sanger genome

2b. Correct museum data for possible deamination (MapDamage -> output = corrected bam file)

2c. Downsample modern data to the same depth as the museum data

3. SNP discovery and filtering

3.1 ANGSD

3.1.a. ANGSD filters for SFS (i.e. no MAF)

3.1.b. [ANGSD filters for population genomics]

3.2 VARIANT CALLING

3.2.1 [Call variants with bcftools mpileup and call]()

4. Analyses: Outliers

4a. [Outlier analysis in ANGSD]

4b. [Manhattan plot]

4c. [Table of functions]

4d. [Network analysis]

5. Analysis: Genetic diversity and population structure

6. Analysis: LD

6a. [ANGSD estimate LD across the genome]

## DATA: Genome

Aphantopus hyperantus (Ringlet) was the first genome available (NCBI link), so the pipeline will be set up with this species.

## DATA: WGS

Whole-genome resequencing data were generated for 38 and 40 modern individuals (sampled 2016-2017 and 2019) from a core and an expanding population. Museum data were generated from 48 individuals, plus resequencing of a subset of individuals to increase read coverage.

### Note on renaming files

Liverpool raw data are named with digits and a dash before the sample names, e.g. 33-AH-01-1900-47_191121_L001_R2.fastq.gz. The easiest way to rename them is with the Perl rename script (note that the native Linux rename behaves like mv and is not useful here). Install the Perl script (a version curated here). On bluecp3 I've installed this in my software folder: /newhome/aj18951/software/rename-master/rename.

This works with sed syntax, e.g. the following removes the leading digits and dash from the file names:

../../software/rename-master/rename 's/^[0-9]+-//' *

## Pipeline for Velocity project from raw data to mapped reads

### 1. Raw to cleaned and processed data

#### 1a. Demultiplex and adapter trimming

TIME:

This runs in 1-2 hours for the full dataset (museum + modern)

METHOD:

Modern samples arrive demultiplexed by the sequencing facility, but Museum samples need to be demultiplexed.

We're trimming all adapter sequence from the demultiplexed data with cutadapt. We're also removing all reads shorter than 20 bp and 3'-quality-trimming bases with a PHRED quality score below 20.

If you're running this on BlueCrystal, you'll have to install cutadapt locally first using [these instructions](https://cutadapt.readthedocs.io/en/stable/installation.html) - see below. We're using cutadapt version 3.4:

module load languages/python-anaconda3-5.2.0

module load tools/cmake-3.8.1
module load tools/autoconf-2.69
module load languages/gcc-9.1.0
module add tools/nasm-2.15.05

#install cutadapt in your home directory using the web instructions
pip3 install --user --upgrade cutadapt

#Check that this cutadapt works
~/.local/bin/cutadapt --help

##Check if this directory is in your PATH:
echo $PATH

##And add to PATH if it isn't yet
PATH="$PATH:~/.local/bin/"

##Now you can run cutadapt directly
cutadapt --help

cutadapt version 3.4
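
For orientation, the core cutadapt call that the scripts below wrap looks roughly like this (a sketch, not the project's exact command; the TruSeq adapter sequence and file names are assumptions):

#trim 3' adapters from R1 (-a) and R2 (-A), quality-trim 3' ends at Q20 (-q), drop reads shorter than 20bp
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
-q 20 --minimum-length 20 \
-o sample_R1.cutadapt_filtered.fastq.gz -p sample_R2.cutadapt_filtered.fastq.gz \
sample_R1.fastq.gz sample_R2.fastq.gz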
##### Generate submission script

Run the following scripts to generate the submission script

01a_museum_cutadapt_filtering_trimming.sh

01a_modern_cutadapt_filtering_trimming.sh

##### Submission script

The above scripts will generate this:

01a_parallel_cutadapt_bluecp3.sh

Edit this submission script to submit from your home directory:

1. Set all paths to your home directory if necessary. 

2. Adjust the array range (PBS -t 1-xx) to equal the number of individuals to be analysed.

3. Check that any empty arguments have been removed from the cutadapt command

4. You might have to set the path to cutadapt to find your local version

#### 1b. Concatenate museum reseq data

TIME

~30-40min

METHOD

A subset of individuals (33 per species) were sequenced twice to increase mean depth. The data from both sequencing runs need to be concatenated after adapter trimming. We're using these scripts:

1b_concat.fastq.R1.sh and 1b_concat.fastq.R2.sh
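
These scripts boil down to a cat per sample, since gzip streams can be concatenated directly (a sketch with hypothetical file names; the museum1 folder name is an assumption):

cat 01a_museum_cutadapt_reads/AH-01-1900-01_R1.fastq.gz 01a_museum2_cutadapt_reads/AH-01-1900-01_R1.fastq.gz > 01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R1.concat.fastq.gz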

Reseq data are kept in the following folders:

00_raw_data_museum2

01a_museum2_cutadapt_reads

01a_mus.concat_cutadapt_reads  ## concatenated museum1 and museum2 + all samples that didn't have reseq data added. I'll point to this folder when mapping

02a_museum_mapped  ##see below. This contains all data including concatenated reseq samples. 

##### Rename samples

We don't need the really long names given to the samples by the sequencing facilities. We'll rename samples before proceeding further:

###### Museum

cd 01a_mus.concat_cutadapt_reads/

#move all the log files into a folder
mkdir logfiles
mv *log logfiles

#change the names
~/software/rename-master/rename 's/long_name/new_name/' *gz

##remove all the extra info about lane number and date etc. Final names will be in this format:

AH-01-1900-01_mus_R1.concat.fastq.gz  AH-01-1900-17_R2.fastq.gz         AH-01-1900-34_mus_R1.concat.fastq.gz
AH-01-1900-01_mus_R2.concat.fastq.gz  AH-01-1900-18_R1.fastq.gz         AH-01-1900-34_mus_R2.concat.fastq.gz

###### Modern

I'm mapping mod.core and mod.exp samples simultaneously, but I'll keep the cutadapt reads in their separate folders.

cd 01a_modern_cutadapt_reads
#move all the log files into a folder
mkdir logfiles
mv *log logfiles

#change the names

~/software/rename-master/rename 's/R1_001.fastq.gzcutadapt_filtered/mod.core/' *gz
~/software/rename-master/rename 's/R2_001.fastq.gzcutadapt_filtered/mod.core/' *gz

AH-01-2016-01_mod.core_R1.fastq.gz  AH-01-2016-16_mod.core_R1.fastq.gz  AH-01-2017-29_mod.core_R1.fastq.gz
AH-01-2016-01_mod.core_R2.fastq.gz  AH-01-2016-16_mod.core_R2.fastq.gz  AH-01-2017-29_mod.core_R2.fastq.gz

cd 01a_modern.exp_cutadapt_reads
#move all the log files into a folder
mkdir logfiles
mv *log logfiles

~/software/rename-master/rename 's/R1_001.fastq.gzcutadapt_filtered/mod.exp/' *gz
~/software/rename-master/rename 's/R2_001.fastq.gzcutadapt_filtered/mod.exp/' *gz

AH-02-2019-42_mod.exp_R1.fastq.gz  AH-02-2019-57_mod.exp_R1.fastq.gz  AH-02-2019-71_mod.exp_R1.fastq.gz
AH-02-2019-42_mod.exp_R2.fastq.gz  AH-02-2019-57_mod.exp_R2.fastq.gz  AH-02-2019-71_mod.exp_R2.fastq.gz

#### 1c. Prepare museum data for MapDamage (2b): Repair PE reads

TIME

~1hour

METHOD

Museum data are prone to post-mortem damage, which we will correct for using MapDamage (see 2b below). We need to pre-process the museum reads for this. First we correct any problems with the PE files. The most common problem we've found is mismatches between the R1 and R2 files, e.g. unequal read counts or mismatched read names. We can correct for this using BBTools' repair.sh script.

BBtools can be installed from the downloaded tarball

We have it installed here

/newhome/bzzjrb/Software/bbmap/

We'll repair the museum data using this script: 01c_bbtools_repair_museum_ARRAY.sh

The input files will be all the museum fastq files found in 01a_mus.concat_cutadapt_reads

Create two files with names for the inputs in the home directory for the species (e.g. /E3_Aphantopus_hyperantus_2020/):

ls 01a_mus.concat_cutadapt_reads/*R1*fastq.gz >> R1.museum.names.torepair
ls 01a_mus.concat_cutadapt_reads/*R2*fastq.gz >> R2.museum.names.torepair

##remove the file path from the name files. We need only the file names as input.
sed -i 's:01a_mus.concat_cutadapt_reads/::g' *torepair

This will write all the repaired files to: 01c_musPERepaired

#make the output directory within the species home directory:
mkdir 01c_musPERepaired 
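
Each array task wraps a repair.sh call of this general form (a sketch with hypothetical sample names; in1/in2/out1/out2/outs are BBTools' documented arguments):

repair.sh in1=01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R1.concat.fastq.gz \
in2=01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R2.concat.fastq.gz \
out1=01c_musPERepaired/AH-01-1900-01_R1.repaired.fastq.gz \
out2=01c_musPERepaired/AH-01-1900-01_R2.repaired.fastq.gz \
outs=01c_musPERepaired/AH-01-1900-01_singletons.fastq.gz repair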

Check that the correct array range is specified (e.g. 48 indivs: PBS -t 1-48)

And submit:

qsub 01c_bbtools_repair_museum_ARRAY.sh 

#### 1d. Prepare museum data for MapDamage (2b): Merge overlapping PE reads

TIME

30-40min

METHOD

MapDamage requires that overlapping PE reads be merged. Our museum libraries were sequenced with 75bp PE kits, except for the final (museum4) library which was sequenced with 50bp PE.

We know from the library prep that the museum DNA is quite degraded. Thus we expect that the vast majority of the inserts would have overlapping PE sequences.

We can count and merge these using BBTools' bbmerge.sh script.

Edit 01d_bbtools_merge_museum_ARRAY.sh to set up the array submission script.

Make the output directory and we need the names files again:

mkdir 01d_musAll_merged

ls 01c_musPERepaired/*R1*gz >> R1.museum.names.repaired
ls 01c_musPERepaired/*R2*gz >> R2.museum.names.repaired

sed -i 's:01c_musPERepaired/::g' *repaired
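
Each array task then wraps a bbmerge.sh call along these lines (a sketch with hypothetical file names; in1/in2/out/outu1/outu2 are BBTools' documented arguments):

bbmerge.sh in1=01c_musPERepaired/AH-01-1900-01_R1.repaired.fastq.gz \
in2=01c_musPERepaired/AH-01-1900-01_R2.repaired.fastq.gz \
out=01d_musAll_merged/AH-01-1900-01.repaired.merged.fastq \
outu1=01d_musAll_merged/AH-01-1900-01_R1.unmerged.fastq \
outu2=01d_musAll_merged/AH-01-1900-01_R2.unmerged.fastq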

This outputs 3 files for each individual: a merged file and an unmerged R1 and R2 file. We will proceed only with the merged reads at this point.

As these are no longer PE data, we can map them as single-end reads.

### 2. Map and process

Map museum and modern samples to the Sanger reference genome. Thereafter correct the museum bam files using MapDamage, and downsample the modern data to the same final depth as the corrected museum bam files.

#### 2a. Map to reference genome

TIME:

Museum samples (n=48) ~3 hours

Modern Core (n=38) ~10 hours

Modern Exp (n=40) ~10 hours for all but two samples which had to be restarted. They then ran in 4 hours...

METHOD:

It is more efficient to run this code in the local directory before submitting the mapping script to the queue.

#Index the reference genome if needed. Check if the *fasta.fai* file exists in the SpeciesName/RefGenome/ folder in your local directory. If not, run the indexing code. 

#index reference genome
module load apps/bwa-0.7.15
bwa index RefGenome/*fasta

#Create files with input names

## museum
##Repaired and merged = 1 file per indiv
ls 01d_musAll_merged/*repaired.merged*fastq >> mus.merged.names

sed -i 's:01d_musAll_merged/::g' mus.merged.names

## modern
## Cutadapt and adapter trimmed = 2 files per indiv
ls 01a_modern_cutadapt_reads/*R1_paired* >> R1.modern.names
ls 01a_modern_cutadapt_reads/*R2_paired* >> R2.modern.names

sed -i 's:01a_modern_cutadapt_reads/::g' *names
#Optional 
#If we want to keep the core and expanding samples in separate folders we can leave the path to the indivs in the *names file

#make output directories. 
mkdir 02a_museum_mapped
mkdir 02a_modern_mapped

#Optional
#Add the additional folders in the 02a_modern_mapped folder when working with an EXPANDING species as the output will be written there. 
mkdir 02a_modern_mapped/01a_modern_cutadapt_reads
mkdir 02a_modern_mapped/01a_modern.exp_cutadapt_reads

#Check that you're pointing to the correct reference genome

#Check that the file separator makes sense: 
##sample_name=`echo ${NAME1} | awk -F "_R" '{print $1}'`
#Change the -F "xxx" according to the file names. 
#e.g the above works for files named as follows: 
#HS-01-2016-26_L007_cutadapt_filtered_R2.fastq.gz
#we want only the first part of this name to carry through. 
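
The mapping command at the heart of both array scripts is of this general form (a sketch; read groups and exact output names are assumptions):

module load apps/bwa-0.7.15
module load apps/samtools-1.9.1

#modern PE reads
bwa mem RefGenome/*fasta sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o 02a_modern_mapped/sample.bam

#merged museum reads, mapped as single-end
bwa mem RefGenome/*fasta sample.repaired.merged.fastq | samtools sort -o 02a_museum_mapped/sample.bam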

Run the submission scripts:

02a_MapwithBWAmem.ARRAY_museum.merged.sh

02a_MapwithBWAmem.ARRAY_modern.sh

If we don't merge the museum reads, we can use this script to map the PE reads:

02a_MapwithBWAmem.ARRAY_museum.sh

Check that everything has mapped correctly by checking the file sizes. If the mapping is cut short (e.g. by exceeding the requested walltime), the partial bam file will look complete and can be indexed, but the file will be small (~500 kb) and contain no reads when you view it.

#To determine file size

du -sh *bam   

#To see bam file
module load apps/samtools-1.9.1
samtools view file.bam | head

Check the output with samtools flagstat

module load apps/samtools-1.9.1
samtools flagstat file.bam

#make a flagstat log file for all of the samples
for i in $(ls *bam); do ls $i >>flagstat.log && samtools flagstat $i >> flagstat.log; done

Index the bam files with the script 02a_index.bamfiles.sh

#### 2b. MapDamage run on museum data

TIME

3-4 hours for 48 samples

METHOD

MapDamage2 is a package used to estimate and correct for cytosine deamination (or any other transition/transversion bias) in the data. This is a problem anticipated for ancient DNA, and possibly for museum data.

This needs to be locally installed on BlueCrystal. Follow the git install instructions in the tutorial here

Because the R libraries are loaded into temporary memory, we need to reinstall them every time we want to use MapDamage (i.e. after logging off). If there are any issues, the easiest thing to do is to reinstall MapDamage.

module load languages/R-3.6.3-gcc9.1.0
module load languages/python-3.8.5

R

install.packages("inline")
install.packages("gam")
install.packages("Rcpp")
install.packages("ggplot2")
install.packages("RcppGSL")

MapDamage will be run in the 02a_museum_mapped folder.

1. Create a file listing all the bamfiles

ls *bam > bamlist


2. Copy the script [02b_mapDamage_museum.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02b_mapDamage_museum.sh) to the 02a_museum_mapped folder. Change the job name, the number of threads, and check the path to the reference genome.
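
The core call inside that script runs mapDamage per bam with rescaling switched on, roughly as follows (a sketch; -i, -r and --rescale are mapDamage's documented options, the paths are assumptions). mapDamage writes a results folder per bam containing the damage plots and the rescaled bam:

mapDamage -i sample.bam -r ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna --rescale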

Submit to queue.

Move all the new rescaled bam files to a new folder:

mkdir 02b_museum_mapDamage
mv 02a_museum_processed.mapped/results*/*bam 02b_museum_mapDamage && cd 02b_museum_mapDamage

module load apps/samtools-1.9
for i in $(ls *bam); do ls $i >> flagstat.log && samtools flagstat $i >> flagstat.log; done


And index all the bams using the previous indexing script

cp ../02a_museum_mapped/02a_index.bamfiles.sh .

ls *bam > bamlist

qsub 02a_index.bamfiles.sh


##### ->>>  To call variants skip to section 3.2

#### 2c. Downsample modern data to the same coverage as in the museum samples

Due to the difference in sample quality between museum and modern samples, mean coverage is much higher for the modern data. This may bias the confidence in variant calls downstream. To avoid this problem I will downsample the modern data to the same mean depth as the museum data.

First filter the bam files to include only reads with PHRED quality >20 and properly paired reads using the [02c_Filter_modern_bam_pp.PHRED20.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02c_Filter_modern_bam_pp.PHRED20.sh) script.

We'll need the names file: 

ls 02a_modern_mapped/*bam >> bamfiles.mod.names
sed -i 's:02a_modern_mapped/::' bamfiles.mod.names


Use samtools flagstat to calculate the number of properly paired reads in the recalibrated and filtered museum files.

module load apps/samtools-1.8
for i in $(ls results*/*flt.bam); do ls $i >> mus.flagstat.log && samtools flagstat $i >> mus.flagstat.log; done

Do the same for the modern samples.

Enter these data in the "Rescaled.ProperlyPaired.Q20" column in the Velocity_MapingStatsPerSpecies_AJvR_20190604.xlsx sheet on Dropbox. Calculate the mean number of museum reads and the proportion of modern reads to downsample to.

Use the [02c_Downsample_mod_ARRAY.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02c_Downsample_mod_ARRAY.sh) script to downsample the modern bam files. Remember to change the job name and the PROP variables and create the input file listing all the modern bams.
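The downsampling itself relies on samtools' subsampling flag; per bam the ARRAY script runs something like this (a sketch; -s takes seed.fraction, samtools' documented encoding, so e.g. 42.33 keeps ~33% of reads with seed 42, where 0.33 would be the PROP calculated above):

samtools view -b -s 42.33 sample.flt.bam > sample.downsampled.bam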

### 3.1. ANGSD

There seem to be improperly paired reads in my final dataset. To deal with this I need to change the ANGSD flag -only_proper_pairs to 0. 

Also make sure to use the latest commit of ANGSD. 

I'm using: 

angsd version: 0.933-18-gfd1a21a (htslib: 1.10.2-61-g8859b09) build(May 6 2020 14:42:05)


#### Set up analysis

To speed up the analysis I will split the ANGSD run across the genome; i.e. all indivs will be analysed for regions 0-x in ARRAY1, x-x1 in ARRAY2, etc. For this we need to split the genome up into regions.

Find all the regions (i.e. all chromosomes and contigs) from the reference index file (.fasta.fai):

awk '{print $1}' ../RefGenome/*.fna.fai >> regions

cat regions |wc -l

87

We'll run these in an ARRAY.
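
A minimal sketch of how the regions file feeds the array jobs, using the same sed line-picking idiom as the PBS scripts later in this document (the ANGSD arguments shown are placeholders, not the full filter set):

#PBS -t 1-87

REGION=$(sed "${PBS_ARRAYID}q;d" regions)

#run ANGSD for all individuals, restricted to this region
~/bin/angsd/angsd -b MUS.poplist -r ${REGION}: -GL 1 -doSaf 1 \
-anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -out MUS.${REGION}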


The shortest contig is 11,048 bp and the longest is 18,856,181 bp for Ringlet.

###### *Filters*

I'm starting with bam files, so there are already some filters on the mapping quality of the sequences. Prior to that there are some crude filters during the demultiplexing and trimming steps.

Other possible filters: 

-b[filelist]

-remove_bads 1 : remove reads with 255 flag (not primary, failure and duplicate reads) (1=default)

-uniqueOnly 1 : remove reads with multiple best hits

-minMapQ 20 : PHRED 20. This should already be in place during the mapping.

-minQ 20 : PHRED 20 for individual base score.

-only_proper_pairs 0 : NBNB this flag is changed to 0 because some of the reads in my final MUS files are not properly paired!
(The old filter, -only_proper_pairs 1 (the default), includes only properly paired reads and should already have been applied to the museum reads prior to this.)

-trim 0 : We're not trimming any data

-baq 1 : estimate base alignment quality using samtools method. BAQ1 and BAQ2 give me very different final variant counts. BAQ1 is more stringent, but could be removing more loci. BAQ2 has a higher number of false positives. See [here](https://www.biostars.org/p/440490/), and discussion on the ANGSD forum [here](https://github.com/ANGSD/angsd/issues/106) and [here](https://github.com/ANGSD/angsd/issues/97)

### ALLELE FREQUENCY ESTIMATION

-doMajorMinor 4 : Force Major allele based on reference. The minor allele is then inferred using doMajorMinor 1. This option needs to be used when calculating SFS for multiple populations as ANGSD otherwise determines a minor allele within each population. I.e. this may not be the same across all the populations.

-ref [..fasta] : For doMM 4 above we need to specify a reference genome.

-doMaf 10 : calculate minor allele frequency. Estimators are added together, so doMaf 10 = doMaf 2 (known major, unknown minor) + doMaf 8 (directly from counts)

-doPost 1: calculate posterior probabilities using allele frequencies as prior (or 2 for uniform prior, or 3 for SFS prior)

-SNP_pval 0.05 : Remove sites with a p-value larger than [float]. ** I'm using 0.05 here because too many loci are removed from the MUS dataset at more stringent thresholds. (see [04a_ANGSD_testFilters.md](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_testFilters.md))

-GL 1 : I will estimate genotype likelihoods using the SAMtools model

-minInd 18 : I will remove loci where less than 18 individuals have been genotyped. There are 35-40 indivs per group (MUS, MODC, MODE). This number was chosen to ensure we can find MAF of 2-5% (1/(18*2)=2.8%)   

 * I excluded the minInd filter in the final dataset as genotype likelihoods should take this into account. 

-setMinDepth : Discard site if total depth (across all indivs) is below [int]. Use -doCounts to determine the distribution of depths

-setMaxDepth : Discard site if total depth (across all indivs) is above [int]  ##I'll use mean + 2*SD

-setMinDepthInd 2 : Minimum depth for a locus for an individual. This is only applicable for analyses using counts (-doCounts obligatory)

-setMaxDepthInd [int]: I'll use a max of meanDP + 2xSD of depth.

-doCounts 1 : Count of the nucleotide bases per individual

-dumpCounts 2 : write a file with all the allele counts per position per individual

-rmTriallelic 1 : include only biallelic loci

-checkBamHeaders 1 : check that the bam headers are compatible for all files.

#### MaxDepth

To estimate the setMaxDepth filter per population, I first need to write the depth per site in ANGSD. 
Using the prepared SFS script add in the following: 

-dumpCounts 1 ##This will write global depth count per site in a .pos file


Calculate the mean and SD 

pwd
/Users/alexjvr/2020.postdoc/Velocity/E3/ANGSD_FINAL/DepthEstimates/DepthPerBP

MODE.LR761675.1.OCT14.pos.gz  MODC.LR761675.1.OCT14.pos.gz  MUS.LR761675.1.OCT14.pos.gz

R

MODC.pos <- read.table(gzfile("MODC.LR761675.1.OCT14.pos.gz"), header=T)
MODE.pos <- read.table(gzfile("MODE.LR761675.1.OCT14.pos.gz"), header=T)
MUS.pos <- read.table(gzfile("MUS.LR761675.1.OCT14.pos.gz"), header=T)

mean(MODC.pos$totDepth)+(2*(sd(MODC.pos$totDepth)))
[1] 179.6283
mean(MODE.pos$totDepth)+(2*(sd(MODE.pos$totDepth)))
[1] 293.1328
mean(MUS.pos$totDepth)+(2*(sd(MUS.pos$totDepth)))
[1] 80.07606


#### 3a. ANGSD filters for SFS

It's important to include all loci in the SFS. Don't include any filters that'll affect the allele frequencies (e.g. MAF or SNP_pval filters). 

It's also important to polarise the Major/Minor alleles so that they're the same between the populations. 

This is done with -doMajorMinor 4 : Force Major allele based on reference. The minor allele is then inferred using doMajorMinor 1 (i.e. from GL). This option needs to be used when calculating SFS for multiple populations as ANGSD otherwise determines a minor allele within each population. I.e. this may not be the same across all the populations.

The order of the filters does not seem to affect the outcome (see https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_testFilters.md)

To add depth filters we need to add -doCounts 1.

NBNB for MUS data we have to select only_proper_pairs 0 - this is because the processing with MapDamage changes the reads in some way so that they all get filtered out if we select only_proper_pairs 1. They should already be filtered for proper_pairs based on prior mapping filters. 

The [basic process is](https://github.com/ANGSD/angsd/issues/259): 

###### 1. Estimate SAF for each population (unfolded) 

MUS

Need gcc to be loaded

module load languages/gcc-9.1.0

ANGSD located here:

/newhome/aj18951/bin/angsd/angsd

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MUS.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MUS.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 0 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 144 -doMajorMinor 4 -out MUS -C 50 -baq 1

-> Total number of sites analyzed: 4786628
-> Number of sites retained after filtering: 4762287


MODC

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MODC.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MODC.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 328 -doMajorMinor 4 -out MODC -C 50 -baq 1

-> Total number of sites analyzed: 5715589
-> Number of sites retained after filtering: 5621808


MODE

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MODE.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MODE.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 621 -doMajorMinor 4 -out MODE -C 50 -baq 1

-> Total number of sites analyzed: 5688111
-> Number of sites retained after filtering: 5553623


###### 2. unfolded SAF used to produce folded 2D SFS

Then generate the folded SFS for each population pair per chromosome. For a smaller dataset this can be run on a merged dataset (realSFS cat xx -fold 1 -outnames pop.saf.out), but that is too memory-intensive for the whole-genome datasets.

realSFS pop1.unfolded.saf.idx pop2.unfolded.saf.idx -fold 1 >folded.sfs

~/bin/angsd/misc/realSFS MODC.saf.idx MODE.saf.idx -fold 1 > MODE.MODC.fold.sfs

~/bin/angsd/misc/realSFS MODC.saf.idx MUS.saf.idx -fold 1 > MUS.MODC.fold.sfs


###### 3. unfolded SAF and folded SFS used to generate per-site numerator/denominator of Fst (use -fold in realSFS fst index)

realSFS fst index pop1.unfolded.saf.idx pop2.unfolded.saf.idx -sfs folded.sfs -fold 1 -fstout persite

~/bin/angsd/misc/realSFS fst index MODC.saf.idx MUS.saf.idx -sfs MUS.MODC.fold.sfs -fstout MUS.MODC.fstout
~/bin/angsd/misc/realSFS fst index MODC.saf.idx MODE.saf.idx -sfs MODE.MODC.fold.sfs -fstout MODE.MODC.fstout


###### 4. sum numerator and denominator in windows

realSFS fst stats2 persite.fst.idx -win XXXX -step XXXX > window.fst

~/bin/angsd/misc/realSFS fst stats2 MUS.MODC.fstout.fst.idx -win 50000 -step 10000 > slidingwindow.MUS.MODC
~/bin/angsd/misc/realSFS fst stats2 MODE.MODC.fstout.fst.idx -win 50000 -step 10000 > MODE.MODC.slidingwindow


##### GlobalFST 

1. MODC vs MODE

~/bin/angsd/misc/realSFS fst stats MODE.MODC.fstout.fst.idx

-> FST.Unweight[nObs:5498007]:0.020322 Fst.Weight:0.126046
0.020322 0.126046


2. MODC vs MUS

~/bin/angsd/misc/realSFS fst stats MUS.MODC.fstout.fst.idx
-> Assuming idxname:MUS.MODC.fstout.fst.idx
-> Assuming .fst.gz file: MUS.MODC.fstout.fst.gz
-> FST.Unweight[nObs:4738713]:0.091403 Fst.Weight:0.234494
0.091403 0.234494


##### Diversity stats with ANGSD (theta)

MODE

~/bin/angsd/misc/realSFS MODE.saf.idx -fold 1 > MODE.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MODE.saf.idx -sfs MODE.fold.sfs -outname MODE.fold.thetas

MODC

~/bin/angsd/misc/realSFS MODC.saf.idx -fold 1 > MODC.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MODC.saf.idx -sfs MODC.fold.sfs -outname MODC.fold.thetas

MUS

~/bin/angsd/misc/realSFS MUS.saf.idx -fold 1 > MUS.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MUS.saf.idx -sfs MUS.fold.sfs -outname MUS.fold.thetas


##### Depth vs Fst and Depth vs nucleotide diversity

I'm concerned that the sequencing depth of the samples might affect the final Fst and nucleotide diversity estimates. This is particularly problematic given the difference in sequencing depth between museum and modern samples. Here I plot depth vs Fst and depth vs nucleotide diversity for each of the datasets. 

Nucleotide diversity and Fst have been estimated above. 

###### 1. Obtain depth estimates: 

ANGSD estimates depth using the flags -doCounts 1 -doDepth 1 -dumpCounts 2:

.counts.gz = depth per individual per site. Columns = indivs, Rows = sites
.pos.gz = totalDepth per site across all indivs. 

See [03_DepthEstimate.md](https://github.com/alexjvr1/Velocity2020/blob/master/03_DepthEstimate.md) for my estimates of depth across the different datasets and how I chose my minDP and maxDP filters. 

#### 2. Call GL

Test dataset: 
GL are called using the samtools method (GL1), and we're only allowing SNPs with a p-value of 0.001 for modern samples, and 0.05 for museum samples. 

Total sites analyzed and kept for each dataset: 

MOD Core p0.001 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODC.ANGSD1.ARRAY.o9325111-* | awk -F "analyzed:" '{s+=$2} END {print s}' 381121498

grep "retained" E3_MODC.ANGSD1.ARRAY.o9325111-* | awk -F "filtering:" '{s+=$2} END {print s}' 6645193


MOD Exp p0.001 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODE.ANGSD1.ARRAY.o9325110-* | awk -F "analyzed:" '{s+=$2} END {print s}' 379718544

grep "retained" E3_MODE.ANGSD1.ARRAY.o9325110-* | awk -F "filtering:" '{s+=$2} END {print s}' 5148017


*p0.05 per SNP*

MOD Core p0.05 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODC.ANGSD1.ARRAY.o9541530-* | awk -F "analyzed:" '{s+=$2} END {print s}' 381121498 ##OLD p0.001 381142712

grep "retained" E3_MODC.ANGSD1.ARRAY.o9541530-* | awk -F "filtering:" '{s+=$2} END {print s}' 6645193 ##OLD p0.001 7529770


MOD Exp p0.05 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODE.ANGSD1.ARRAY.o9541531-* | awk -F "analyzed:" '{s+=$2} END {print s}' 379718544 ##old p0.001 379785246

grep "retained" E3_MODE.ANGSD1.ARRAY.o9541531-* |awk -F "filtering:" '{s+=$2} END {print s}' 5148017 ##old p0.001 6265127


MUS

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_ANGSD1.ARRAY.o9403374* | awk -F "analyzed:" '{s+=$2} END {print s}' 346540624

grep "retained" E3_ANGSD1.ARRAY.o9403374* | awk -F "filtering:" '{s+=$2} END {print s}' 23542320


ANGSD is run across all indivs for each contig/scaffold separately. I then need to merge all of the data into a single file.

module load languages/gcc-6.1

~/bin/angsd/misc/realSFS cat BA.HOD.*idx -outnames MUS.MERGED


#### 2d. ISSUES

##### *1. Input bam files*

I've had problems reading merged bam files created by BBmerge into ANGSD. I've reported the issue [here](https://github.com/ANGSD/angsd/issues/260) - it seems to be the same problem encountered by someone using input from minimap2.

[bammer_main] 1 samples in 1 input files
-> Parsing 1 number of samples
No data for chromoId=0 chromoname=LR761647.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program
regSize=0 notDone=0
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"angsdput.arg"
-> Fri Apr 17 16:45:14 2020
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 0
-> Number of sites retained after filtering: 0
[ALL done] cpu-time used = 4.24 sec
[ALL done] walltime used = 4.00 sec


The developers have fixed a bug in the program. In addition I need to change the -only_proper_pairs flag to 0 as there are improperly paired reads in the merged file (i.e. reads located on different scaffolds). 

#### 3. Compare downsampled data to full dataset for modern pops (SFS and GL)

### 3.2. CALL VARIANTS

Call SNPs for one chromosome using bcftools mpileup and call

1. Index the reference genome

This creates a .fai file that lists each chromosome's name and length (plus the byte offsets samtools uses for random access).

module load apps/samtools-1.9.1
samtools faidx RefGenome/*fna


2. Create a folder for calling variants for each population (modern.exp and modern.core are called together)

pwd
$HOME/$SPECIES/

mkdir 03.2_Variant_calling_modern
mkdir 03.2_Variant_calling_museum


3. Copy the variant calling script to each folder

Script here: [03a_variant_calling_bluecp3.sh](https://github.com/alexjvr1/Velocity2020/blob/master/03a_variant_calling_bluecp3.sh)

a) Edit the variables in the script for each species and population

b) Change permissions to make the script executable in your home directory. This should change the colour of the script in ls.

chmod u+x 03a_variant_calling_bluecp3.sh


c) And run 

./03a_variant_calling_bluecp3.sh


This will create a regions file with all the chromosomes and scaffolds (obtained from the .fai indexed reference file), and a submission script (*.sh) that we'll submit to the server.

d) Check the length of the regions file.

Delete all scaffolds (we're only interested in the chromosomes for now), and add a header to the file.

e) Change the array range in the variant calling script to match the number of chromosomes.

This script calls variants for all individuals simultaneously per chromosome. 
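
For reference, the per-chromosome call that the generated submission script runs is of this general form (a sketch; the region, bam list and output names are assumptions):

module load apps/bcftools-1.8
bcftools mpileup -f RefGenome/*fna -b bamlist.modern -r LR761675.1 | \
bcftools call -mv -O z -o 03.2_Variant_calling_modern/LR761675.1.vcf.gz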

f) submit to queue and check that it's running properly

qstat -u username -t


### 4. ANALYSES

Example papers

[Crested Ibis (Feng et al, Current biology, 2019)](https://www.sciencedirect.com/science/article/pii/S0960982218316099)

[Alpine chipmunks (Bi et al, PLoS Genetics, 2019)](https://journals.plos.org/plosgenetics/article?rev=2&id=10.1371/journal.pgen.1008119)

- Also good example of mapdamage use

- PopGen

- outFlank

[Dryas Monkey (van der Valk, MBE, 2020)](https://academic.oup.com/mbe/article/37/1/183/5570178)

[Darwin's finches (Pacheco, GBE, 2020)](https://academic.oup.com/gbe/article/12/3/136/5735467)

- example of using ANGSD to call GL

- diversity stats

[Lynx (Lucena-Perez, MolEcol, 2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15366?casa_token=I6nUZVYLpjAAAAAA%3AjGkueGUyt5AP-RdrD2GeejFl-U2BXVlrXqmYLcck4lOfFy4Yb2jMv3r6ZxXZ9WhGYDcmVDM0oJ8l)

- Genotype calls

- Pop structure

- Diversity measures

[Killer whale ROH (BioRxiv, Andy Foote & Excoffier)](https://www.biorxiv.org/content/10.1101/2020.04.08.031344v1.abstract)

- ROH

#### 4a. Diversity stats

##### 1. Tajima's D and Watterson's theta

We can calculate Tajima's D and Watterson's theta as [diversity stats in ANGSD](http://www.popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests) after estimating the SFS for each population. When using a folded SFS (as we have here), only Tajima's D and Watterson's theta will be meaningful, despite several other stats being calculated by these scripts:

1. Calculate the SFS from the SAF. Use -fold 1 for a folded SFS when the ancestral state is unknown

~/bin/angsd/misc/realSFS NAME.saf.idx -fold 1 > NAME.saf.sfs

or use the script below and pass the names on the command line, where $1 = input SAF and $2 = output file name:

qsub 04a_ANGSD_realSFS_cmdlineInputs.sh -F "NAME.saf.idx NAME.saf.sfs"

2. Estimate thetas from SFS

~/bin/angsd/misc/realSFS saf2theta NAME.saf.idx -sfs NAME.saf.sfs -outname NAME.THETA.theta.gz

3. Estimate thetas in windows from the above global theta file

~/bin/angsd/misc/thetaStat do_stat NAME.theta.idx -win 50000 -step 10000 -outnames NAME.THETA.window.gz

4. Estimate theta per chromosome

~/bin/angsd/misc/thetaStat do_stat NAME.theta.idx


###### Scripts: 

Step1: [04a_ANGSD_realSFS_cmdlineInputs.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_realSFS_cmdlineInputs.sh)

##### 2. observed heterozygosity

Comparison of the number of heterozygous sites in historic vs current datasets

Calculate this in ANGSD

###### 1. Estimate the SFS for each individual separately by modifying the script [04a_ANGSD_INDIV.SAF.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_INDIV.SAF.sh) to first estimate SAF

###### 2. Estimate individual SFS from SAF 

Create a list of all the idx files for each population

cd HET.per.INDIV
ls MUS*idx >> MUS.idx.names
ls MODE*idx >> MODE.idx.names
ls MODC*idx >> MODC.idx.names
ls MUSbaq2*idx >> MUS.idx.baq2.names ##I compared the MUS datasets with the two BAQ settings


Modify the script [04a_ANGSD_SFS.INDIV.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_SFS.INDIV.sh) and run in the same folder for each population

###### 3. Concat all files together and read into R and plot

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS

cat MUS*sfs >> MUS.ALL.sfs
cat MODC*sfs >> MODC.ALL.sfs
cat MODE*sfs >> MODE.ALL.sfs
cat MUSbaq2*sfs >> MUS.ALL.baq2.sfs

copy to mac

scp bluecp3:/newhome/aj18951/E3/04/HET.per.INDIV/*ALL.sfs .

in R

a<-read.table("MODC.ALL.sfs", header=F) b<-read.table("MODE.ALL.sfs", header=F) c<- read.table("MUS.ALL.sfs", header=F) d<-read.table("MUS.ALL.baq2.sfs", header=F) e<-read.table("MODC.ALL.baq2.sfs", header=F) f<-read.table("MODE.ALL.baq2.sfs", header=F)

colnames(a) <- c("Hom1", "Het", "Hom2") a$pop <- "MODC" colnames(b) <- c("Hom1", "Het", "Hom2") b$pop <- "MODE" colnames(c) <- c("Hom1", "Het", "Hom2") c$pop <- "MUS" colnames(d) <- c("Hom1", "Het", "Hom2") d$pop <- "MUS.baq2" colnames(e) <- c("Hom1", "Het", "Hom2") e$pop <- "MODC.baq2" colnames(f) <- c("Hom1", "Het", "Hom2") f$pop <- "MODE.baq2"

abc <- rbind(a,b,c,d,e,f)
abc$HetFreq <- abc$Het/(abc$Hom1+abc$Het+abc$Hom2)

library(ggplot2)
ggplot(abc, aes(pop, HetFreq, colour=pop)) + geom_boxplot()

Find the heterozygosity for each individual and paste into Table:Ringlet_DiversityStats.xlsx on Dropbox


##### 2b. Depth vs Fst

Estimate the SFS for all three population pairs, and then estimate Fst:

Pop1=MODC

Pop2=MUS

Pop3=MODE

04a_ANGSD_FINAL/SFS_and_Fst

module load languages/gcc-6.1
~/bin/angsd/misc/realSFS fst index MUS/MUS.LR761675.1.saf.idx MODE/MODE.LR761675.1.saf.idx -sfs MUS.MODE.LR75.test.fold.sfs -fstout MUS.MODE.fstout
~/bin/angsd/misc/realSFS fst stats MUS.MODE.fstout.fst.idx

-> Assuming idxname:MUS.MODE.fstout.fst.idx
-> Assuming .fst.gz file: MUS.MODE.fstout.fst.gz
-> FST.Unweight[nObs:4722251]:0.017379 Fst.Weight:0.314412

0.017379 0.314412

~/bin/angsd/misc/realSFS fst index MODC/MODC.LR761675.1.saf.idx MUS/MUS.LR761675.1.saf.idx MODE/MODE.LR761675.1.saf.idx -sfs MODC.MUS.LR75.test.fold.sfs -sfs MODC.MODE.LR75.test.fold.sfs -sfs MUS.MODE.LR75.test.fold.sfs -fstout 3pops.fstout

~/bin/angsd/misc/realSFS fst stats 3pops.fstout.fst.idx

~/bin/angsd/misc/realSFS fst stats2 3pops.fstout.fst.idx -win 50000 -step 10000 > 3pops.slidingwindow.fst

The above script provides a window-based estimate of Fst across each chromosome. 

The depth estimates need to be obtained by running samtools depth on the original bamfiles. The following script writes a depth estimate for each individual at each site:

#!/bin/bash
###########################################
# (c) Alexandra Jansen van Rensburg
# last modified 12/07/2019 05:49
###########################################

# Estimate per-site depth for all bamfiles listed in bamlist

#PBS -N E1.index ##job name
#PBS -l nodes=1:ppn=1 #nr of nodes and processors per node
#PBS -l mem=16gb #RAM
#PBS -l walltime=10:00:00 ##wall time
#PBS -j oe #concatenates error and output files (with prefix job1)
#PBS -t 1-48

#run job in working directory
cd $PBS_O_WORKDIR

#load modules
module load apps/samtools-1.8

#Set up array
NAME=$(sed "${PBS_ARRAYID}q;d" bamlist)

#Run script
echo "Output depth stats for ${NAME}"
printf "\n"
echo "time samtools depth ${NAME} > ${NAME}.depth"
time samtools depth ${NAME} > ${NAME}.depth


Here I'm extracting chromosome LR..75 and estimating individual depth: 

#!/bin/bash
###########################################
# (c) Alexandra Jansen van Rensburg
# last modified 12/07/2019 05:49
###########################################

# Extract chromosome LR761675.1 and estimate depth for all bamfiles listed in bamlist

#PBS -N LR75 ##job name
#PBS -l nodes=1:ppn=1 #nr of nodes and processors per node
#PBS -l mem=16gb #RAM
#PBS -l walltime=10:00:00 ##wall time
#PBS -j oe #concatenates error and output files (with prefix job1)
#PBS -t 1-48

#run job in working directory
cd $PBS_O_WORKDIR

#load modules
module load apps/samtools-1.8

#Set up array
NAME=$(sed "${PBS_ARRAYID}q;d" bamlist)

#Run script
#&& ensures each command completes before the next starts
echo "Output depth stats for ${NAME}"
printf "\n"
echo "time samtools view -b ${NAME} LR761675.1 > ${NAME}.LR75.bam" && \
time samtools view -b ${NAME} LR761675.1 > ${NAME}.LR75.bam && \
echo "time samtools index ${NAME}.LR75.bam" && \
time samtools index ${NAME}.LR75.bam && \
echo "time samtools depth ${NAME}.LR75.bam > ${NAME}.LR75.depth" && \
time samtools depth ${NAME}.LR75.bam > ${NAME}.LR75.depth


For the modern samples I'm using the *flt.bam* files as they have been filtered for PHRED quality scores > 20.

We want to plot variance in depth per site vs Fst and nucleotide diversity. So we need to join each of the depth files together within each pop (e.g. a MUS depth file), where each column is an individual, and each row is a site. At the same time we need to add missing data or gaps where loci do not co-occur between populations. This can all be done in R: 

Multi-merge script from [here](https://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/)
Modified to read tables and to merge (all=T) by including NA for all the missing data (otherwise only the intersecting data will end up in the final file). 

multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.table(file=x, header=T)})
  Reduce(function(x,y) {merge(x, y, all=T)}, datalist)
}

After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together, and only those files. With this in mind, I have two tips:

1. Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder.

2. In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.


##### 3. nucleotide diversity in windows

ThetaD from ANGSD window-based approach

Fig 3a in Feng et al: Distribution of nucleotide diversity across chromosomes in old vs new in 5Mb windows using nuc.div function in pegas

##### 4. Runs of Homozygosity 

Runs of homozygosity: Dryas monkey MS

Convert output to plink genotype calls: http://www.popgen.dk/angsd/index.php/Plink

Only use indivs with mean depth >> 3x? 

Use Plink to look for ROH in sliding window blocks of 50SNPs with one SNP per xkb (50kb?)
"allow for a maximum of one heterozygous and five missing calls per window before we considered the ROH to be broken"

##### 5. IBD in windows across chromosomes

Fig 3b in Feng et al: Distribution of IBD across chromosomes

##### 6. Deleterious load

Deleterious load: See Feng et al. 2019 method

#### 4b. Population structure

NJ phylogeny (Feng et al)

PCA [using ANGSD](https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.5231). 

Use [PCAngsd](http://www.popgen.dk/software/index.php/PCAngsd) for this. We need Beagle output from ANGSD

For this we want to estimate genotype likelihoods (-GL 1) with a confidence filter (-SNP_pval xx) and Beagle output (-doGlf 2)

Basic options used: 

time $angsd -b $POPLIST -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 \
-only_proper_pairs 1 -r ${REGION} -GL 1 -doGlf 2 -out PCAngsd/GL.ALL.${REGION} \
-doSaf 1 -ref ../RefGenome/*fna -anc ../RefGenome/*fna -rmTriallelic 1 \
-doCounts 1 -dumpCounts 2 -doMajorMinor 4 -doMaf 1 -skipTriallelic 1 \
-setMinDepthInd $minDP -setMaxDepth $MAXDP -baq 1 -C 50 \
-SNP_pval 0.001 -doGeno 32 -doPost 1

Where:

POPLIST = all of the bam files for all of the populations
REGION = chromosome, run as an array script
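
Once the Beagle GL file exists, PCAngsd itself is run on it, along these lines (a sketch; -beagle/-o/-threads are PCAngsd's documented options, the file names are assumptions). The output .cov covariance matrix is then eigendecomposed (e.g. with eigen() in R) to get the PCs:

python pcangsd.py -beagle PCAngsd/GL.ALL.beagle.gz -o PCAngsd/GL.ALL.pcangsd -threads 4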


I initially called genotypes for all 3 populations together, but as the GLs are then based on all bam files, I ended up with equal likelihoods for all three genotypes for the majority of the loci. This approach is also fundamentally flawed as the likelihood of the museum data cannot be based on the modern data. 

My new approach is to estimate the GL files for each of the populations separately and then combine the GL files for the loci found in all three datasets. 

I will try this with different filter sets. The -doMM 1 and -doMaf 1 should be the strictest settings. 

1. p-val 0.01, doMajorMinor 4, doMaf 2

2. p-val 0.05; -doMM 1; -doMaf 2

3. p-val 0.05; -doMM 4; -doMaf 1

4. p-val 0.05; -doMM 1; -doMaf 1

5. p-val 0.01; -doMM 1; -doMaf 1

-doMaf 1 vs 2 

-doMM 1 vs 4

Before concatenating these files I checked the proportion of equal likelihood genotypes in each dataset. This seemed unaffected by the filter set, and was high for all three datasets. 

These are results for LR761675.1

MODC: ~50k loci; 38% equal likelihoods

MODE: ~57k loci; 29% equal likelihoods

MUS: ~5.5k loci; 58% equal likelihoods

Count the number of loci. LR for chromosomes, CADX for contigs.

grep LR *beagle |wc -l

Check the number of indivs in the beagle file by looking at the top of the file

Total number of GLs = #loci x #indivs x 3

Count number of occurrences of equal likelihood

grep -i -o "0.333" *beagle |wc -l

frequency = #occurrences/Total GLs


##### Filter Beagle GLs

https://faculty.washington.edu/browning/beagle_utilities/utilities.html

If a genotype has less than xx likelihood it's marked as missing data. But how does this then affect the PCA? 

#### 4c. Outliers: map and identify candidates in the region

PCAngsd as outlier approach?? Does this make sense? 

##### Fst in windows to find outlying regions in ANGSD

ANGSD generates a ML SFS from genotype likelihoods for each population. It then uses the SFS as a prior in an empirical Bayes approach to estimate the posterior probability for all possible allelic frequencies at each site. (Method: Fumagalli et al. 2013) 

Plot time vs space Fst plots for each chromosome. I'm using the Ringlet chromosome names as in the genome MS. Each chromosome is shown in two plots: plot 1 = MODC vs MODE, plot 2 = MODC vs MUS.

Modern graphs

p1 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761647.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("Mod Core vs Mod Expanding LR761647.1") + theme(axis.title.x = element_blank())

p2 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761648.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("Mod Core vs Mod Expanding LR761648.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p3 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761649.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761649.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p4 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761650.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761650.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p5 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761651.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761651.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p6 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761652.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761652.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p7 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761653.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761653.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p8 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761654.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761654.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p9 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761655.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761655.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p10 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761656.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761656.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p11 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761657.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761657.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p12 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761658.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761658.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p13 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761659.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761659.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p14 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761660.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761660.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p15 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761661.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761661.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p16 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761662.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761662.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p17 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761663.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761663.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p18 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761664.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761664.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p19 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761665.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761665.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p20 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761666.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761666.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p21 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761667.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761667.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p22 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761668.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761668.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p23 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761669.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761669.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p24 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761670.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761670.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p25 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761671.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761671.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p26 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761672.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761672.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p27 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761673.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761673.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p28 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761674.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761674.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p29 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761675.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761675.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())



###### Synteny between modern and museum samples. 

BLAST with FlyBase & Enrichment analysis using PANTHER  

See [this](https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15188?casa_token=X-WHMot7TDcAAAAA%3Abn7IwwiinA44JDoEU-yuVV3iLk4RkXwcCU1av3_hKRG1hgKDNaCzPHbrEGlRCBk5j8bMcIW6ynjT) example paper

#### 4d. LD analyses

Estimate LD with ngsLD (https://github.com/fgvieira/ngsLD) and plot r2 decay using its fit_LDdecay.R script.

Can use GL as input

See pigeon paper 

#### 4e.