
# Velocity2020

NERC Velocity Project analyses using Sanger genomes

Here I'll curate the variant calling pipeline and analyses undertaken using the Lepidoptera genomes generated by Sanger in 2019/2020.

Darwin Tree of Life (DToL) data - live results

## Pipeline

1. Raw to cleaned and processed data

1a. Trim adapter sequence using cutadapt

1b. Concatenate resequenced museum data (some individuals were sequenced more than once)

1c. Repair problems in museum PE data from 1b (BBrepair)

1d. Merge overlapping PE reads in museum data (BBmerge)

2. Map and process

2a. Map museum and modern data to Sanger genome

2b. Correct museum data for possible deamination (MapDamage -> output = corrected bam file)

2c. Downsample modern data to the same depth as the museum data

3. SNP discovery and filtering

3.1 ANGSD

3.1.a. ANGSD filters for SFS (i.e. no MAF)

3.1.b. [ANGSD filters for population genomics]

3.2 VARIANT CALLING

3.2.1 [Call variants with bcftools mpileup and call]()

4. Analyses: Outliers

4a. [Outlier analysis in ANGSD]

4b. [Manhattan plot]

4c. [Table of functions]

4d. [Network analysis]

5. Analysis: Genetic diversity and population structure

6. Analysis: LD

6a. [ANGSD estimate LD across the genome]

## DATA: Genome

Aphantopus hyperantus (Ringlet) was the first genome available (NCBI link), so the pipeline will be set up with this species.

## DATA: WGS

Whole-genome resequencing data were generated for 38 and 40 modern individuals (sampled 2016-2017 and 2019) from a core and an expanding population. Museum data were generated from 48 individuals, plus resequencing of a subset of individuals to increase read coverage.

### Note on renaming files

Liverpool raw data are named with digits and a dash before the sample names, e.g. 33-AH-01-1900-47_191121_L001_R2.fastq.gz. The easiest way to rename them is with the Perl rename script (note that the native Linux rename behaves like mv and is not useful here). Install the Perl script (a version curated here). On bluecp3 I've installed this in my software folder: /newhome/aj18951/software/rename-master/rename.

This works with sed syntax, e.g. the following removes the leading digits and dash from the file names:

../../software/rename-master/rename 's/^[0-9]+-//' *

## Pipeline for Velocity project from raw data to mapped reads

### 1. Raw to cleaned and processed data

#### 1a. Demultiplex and adapter trimming

TIME:

This runs in 1-2 hours for the full dataset (museum + modern)

METHOD:

Modern samples arrive demultiplexed by the sequencing facility, but Museum samples need to be demultiplexed.

We're trimming all adapter sequence from the demultiplexed data with cutadapt. We're also removing all reads shorter than 20 bp and 3'-quality-trimming bases with a PHRED quality score below 20.

If you're running this on BlueCrystal, you'll have to install cutadapt locally first using [these instructions](https://cutadapt.readthedocs.io/en/stable/installation.html) - see below. We're using cutadapt version 3.4:

module load languages/python-anaconda3-5.2.0

module load tools/cmake-3.8.1
module load tools/autoconf-2.69
module load languages/gcc-9.1.0
module add tools/nasm-2.15.05

#install cutadapt in your home directory using the web instructions
pip3 install --user --upgrade cutadapt

#Check that this cutadapt works
~/.local/bin/cutadapt --help

##Check if this directory is in your PATH:
echo $PATH

##And add to PATH if it isn't yet
PATH="$PATH:~/.local/bin/"

##Now you can run cutadapt directly
cutadapt --help

cutadapt version 3.4
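
For orientation, the core cutadapt call that the scripts below wrap looks roughly like this (a sketch, not the project's exact command; the TruSeq adapter sequence and file names are assumptions):

#trim 3' adapters from R1 (-a) and R2 (-A), quality-trim 3' ends at Q20 (-q), drop reads shorter than 20bp
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
-q 20 --minimum-length 20 \
-o sample_R1.cutadapt_filtered.fastq.gz -p sample_R2.cutadapt_filtered.fastq.gz \
sample_R1.fastq.gz sample_R2.fastq.gz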
##### Generate submission script

Run the following scripts to generate the submission script

01a_museum_cutadapt_filtering_trimming.sh

01a_modern_cutadapt_filtering_trimming.sh

##### Submission script

The above scripts will generate this:

01a_parallel_cutadapt_bluecp3.sh

Edit this submission script to submit from your home directory:

1. Set all paths to your home directory if necessary. 

2. Adjust the array range (PBS -t 1-xx) to equal the number of individuals to be analysed.

3. Check that any empty arguments have been removed from the cutadapt command

4. You might have to set the path to cutadapt to find your local version

#### 1b. Concatenate museum reseq data

TIME

~30-40min

METHOD

A subset of individuals (33 per species) were sequenced twice to increase mean depth. The data from both sequencing runs need to be concatenated after adapter trimming. We're using these scripts:

1b_concat.fastq.R1.sh and 1b_concat.fastq.R2.sh
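
These scripts boil down to a cat per sample, since gzip streams can be concatenated directly (a sketch with hypothetical file names; the museum1 folder name is an assumption):

cat 01a_museum_cutadapt_reads/AH-01-1900-01_R1.fastq.gz 01a_museum2_cutadapt_reads/AH-01-1900-01_R1.fastq.gz > 01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R1.concat.fastq.gz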

Reseq data are kept in the following folders:

00_raw_data_museum2

01a_museum2_cutadapt_reads

01a_mus.concat_cutadapt_reads  ## concatenated museum1 and museum2 + all samples that didn't have reseq data added. I'll point to this folder when mapping

02a_museum_mapped  ##see below. This contains all data including concatenated reseq samples. 

##### Rename samples

We don't need the really long names given to the samples by the sequencing facilities. We'll rename samples before proceeding further:

###### Museum

cd 01a_mus.concat_cutadapt_reads/

#move all the log files into a folder
mkdir logfiles
mv *log logfiles

#change the names
~/software/rename-master/rename 's/long_name/new_name/' *gz

##remove all the extra info about lane number and date etc. Final names will be in this format:

AH-01-1900-01_mus_R1.concat.fastq.gz  AH-01-1900-17_R2.fastq.gz         AH-01-1900-34_mus_R1.concat.fastq.gz
AH-01-1900-01_mus_R2.concat.fastq.gz  AH-01-1900-18_R1.fastq.gz         AH-01-1900-34_mus_R2.concat.fastq.gz

###### Modern

I'm mapping mod.core and mod.exp samples simultaneously, but I'll keep the cutadapt reads in their separate folders.

cd 01a_modern_cutadapt_reads
#move all the log files into a folder
mkdir logfiles
mv *log logfiles

#change the names

~/software/rename-master/rename 's/R1_001.fastq.gzcutadapt_filtered/mod.core/' *gz
~/software/rename-master/rename 's/R2_001.fastq.gzcutadapt_filtered/mod.core/' *gz

AH-01-2016-01_mod.core_R1.fastq.gz  AH-01-2016-16_mod.core_R1.fastq.gz  AH-01-2017-29_mod.core_R1.fastq.gz
AH-01-2016-01_mod.core_R2.fastq.gz  AH-01-2016-16_mod.core_R2.fastq.gz  AH-01-2017-29_mod.core_R2.fastq.gz

cd 01a_modern.exp_cutadapt_reads
#move all the log files into a folder
mkdir logfiles
mv *log logfiles

~/software/rename-master/rename 's/R1_001.fastq.gzcutadapt_filtered/mod.exp/' *gz
~/software/rename-master/rename 's/R2_001.fastq.gzcutadapt_filtered/mod.exp/' *gz

AH-02-2019-42_mod.exp_R1.fastq.gz  AH-02-2019-57_mod.exp_R1.fastq.gz  AH-02-2019-71_mod.exp_R1.fastq.gz
AH-02-2019-42_mod.exp_R2.fastq.gz  AH-02-2019-57_mod.exp_R2.fastq.gz  AH-02-2019-71_mod.exp_R2.fastq.gz

#### 1c. Prepare museum data for MapDamage (2b): Repair PE reads

TIME

~1hour

METHOD

Museum data are prone to post-mortem damage, which we will correct for using MapDamage (see 2b below). We need to pre-process the museum reads for this. First we correct any problems with the PE files. The most common problem we've found is mismatches between the R1 and R2 files, e.g. unequal read counts or mismatched read names. We can correct for this using BBTools' repair.sh script.

BBtools can be installed from the downloaded tarball

We have it installed here

/newhome/bzzjrb/Software/bbmap/

We'll repair the museum data using this script: 01c_bbtools_repair_museum_ARRAY.sh

The input files will be all the museum fastq files found in 01a_mus.concat_cutadapt_reads

Create two files with names for the inputs in the home directory for the species (e.g. /E3_Aphantopus_hyperantus_2020/):

ls 01a_mus.concat_cutadapt_reads/*R1*fastq.gz >> R1.museum.names.torepair
ls 01a_mus.concat_cutadapt_reads/*R2*fastq.gz >> R2.museum.names.torepair

##remove the file path from the name files. We need only the file names as input.
sed -i 's:01a_mus.concat_cutadapt_reads/::g' *torepair

This will write all the repaired files to: 01c_musPERepaired

#make the output directory within the species home directory:
mkdir 01c_musPERepaired 
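
Each array task wraps a repair.sh call of this general form (a sketch with hypothetical sample names; in1/in2/out1/out2/outs are BBTools' documented arguments):

repair.sh in1=01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R1.concat.fastq.gz \
in2=01a_mus.concat_cutadapt_reads/AH-01-1900-01_mus_R2.concat.fastq.gz \
out1=01c_musPERepaired/AH-01-1900-01_R1.repaired.fastq.gz \
out2=01c_musPERepaired/AH-01-1900-01_R2.repaired.fastq.gz \
outs=01c_musPERepaired/AH-01-1900-01_singletons.fastq.gz repair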

Check that the correct array range is specified (e.g. 48 indivs: PBS -t 1-48)

And submit:

qsub 01c_bbtools_repair_museum_ARRAY.sh 

#### 1d. Prepare museum data for MapDamage (2b): Merge overlapping PE reads

TIME

30-40min

METHOD

MapDamage requires that overlapping PE reads be merged. Our museum libraries were sequenced with 75bp PE kits, except for the final (museum4) library which was sequenced with 50bp PE.

We know from the library prep that the museum DNA is quite degraded. Thus we expect that the vast majority of the inserts would have overlapping PE sequences.

We can count and merge these using BBTools' bbmerge.sh script.

Edit 01d_bbtools_merge_museum_ARRAY.sh to set up the array submission script.

Make the output directory and we need the names files again:

mkdir 01d_musAll_merged

ls 01c_musPERepaired/*R1*gz >> R1.museum.names.repaired
ls 01c_musPERepaired/*R2*gz >> R2.museum.names.repaired

sed -i 's:01c_musPERepaired/::g' *repaired
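
Each array task then wraps a bbmerge.sh call along these lines (a sketch with hypothetical file names; in1/in2/out/outu1/outu2 are BBTools' documented arguments):

bbmerge.sh in1=01c_musPERepaired/AH-01-1900-01_R1.repaired.fastq.gz \
in2=01c_musPERepaired/AH-01-1900-01_R2.repaired.fastq.gz \
out=01d_musAll_merged/AH-01-1900-01.repaired.merged.fastq \
outu1=01d_musAll_merged/AH-01-1900-01_R1.unmerged.fastq \
outu2=01d_musAll_merged/AH-01-1900-01_R2.unmerged.fastq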

This outputs 3 files for each individual: a merged file and an unmerged R1 and R2 file. We will proceed only with the merged reads at this point.

As these are no longer PE data, we can map them as single-end reads.

### 2. Map and process

Map museum and modern samples to the Sanger reference genome. Thereafter correct the museum bam files using MapDamage, and downsample the modern data to the same final depth as the corrected museum bam files.

#### 2a. Map to reference genome

TIME:

Museum samples (n=48) ~3 hours

Modern Core (n=38) ~10 hours

Modern Exp (n=40) ~10 hours for all but two samples which had to be restarted. They then ran in 4 hours...

METHOD:

It is more efficient to run this code in the local directory before submitting the mapping script to the queue.

#Index the reference genome if needed. Check if the *fasta.fai* file exists in the SpeciesName/RefGenome/ folder in your local directory. If not, run the indexing code. 

#index reference genome
module load apps/bwa-0.7.15
bwa index RefGenome/*fasta

#Create files with input names

## museum
##Repaired and merged = 1 file per indiv
ls 01d_musAll_merged/*repaired.merged*fastq >> mus.merged.names

sed -i 's:01d_musAll_merged/::g' mus.merged.names

## modern
## Cutadapt and adapter trimmed = 2 files per indiv
ls 01a_modern_cutadapt_reads/*R1_paired* >> R1.modern.names
ls 01a_modern_cutadapt_reads/*R2_paired* >> R2.modern.names

sed -i 's:01a_modern_cutadapt_reads/::g' *names
#Optional 
#If we want to keep the core and expanding samples in separate folders we can leave the path to the indivs in the *names file

#make output directories. 
mkdir 02a_museum_mapped
mkdir 02a_modern_mapped

#Optional
#Add the additional folders in the 02a_modern_mapped folder when working with an EXPANDING species as the output will be written there. 
mkdir 02a_modern_mapped/01a_modern_cutadapt_reads
mkdir 02a_modern_mapped/01a_modern.exp_cutadapt_reads

#Check that you're pointing to the correct reference genome

#Check that the file separator makes sense: 
##sample_name=`echo ${NAME1} | awk -F "_R" '{print $1}'`
#Change the -F "xxx" according to the file names. 
#e.g the above works for files named as follows: 
#HS-01-2016-26_L007_cutadapt_filtered_R2.fastq.gz
#we want only the first part of this name to carry through. 
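
The mapping command at the heart of both array scripts is of this general form (a sketch; read groups and exact output names are assumptions):

module load apps/bwa-0.7.15
module load apps/samtools-1.9.1

#modern PE reads
bwa mem RefGenome/*fasta sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o 02a_modern_mapped/sample.bam

#merged museum reads, mapped as single-end
bwa mem RefGenome/*fasta sample.repaired.merged.fastq | samtools sort -o 02a_museum_mapped/sample.bam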

Run the submission scripts:

02a_MapwithBWAmem.ARRAY_museum.merged.sh

02a_MapwithBWAmem.ARRAY_modern.sh

If we don't merge the museum reads, we can use this script to map the PE reads:

02a_MapwithBWAmem.ARRAY_museum.sh

Check that everything has mapped correctly by checking the file sizes. If the mapping is cut short (e.g. by exceeding the requested walltime), the partial bam file will look complete and can be indexed, but the file will be small (~500 kb) and contain no reads when you view it.

#To determine file size

du -sh *bam   

#To see bam file
module load apps/samtools-1.9.1
samtools view file.bam | head

Check the output with samtools flagstat

module load apps/samtools-1.9.1
samtools flagstat file.bam

#make a flagstat log file for all of the samples
for i in $(ls *bam); do ls $i >>flagstat.log && samtools flagstat $i >> flagstat.log; done

Index the bam files with the script 02a_index.bamfiles.sh

#### 2b. MapDamage run on museum data

TIME

3-4 hours for 48 samples

METHOD

MapDamage2 is a package used to estimate and correct for cytosine deamination (or any other transition/transversion bias) in the data. This is a problem anticipated for ancient DNA, and possibly for museum data.

This needs to be locally installed on BlueCrystal. Follow the git install instructions in the tutorial here

Because the R libraries are loaded into temporary memory, we need to reinstall them every time we want to use MapDamage (i.e. after logging off). If there are any issues, the easiest thing to do is to reinstall MapDamage.

module load languages/R-3.6.3-gcc9.1.0
module load languages/python-3.8.5

R

install.packages("inline")
install.packages("gam")
install.packages("Rcpp")
install.packages("ggplot2")
install.packages("RcppGSL")

MapDamage will be run in the 02a_museum_mapped folder.

1. Create a file listing all the bamfiles

ls *bam > bamlist


2. Copy the script [02b_mapDamage_museum.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02b_mapDamage_museum.sh) to the 02a_museum_mapped folder. Change the job name, the number of threads, and check the path to the reference genome.
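
The core call inside that script runs mapDamage per bam with rescaling switched on, roughly as follows (a sketch; -i, -r and --rescale are mapDamage's documented options, the paths are assumptions). mapDamage writes a results folder per bam containing the damage plots and the rescaled bam:

mapDamage -i sample.bam -r ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna --rescale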

Submit to queue.

Move all the new rescaled bam files to a new folder:

mkdir 02b_museum_mapDamage
mv 02a_museum_processed.mapped/results*/*bam 02b_museum_mapDamage && cd 02b_museum_mapDamage

module load apps/samtools-1.9
for i in $(ls *bam); do ls $i >> flagstat.log && samtools flagstat $i >> flagstat.log; done


And index all the bams using the previous indexing script

cp ../02a_museum_mapped/02a_index.bamfiles.sh .

ls *bam > bamlist

qsub 02a_index.bamfiles.sh


##### ->>>  To call variants skip to section 3.2

#### 2c. Downsample modern data to the same coverage as in the museum samples

Due to the difference in sample quality between museum and modern samples, mean coverage is much higher for the modern data. This may bias the confidence in variant calls downstream. To avoid this problem I will downsample the modern data to the same mean depth as the museum data.

First filter the bam files to include only reads with PHRED quality >20 and properly paired reads using the [02c_Filter_modern_bam_pp.PHRED20.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02c_Filter_modern_bam_pp.PHRED20.sh) script.

We'll need the names file: 

ls 02a_modern_mapped/*bam >> bamfiles.mod.names
sed -i 's:02a_modern_mapped/::' bamfiles.mod.names


Use samtools flagstat to calculate the number of properly paired reads in the recalibrated and filtered museum files.

module load apps/samtools-1.8
for i in $(ls results*/*flt.bam); do ls $i >> mus.flagstat.log && samtools flagstat $i >> mus.flagstat.log; done

Do the same for the modern samples.

Enter these data in the "Rescaled.ProperlyPaired.Q20" column in the Velocity_MapingStatsPerSpecies_AJvR_20190604.xlsx sheet on Dropbox. Calculate the mean number of museum reads and the proportion of modern reads to downsample to.

Use the [02c_Downsample_mod_ARRAY.sh](https://github.com/alexjvr1/Velocity2020/blob/master/02c_Downsample_mod_ARRAY.sh) script to downsample the modern bam files. Remember to change the job name and the PROP variables and create the input file listing all the modern bams.
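The downsampling itself relies on samtools' subsampling flag; per bam the ARRAY script runs something like this (a sketch; -s takes seed.fraction, samtools' documented encoding, so e.g. 42.33 keeps ~33% of reads with seed 42, where 0.33 would be the PROP calculated above):

samtools view -b -s 42.33 sample.flt.bam > sample.downsampled.bam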

### 3.1. ANGSD

There seem to be improperly paired reads in my final dataset. To deal with this I need to change the ANGSD flag -only_proper_pairs to 0. 

Also make sure to use the latest commit of ANGSD. 

I'm using: 

angsd version: 0.933-18-gfd1a21a (htslib: 1.10.2-61-g8859b09) build(May 6 2020 14:42:05)


#### Set up analysis

To speed up the analysis I will split the ANGSD run across the genome; i.e. all indivs will be analysed for regions 0-x in ARRAY1, x-x1 in ARRAY2, etc. For this we need to split the genome up into regions.

Find all the regions (i.e. all chromosomes and contigs) from the reference index file (.fasta.fai):

awk '{print $1}' ../RefGenome/*.fna.fai >> regions

cat regions |wc -l

87

We'll run these in an ARRAY.
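
A minimal sketch of how the regions file feeds the array jobs, using the same sed line-picking idiom as the PBS scripts later in this document (the ANGSD arguments shown are placeholders, not the full filter set):

#PBS -t 1-87

REGION=$(sed "${PBS_ARRAYID}q;d" regions)

#run ANGSD for all individuals, restricted to this region
~/bin/angsd/angsd -b MUS.poplist -r ${REGION}: -GL 1 -doSaf 1 \
-anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -out MUS.${REGION}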


The shortest contig is 11,048 bp and the longest is 18,856,181 bp for Ringlet.

###### *Filters*

I'm starting with bam files, so there are already some filters on the mapping quality of the sequences. Prior to that there are some crude filters during the demultiplexing and trimming steps.

Other possible filters: 

-b[filelist]

-remove_bads 1 : remove reads with 255 flag (not primary, failure and duplicate reads) (1=default)

-uniqueOnly 1 : remove reads with multiple best hits

-minMapQ 20 : PHRED 20. This should already be in place during the mapping.

-minQ 20 : PHRED 20 for individual base score.

-only_proper_pairs 0 : NBNB this flag is changed to 0 because some of the reads in my final MUS files are not properly paired!
(The old filter, -only_proper_pairs 1 (the default), includes only properly paired reads and should already have been applied to the museum reads prior to this.)

-trim 0 : We're not trimming any data

-baq 1 : estimate base alignment quality using samtools method. BAQ1 and BAQ2 give me very different final variant counts. BAQ1 is more stringent, but could be removing more loci. BAQ2 has a higher number of false positives. See [here](https://www.biostars.org/p/440490/), and discussion on the ANGSD forum [here](https://github.com/ANGSD/angsd/issues/106) and [here](https://github.com/ANGSD/angsd/issues/97)

### ALLELE FREQUENCY ESTIMATION

-doMajorMinor 4 : Force Major allele based on reference. The minor allele is then inferred using doMajorMinor 1. This option needs to be used when calculating SFS for multiple populations as ANGSD otherwise determines a minor allele within each population. I.e. this may not be the same across all the populations.

-ref [..fasta] : For doMM 4 above we need to specify a reference genome.

-doMaf 10 : calculate minor allele frequency. Estimators are added together, so doMaf 10 = doMaf 2 (known major, unknown minor) + doMaf 8 (directly from counts)

-doPost 1: calculate posterior probabilities using allele frequencies as prior (or 2 for uniform prior, or 3 for SFS prior)

-SNP_pval 0.05 : Remove sites with a p-value larger than [float]. ** I'm using 0.05 here because too many loci are removed from the MUS dataset at more stringent thresholds. (see [04a_ANGSD_testFilters.md](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_testFilters.md))

-GL 1 : I will estimate genotype likelihoods using the SAMtools model

-minInd 18 : I will remove loci where less than 18 individuals have been genotyped. There are 35-40 indivs per group (MUS, MODC, MODE). This number was chosen to ensure we can find MAF of 2-5% (1/(18*2)=2.8%)   

 * I excluded the minInd filter in the final dataset as genotype likelihoods should take this into account. 

-setMinDepth : Discard site if total depth (across all indivs) is below [int]. Use -doCounts to determine the distribution of depths

-setMaxDepth : Discard site if total depth (across all indivs) is above [int]  ##I'll use mean + 2*SD

-setMinDepthInd 2 : Minimum depth for a locus for an individual. This is only applicable for analyses using counts (-doCounts obligatory)

-setMaxDepthInd [int]: I'll use a max of meanDP + 2xSD of depth.

-doCounts 1 : Count of the nucleotide bases per individual

-dumpCounts 2 : write a file with all the allele counts per position per individual

-rmTriallelic 1 : include only biallelic loci

-checkBamHeaders 1 : check that the bam headers are compatible for all files.

#### MaxDepth

To estimate the setMaxDepth filter per population, I first need to write the depth per site in ANGSD. 
Using the prepared SFS script add in the following: 

-dumpCounts 1 ##This will write global depth count per site in a .pos file


Calculate the mean and SD 

pwd
/Users/alexjvr/2020.postdoc/Velocity/E3/ANGSD_FINAL/DepthEstimates/DepthPerBP

MODE.LR761675.1.OCT14.pos.gz  MODC.LR761675.1.OCT14.pos.gz  MUS.LR761675.1.OCT14.pos.gz

R

MODC.pos <- read.table(gzfile("MODC.LR761675.1.OCT14.pos.gz"), header=T)
MODE.pos <- read.table(gzfile("MODE.LR761675.1.OCT14.pos.gz"), header=T)
MUS.pos <- read.table(gzfile("MUS.LR761675.1.OCT14.pos.gz"), header=T)

mean(MODC.pos$totDepth)+(2*(sd(MODC.pos$totDepth)))
[1] 179.6283
mean(MODE.pos$totDepth)+(2*(sd(MODE.pos$totDepth)))
[1] 293.1328
mean(MUS.pos$totDepth)+(2*(sd(MUS.pos$totDepth)))
[1] 80.07606


#### 3a. ANGSD filters for SFS

It's important to include all loci in the SFS. Don't include any filters that'll affect the allele frequencies (e.g. MAF or SNP_pval filters). 

It's also important to polarise the Major/Minor alleles so that they're the same between the populations. 

This is done with -doMajorMinor 4 : Force Major allele based on reference. The minor allele is then inferred using doMajorMinor 1 (i.e. from GL). This option needs to be used when calculating SFS for multiple populations as ANGSD otherwise determines a minor allele within each population. I.e. this may not be the same across all the populations.

The order of the filters does not seem to affect the outcome (see https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_testFilters.md)

To add depth filters we need to add -doCounts 1.

NBNB for MUS data we have to select only_proper_pairs 0 - this is because the processing with MapDamage changes the reads in some way so that they all get filtered out if we select only_proper_pairs 1. They should already be filtered for proper_pairs based on prior mapping filters. 

The [basic process is](https://github.com/ANGSD/angsd/issues/259): 

###### 1. Estimate SAF for each population (unfolded) 

MUS

Need gcc to be loaded

module load languages/gcc-9.1.0

ANGSD located here:

/newhome/aj18951/bin/angsd/angsd

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MUS.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MUS.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 0 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 144 -doMajorMinor 4 -out MUS -C 50 -baq 1

-> Total number of sites analyzed: 4786628
-> Number of sites retained after filtering: 4762287


MODC

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MODC.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MODC.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 328 -doMajorMinor 4 -out MODC -C 50 -baq 1

-> Total number of sites analyzed: 5715589
-> Number of sites retained after filtering: 5621808


MODE

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS/04_SFS/04a_MODE.SAF.sh

/newhome/aj18951/bin/angsd/angsd -b MODE.poplist -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -r LR761675.1: -GL 1 -doSaf 1 -anc ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -ref ../RefGenome/GCA_902806685.1_iAphHyp1.1_genomic.fna -doCounts 1 -setMinDepthInd 2 -setMaxDepth 621 -doMajorMinor 4 -out MODE -C 50 -baq 1

-> Total number of sites analyzed: 5688111
-> Number of sites retained after filtering: 5553623


###### 2. unfolded SAF used to produce folded 2D SFS

Then generate the folded SFS for each population pair per chromosome. For a smaller dataset this can be run on a merged dataset (realSFS cat xx -fold 1 -outnames pop.saf.out), but that is too memory-intensive for the whole-genome datasets.

realSFS pop1.unfolded.saf.idx pop2.unfolded.saf.idx -fold 1 >folded.sfs

~/bin/angsd/misc/realSFS MODC.saf.idx MODE.saf.idx -fold 1 > MODE.MODC.fold.sfs

~/bin/angsd/misc/realSFS MODC.saf.idx MUS.saf.idx -fold 1 > MUS.MODC.fold.sfs


###### 3. unfolded SAF and folded SFS used to generate per-site numerator/denominator of Fst (use -fold in realSFS fst index)

realSFS fst index pop1.unfolded.saf.idx pop2.unfolded.saf.idx -sfs folded.sfs -fold 1 -fstout persite

~/bin/angsd/misc/realSFS fst index MODC.saf.idx MUS.saf.idx -sfs MUS.MODC.fold.sfs -fstout MUS.MODC.fstout
~/bin/angsd/misc/realSFS fst index MODC.saf.idx MODE.saf.idx -sfs MODE.MODC.fold.sfs -fstout MODE.MODC.fstout


###### 4. sum numerator and denominator in windows

realSFS fst stats2 persite.fst.idx -win XXXX -step XXXX > window.fst

~/bin/angsd/misc/realSFS fst stats2 MUS.MODC.fstout.fst.idx -win 50000 -step 10000 > slidingwindow.MUS.MODC
~/bin/angsd/misc/realSFS fst stats2 MODE.MODC.fstout.fst.idx -win 50000 -step 10000 > MODE.MODC.slidingwindow


##### GlobalFST 

1. MODC vs MODE

~/bin/angsd/misc/realSFS fst stats MODE.MODC.fstout.fst.idx

-> FST.Unweight[nObs:5498007]:0.020322 Fst.Weight:0.126046
0.020322 0.126046


2. MODC vs MUS

~/bin/angsd/misc/realSFS fst stats MUS.MODC.fstout.fst.idx
-> Assuming idxname:MUS.MODC.fstout.fst.idx
-> Assuming .fst.gz file: MUS.MODC.fstout.fst.gz
-> FST.Unweight[nObs:4738713]:0.091403 Fst.Weight:0.234494
0.091403 0.234494


##### Diversity stats with ANGSD (theta)

MODE

~/bin/angsd/misc/realSFS MODE.saf.idx -fold 1 > MODE.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MODE.saf.idx -sfs MODE.fold.sfs -outname MODE.fold.thetas

MODC

~/bin/angsd/misc/realSFS MODC.saf.idx -fold 1 > MODC.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MODC.saf.idx -sfs MODC.fold.sfs -outname MODC.fold.thetas

MUS

~/bin/angsd/misc/realSFS MUS.saf.idx -fold 1 > MUS.fold.sfs
~/bin/angsd/misc/realSFS saf2theta MUS.saf.idx -sfs MUS.fold.sfs -outname MUS.fold.thetas


##### Depth vs Fst and Depth vs nucleotide diversity

I'm concerned that the sequencing depth of the samples might affect the final Fst and nucleotide diversity estimates. This is particularly problematic given the difference in sequencing depth between museum and modern samples. Here I plot depth vs Fst and depth vs nucleotide diversity for each of the datasets. 

Nucleotide diversity and Fst have been estimated above. 

###### 1. Obtain depth estimates: 

ANGSD estimates depth using the flags -doCounts 1 -doDepth 1 -dumpCounts 2:

.counts.gz = depth per individual per site. Columns = indivs, Rows = sites
.pos.gz = totalDepth per site across all indivs. 

See [03_DepthEstimate.md](https://github.com/alexjvr1/Velocity2020/blob/master/03_DepthEstimate.md) for my estimates of depth across the different datasets and how I chose my minDP and maxDP filters. 

#### 2. Call GL

Test dataset: 
GL are called using the samtools method (GL1), and we're only allowing SNPs with a p-value of 0.001 for modern samples, and 0.05 for museum samples. 

Total sites analyzed and kept for each dataset: 

MOD Core p0.001 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODC.ANGSD1.ARRAY.o9325111-* | awk -F "analyzed:" '{s+=$2} END {print s}' 381121498

grep "retained" E3_MODC.ANGSD1.ARRAY.o9325111-* | awk -F "filtering:" '{s+=$2} END {print s}' 6645193


MOD Exp p0.001 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODE.ANGSD1.ARRAY.o9325110-* | awk -F "analyzed:" '{s+=$2} END {print s}' 379718544

grep "retained" E3_MODE.ANGSD1.ARRAY.o9325110-* | awk -F "filtering:" '{s+=$2} END {print s}' 5148017


*p0.05 per SNP*

MOD Core p0.05 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODC.ANGSD1.ARRAY.o9541530-* | awk -F "analyzed:" '{s+=$2} END {print s}' 381121498 ##OLD p0.001 381142712

grep "retained" E3_MODC.ANGSD1.ARRAY.o9541530-* | awk -F "filtering:" '{s+=$2} END {print s}' 6645193 ##OLD p0.001 7529770


MOD Exp p0.05 per SNP

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_MODE.ANGSD1.ARRAY.o9541531-* | awk -F "analyzed:" '{s+=$2} END {print s}' 379718544 ##old p0.001 379785246

grep "retained" E3_MODE.ANGSD1.ARRAY.o9541531-* |awk -F "filtering:" '{s+=$2} END {print s}' 5148017 ##old p0.001 6265127


MUS

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/03a_ANGSD_SFS/logfiles

grep "Total number of sites analyzed" E3_ANGSD1.ARRAY.o9403374* | awk -F "analyzed:" '{s+=$2} END {print s}' 346540624

grep "retained" E3_ANGSD1.ARRAY.o9403374* | awk -F "filtering:" '{s+=$2} END {print s}' 23542320


ANGSD is run across all indivs for each contig/scaffold separately. I then need to merge all of the data into a single file.

module load languages/gcc-6.1

~/bin/angsd/misc/realSFS cat BA.HOD.*idx -outnames MUS.MERGED


#### 2d. ISSUES

##### *1. Input bam files*

I've had problems reading merged bam files created by BBmerge into ANGSD. I've reported the issue [here](https://github.com/ANGSD/angsd/issues/260) - it seems to be the same problem encountered by someone using input from minimap2.

[bammer_main] 1 samples in 1 input files
-> Parsing 1 number of samples
No data for chromoId=0 chromoname=LR761647.1
This could either indicate that there really is no data for this chromosome
Or it could be problem with this program
regSize=0 notDone=0
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"angsdput.arg"
-> Fri Apr 17 16:45:14 2020
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 0
-> Number of sites retained after filtering: 0
[ALL done] cpu-time used = 4.24 sec
[ALL done] walltime used = 4.00 sec


The developers have fixed a bug in the program. In addition I need to change the -only_proper_pairs flag to 0 as there are improperly paired reads in the merged file (i.e. reads located on different scaffolds). 

#### 3. Compare downsampled data to full dataset for modern pops (SFS and GL)

### 3.2. CALL VARIANTS

Call SNPs for one chromosome using bcftools mpileup and call

1. Index the reference genome

This creates a .fai file that lists each chromosome's name and length (plus the byte offsets samtools uses for random access).

module load apps/samtools-1.9.1
samtools faidx RefGenome/*fna


2. Create a folder for calling variants for each population (modern.exp and modern.core are called together)

pwd
$HOME/$SPECIES/

mkdir 03.2_Variant_calling_modern
mkdir 03.2_Variant_calling_museum


3. Copy the variant calling script to each folder

Script here: [03a_variant_calling_bluecp3.sh](https://github.com/alexjvr1/Velocity2020/blob/master/03a_variant_calling_bluecp3.sh)

a) Edit the variables in the script for each species and population

b) Change permissions to make the script executable in your home directory. This should change the colour of the script in ls.

chmod u+x 03a_variant_calling_bluecp3.sh


c) And run 

./03a_variant_calling_bluecp3.sh


This will create a regions file with all the chromosomes and scaffolds (obtained from the .fai indexed reference file), and a submission script (*.sh) that we'll submit to the server.

d) Check the length of the regions file.

Delete all scaffolds (we're only interested in the chromosomes for now), and add a header to the file.

e) Change the array range in the variant calling script to match the number of chromosomes.

This script calls variants for all individuals simultaneously per chromosome. 
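
For reference, the per-chromosome call that the generated submission script runs is of this general form (a sketch; the region, bam list and output names are assumptions):

module load apps/bcftools-1.8
bcftools mpileup -f RefGenome/*fna -b bamlist.modern -r LR761675.1 | \
bcftools call -mv -O z -o 03.2_Variant_calling_modern/LR761675.1.vcf.gz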

f) submit to queue and check that it's running properly

qstat -u username -t


### 4. ANALYSES

Example papers

[Crested Ibis (Feng et al, Current biology, 2019)](https://www.sciencedirect.com/science/article/pii/S0960982218316099)

[Alpine chipmunks (Bi et al, PLoS Genetics, 2019)](https://journals.plos.org/plosgenetics/article?rev=2&id=10.1371/journal.pgen.1008119)

- Also good example of mapdamage use

- PopGen

- outFlank

[Dryas Monkey (van der Valk, MBE, 2020)](https://academic.oup.com/mbe/article/37/1/183/5570178)

[Darwin's finches (Pacheco, GBE, 2020)](https://academic.oup.com/gbe/article/12/3/136/5735467)

- example of using ANGSD to call GL

- diversity stats

[Lynx (Lucena-Perez, MolEcol, 2020)](https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15366?casa_token=I6nUZVYLpjAAAAAA%3AjGkueGUyt5AP-RdrD2GeejFl-U2BXVlrXqmYLcck4lOfFy4Yb2jMv3r6ZxXZ9WhGYDcmVDM0oJ8l)

- Genotype calls

- Pop structure

- Diversity measures

[Killer whale ROH (BioRxiv, Andy Foote & Excoffier)](https://www.biorxiv.org/content/10.1101/2020.04.08.031344v1.abstract)

- ROH

#### 4a. Diversity stats

##### 1. Tajima's D and Watterson's theta

We can calculate Tajima's D and Watterson's theta as [diversity stats in ANGSD](http://www.popgen.dk/angsd/index.php/Thetas,Tajima,Neutrality_tests) after estimating the SFS for each population. When using a folded SFS (as we have here), only Tajima's D and Watterson's theta will be meaningful, despite several other stats being calculated by these scripts:

1. Calculate the SFS from the SAF. Use -fold 1 for a folded SFS when the ancestral state is unknown

~/bin/angsd/misc/realSFS NAME.saf.idx -fold 1 > NAME.saf.sfs

or use the script below and pass the names on the command line, where $1 = input SAF and $2 = output file name:

qsub 04a_ANGSD_realSFS_cmdlineInputs.sh -F "NAME.saf.idx NAME.saf.sfs"

2. Estimate thetas from SFS

~/bin/angsd/misc/realSFS saf2theta NAME.saf.idx -sfs NAME.saf.sfs -outname NAME.THETA.theta.gz

3. Estimate thetas in windows from the above global theta file

~/bin/angsd/misc/thetaStat do_stat NAME.theta.idx -win 50000 -step 10000 -outnames NAME.THETA.window.gz

4. Estimate theta per chromosome

~/bin/angsd/misc/thetaStat do_stat NAME.theta.idx


###### Scripts: 

Step1: [04a_ANGSD_realSFS_cmdlineInputs.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_realSFS_cmdlineInputs.sh)

##### 2. observed heterozygosity

Comparison of the number of heterozygous sites in historic vs current datasets

Calculate this in ANGSD

###### 1. Estimate the SFS for each individual separately by modifying the script [04a_ANGSD_INDIV.SAF.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_INDIV.SAF.sh) to first estimate SAF

###### 2. Estimate individual SFS from SAF 

Create a list of all the idx files for each population

cd HET.per.INDIV
ls MUS*idx >> MUS.idx.names
ls MODE*idx >> MODE.idx.names
ls MODC*idx >> MODC.idx.names
ls MUSbaq2*idx >> MUS.idx.baq2.names ##I compared the MUS datasets with the two BAQ settings


Modify the script [04a_ANGSD_SFS.INDIV.sh](https://github.com/alexjvr1/Velocity2020/blob/master/04a_ANGSD_SFS.INDIV.sh) and run in the same folder for each population

###### 3. Concat all files together and read into R and plot

/newhome/aj18951/E3_Aphantopus_hyperantus_2020/04a_ANGSD_TESTS

cat MUS*sfs >> MUS.ALL.sfs
cat MODC*sfs >> MODC.ALL.sfs
cat MODE*sfs >> MODE.ALL.sfs
cat MUSbaq2*sfs >> MUS.ALL.baq2.sfs

copy to mac

scp bluecp3:/newhome/aj18951/E3/04/HET.per.INDIV/*ALL.sfs .

in R

a<-read.table("MODC.ALL.sfs", header=F) b<-read.table("MODE.ALL.sfs", header=F) c<- read.table("MUS.ALL.sfs", header=F) d<-read.table("MUS.ALL.baq2.sfs", header=F) e<-read.table("MODC.ALL.baq2.sfs", header=F) f<-read.table("MODE.ALL.baq2.sfs", header=F)

colnames(a) <- c("Hom1", "Het", "Hom2") a$pop <- "MODC" colnames(b) <- c("Hom1", "Het", "Hom2") b$pop <- "MODE" colnames(c) <- c("Hom1", "Het", "Hom2") c$pop <- "MUS" colnames(d) <- c("Hom1", "Het", "Hom2") d$pop <- "MUS.baq2" colnames(e) <- c("Hom1", "Het", "Hom2") e$pop <- "MODC.baq2" colnames(f) <- c("Hom1", "Het", "Hom2") f$pop <- "MODE.baq2"

abc <- rbind(a,b,c,d,e,f)
abc$HetFreq <- abc$Het/(abc$Hom1+abc$Het+abc$Hom2)

library(ggplot2)
ggplot(abc, aes(pop, HetFreq, colour=pop)) + geom_boxplot()

Find the heterozygosity for each individual and paste into Table:Ringlet_DiversityStats.xlsx on Dropbox


##### 2b. Depth vs Fst

Estimate the SFS for all three population pairs, and then estimate Fst:

Pop1=MODC

Pop2=MUS

Pop3=MODE

04a_ANGSD_FINAL/SFS_and_Fst

module load languages/gcc-6.1
~/bin/angsd/misc/realSFS fst index MUS/MUS.LR761675.1.saf.idx MODE/MODE.LR761675.1.saf.idx -sfs MUS.MODE.LR75.test.fold.sfs -fstout MUS.MODE.fstout
~/bin/angsd/misc/realSFS fst stats MUS.MODE.fstout.fst.idx

-> Assuming idxname:MUS.MODE.fstout.fst.idx
-> Assuming .fst.gz file: MUS.MODE.fstout.fst.gz
-> FST.Unweight[nObs:4722251]:0.017379 Fst.Weight:0.314412

0.017379 0.314412

~/bin/angsd/misc/realSFS fst index MODC/MODC.LR761675.1.saf.idx MUS/MUS.LR761675.1.saf.idx MODE/MODE.LR761675.1.saf.idx -sfs MODC.MUS.LR75.test.fold.sfs -sfs MODC.MODE.LR75.test.fold.sfs -sfs MUS.MODE.LR75.test.fold.sfs -fstout 3pops.fstout

~/bin/angsd/misc/realSFS fst stats 3pops.fstout.fst.idx

~/bin/angsd/misc/realSFS fst stats2 3pops.fstout.fst.idx -win 50000 -step 10000 > 3pops.slidingwindow.fst

The above script provides a window-based estimate of Fst across each chromosome. 

The depth estimates need to be obtained by running samtools depth on the original bamfiles. The following script writes a depth estimate for each individual at each site:

#!/bin/bash
###########################################
# (c) Alexandra Jansen van Rensburg
# last modified 12/07/2019 05:49
###########################################

# Estimate per-site depth for all bamfiles listed in bamlist

#PBS -N E1.index ##job name
#PBS -l nodes=1:ppn=1 #nr of nodes and processors per node
#PBS -l mem=16gb #RAM
#PBS -l walltime=10:00:00 ##wall time
#PBS -j oe #concatenates error and output files (with prefix job1)
#PBS -t 1-48

#run job in working directory
cd $PBS_O_WORKDIR

#load modules
module load apps/samtools-1.8

#Set up array
NAME=$(sed "${PBS_ARRAYID}q;d" bamlist)

#Run script
echo "Output depth stats for ${NAME}"
printf "\n"
echo "time samtools depth ${NAME} > ${NAME}.depth"
time samtools depth ${NAME} > ${NAME}.depth


Here I'm extracting chromosome LR..75 and estimating individual depth: 

#!/bin/bash
###########################################
# (c) Alexandra Jansen van Rensburg
# last modified 12/07/2019 05:49
###########################################

# Extract chromosome LR761675.1 and estimate depth for all bamfiles listed in bamlist

#PBS -N LR75 ##job name
#PBS -l nodes=1:ppn=1 #nr of nodes and processors per node
#PBS -l mem=16gb #RAM
#PBS -l walltime=10:00:00 ##wall time
#PBS -j oe #concatenates error and output files (with prefix job1)
#PBS -t 1-48

#run job in working directory
cd $PBS_O_WORKDIR

#load modules
module load apps/samtools-1.8

#Set up array
NAME=$(sed "${PBS_ARRAYID}q;d" bamlist)

#Run script
#&& ensures each command completes before the next starts
echo "Output depth stats for ${NAME}"
printf "\n"
echo "time samtools view -b ${NAME} LR761675.1 > ${NAME}.LR75.bam" && \
time samtools view -b ${NAME} LR761675.1 > ${NAME}.LR75.bam && \
echo "time samtools index ${NAME}.LR75.bam" && \
time samtools index ${NAME}.LR75.bam && \
echo "time samtools depth ${NAME}.LR75.bam > ${NAME}.LR75.depth" && \
time samtools depth ${NAME}.LR75.bam > ${NAME}.LR75.depth


For the modern samples I'm using the *flt.bam* files as they have been filtered for PHRED quality scores > 20.

We want to plot variance in depth per site vs Fst and nucleotide diversity. So we need to join each of the depth files together within each pop (e.g. a MUS depth file), where each column is an individual, and each row is a site. At the same time we need to add missing data or gaps where loci do not co-occur between populations. This can all be done in R: 

Multi-merge script from [here](https://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/)
Modified to read tables and to merge (all=T) by including NA for all the missing data (otherwise only the intersecting data will end up in the final file). 

multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.table(file=x, header=T)})
  Reduce(function(x,y) {merge(x, y, all=T)}, datalist)
}

After running the code to define the function, you are all set to use it. The function takes a path. This path should be the name of a folder that contains all of the files you would like to read and merge together, and only those files. With this in mind, I have two tips:

1. Before you use this function, my suggestion is to create a new folder in a short directory (for example, the path for this folder could be "C://R//mergeme") and save all of the files you would like to merge in that folder.

2. In addition, make sure that the column that will do the matching is formatted the same way (and has the same name) in each of the files.


##### 3. nucleotide diversity in windows

ThetaD from ANGSD window-based approach

Fig 3a in Feng et al: Distribution of nucleotide diversity across chromosomes in old vs new in 5Mb windows using nuc.div function in pegas

##### 4. Runs of Homozygosity 

Runs of homozygosity: Dryas monkey MS

Convert output to plink genotype calls: http://www.popgen.dk/angsd/index.php/Plink

Only use indivs with mean depth >> 3x? 

Use Plink to look for ROH in sliding window blocks of 50SNPs with one SNP per xkb (50kb?)
"allow for a maximum of one heterozygous and five missing calls per window before we considered the ROH to be broken"

##### 5. IBD in windows across chromosomes

Fig 3b in Feng et al: Distribution of IBD across chromosomes

##### 6. Deleterious load

Deleterious load: See Feng et al. 2019 method

#### 4b. Population structure

NJ phylogeny (Feng et al)

PCA [using ANGSD](https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.5231). 

Use [PCAngsd](http://www.popgen.dk/software/index.php/PCAngsd) for this. We need Beagle output from ANGSD

For this we want to estimate genotype likelihoods (-GL 1) with a confidence filter (-SNP_pval xx) and Beagle output (-doGlf 2)

Basic options used: 

time $angsd -b $POPLIST -checkBamHeaders 1 -minQ 20 -minMapQ 20 -uniqueOnly 1 -remove_bads 1 \
-only_proper_pairs 1 -r ${REGION} -GL 1 -doGlf 2 -out PCAngsd/GL.ALL.${REGION} \
-doSaf 1 -ref ../RefGenome/*fna -anc ../RefGenome/*fna -rmTriallelic 1 \
-doCounts 1 -dumpCounts 2 -doMajorMinor 4 -doMaf 1 -skipTriallelic 1 \
-setMinDepthInd $minDP -setMaxDepth $MAXDP -baq 1 -C 50 \
-SNP_pval 0.001 -doGeno 32 -doPost 1

Where:

POPLIST = all of the bam files for all of the populations
REGION = chromosome, run as an array script
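
Once the Beagle GL file exists, PCAngsd itself is run on it, along these lines (a sketch; -beagle/-o/-threads are PCAngsd's documented options, the file names are assumptions). The output .cov covariance matrix is then eigendecomposed (e.g. with eigen() in R) to get the PCs:

python pcangsd.py -beagle PCAngsd/GL.ALL.beagle.gz -o PCAngsd/GL.ALL.pcangsd -threads 4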


I initially called genotypes for all 3 populations together, but as the GLs are then based on all bam files, I ended up with equal likelihoods for all three genotypes for the majority of the loci. This approach is also fundamentally flawed as the likelihood of the museum data cannot be based on the modern data. 

My new approach is to estimate the GL files for each of the populations separately and then combine the GL files for the loci found in all three datasets. 

I will try this with different filter sets. The -doMM 1 and -doMaf 1 should be the strictest settings. 

1. p-val 0.01, doMajorMinor 4, doMaf 2

2. p-val 0.05; -doMM 1; -doMaf 2

3. p-val 0.05; -doMM 4; -doMaf 1

4. p-val 0.05; -doMM 1; -doMaf 1

5. p-val 0.01; -doMM 1; -doMaf 1

-doMaf 1 vs 2 

-doMM 1 vs 4

Before concatenating these files I checked the proportion of equal likelihood genotypes in each dataset. This seemed unaffected by the filter set, and was high for all three datasets. 

These are results for LR761675.1

MODC: ~50k loci; 38% equal likelihoods

MODE: ~57k loci; 29% equal likelihoods

MUS: ~5.5k loci; 58% equal likelihoods

Count the number of loci. LR for chromosomes, CADX for contigs.

grep LR *beagle |wc -l

Check the number of indivs in the beagle file by looking at the top of the file

Total number of GLs = #loci x #indivs x 3

Count number of occurrences of equal likelihood

grep -i -o "0.333" *beagle |wc -l

frequency = #occurrences/Total GLs


##### Filter Beagle GLs

https://faculty.washington.edu/browning/beagle_utilities/utilities.html

If a genotype has less than xx likelihood it's marked as missing data. But how does this then affect the PCA? 

#### 4c. Outliers: map and identify candidates in the region

PCAngsd as outlier approach?? Does this make sense? 

##### Fst in windows to find outlying regions in ANGSD

ANGSD generates a ML SFS from genotype likelihoods for each population. It then uses the SFS as a prior in an empirical Bayes approach to estimate the posterior probability for all possible allelic frequencies at each site. (Method: Fumagalli et al. 2013) 

Plot time vs space Fst plots for each chromosome. I'm using the Ringlet chromosome names as in the genome MS. Each chromosome is shown in two plots: plot 1 = MODC vs MODE, plot 2 = MODC vs MUS.

Modern graphs

p1 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761647.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("Mod Core vs Mod Expanding LR761647.1") + theme(axis.title.x = element_blank())

p2 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761648.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("Mod Core vs Mod Expanding LR761648.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p3 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761649.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761649.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p4 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761650.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761650.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p5 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761651.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761651.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p6 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761652.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761652.1") + xlim(0,18830000) + theme(axis.title.x = element_blank())

p7 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761653.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761653.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p8 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761654.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761654.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p9 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761655.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761655.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p10 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761656.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761656.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p11 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761657.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761657.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p12 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761658.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761658.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p13 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761659.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761659.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p14 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761660.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761660.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p15 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761661.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761661.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p16 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761662.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761662.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p17 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761663.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761663.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p18 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761664.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761664.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p19 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761665.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761665.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p20 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761666.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761666.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p21 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761667.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761667.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p22 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761668.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761668.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p23 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761669.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761669.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p24 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761670.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761670.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p25 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761671.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761671.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p26 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761672.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761672.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p27 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761673.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761673.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p28 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761674.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761674.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())

p29 <- ggplot(fstMODC.MODE[which(fstMODC.MODE$chr=="LR761675.1"),], aes(x=midpoint, y=fst)) + geom_point() + ggtitle("LR761675.1") + xlim(0,18830000)+ theme(axis.title.x = element_blank())



###### Synteny between modern and museum samples. 

BLAST with FlyBase & Enrichment analysis using PANTHER  

See [this](https://onlinelibrary.wiley.com/doi/full/10.1111/mec.15188?casa_token=X-WHMot7TDcAAAAA%3Abn7IwwiinA44JDoEU-yuVV3iLk4RkXwcCU1av3_hKRG1hgKDNaCzPHbrEGlRCBk5j8bMcIW6ynjT) example paper

#### 4d. LD analyses

Estimate LD with ngsLD (https://github.com/fgvieira/ngsLD) and plot r2 decay using its fit_LDdecay.R script.

Can use GL as input

See pigeon paper 

#### 4e.