eriqande opened this issue 2 years ago
- `bcf_file`
- `scaff_grp`
- `sample_subset`
- `bcftools_opts`
We could use a decent test data set for this. I have 480-something chinook, filtered to maf 0.01 that is about 31 Gb, and has around 14 million variants in it. If we sampled at a rate of 0.0068 we would end up with about 100,000 markers, but that would still be about 210 Mb. So, let's cut it down to 10,000 markers and around 22 Mb.
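The arithmetic above, as a quick back-of-the-envelope sketch (numbers copied from the text; this is not part of the workflow itself):

```python
# Rough subsampling arithmetic for picking a vcfrandomsample rate.
TOTAL_VARIANTS = 14_000_000  # approximate count quoted above

def sampling_rate(target_markers, total_variants=TOTAL_VARIANTS):
    """Rate at which to randomly keep variants to hit a target count."""
    return target_markers / total_variants

# ~100k markers needs a rate around 0.0071; ~10k needs around 0.00071,
# which is where the -r 0.00068 below comes from.
r100k = sampling_rate(100_000)
r10k = sampling_rate(10_000)
```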
mamba create -n vcflib-1.0.3 -c bioconda vcflib
conda activate vcflib-1.0.3
module load bio/bcftools
bcftools view pass-maf-0.01.bcf | vcfrandomsample -r 0.00068 | bcftools view -Ob > /tmp/test.bcf
Ha! That is abominably slow! Granted, it is streaming through the whole file to pick out fewer than 1 in 1,000 variants.
So, while waiting for that, I try it a different way:
# print every 1000-th marker to a file:
bcftools query -f '%CHROM\t%POS\n' pass-maf-0.01.bcf | awk 'NR%1000==0' > /tmp/get-these.txt
# that should get us about 14,000 markers.
# then take only half of those (down to about 7,000 markers):
awk 'NR%2==0' /tmp/get-these.txt > /tmp/markers.txt
# and then we can access them via the index with the -R option:
bcftools view -Ob -R /tmp/markers.txt pass-maf-0.01.bcf > ~/test.bcf
That got done while the vcflib streaming approach was still chugging through the file.
The resulting BCF file is 19 Mb. Perhaps still too big for a real .test data set, but I can start playing with it now on my laptop, and on SEDNA.
I am going to make these from the chromosomes and scaff groups.
library(tidyverse)

chrom <- read_tsv("~/Documents/projects/yukon-chinookomes/chromosomes.tsv") %>%
  mutate(id = as.character(1:n()), .before = chrom) %>%
  rename(stop = num_bases)

sg <- read_tsv("~/Documents/projects/yukon-chinookomes/scaffold_groups.tsv") %>%
  select(id, chrom, len) %>%
  rename(stop = len)

both <- bind_rows(chrom, sg) %>%
  mutate(
    id2 = sprintf("scaff_grp_%04d", as.integer(factor(id, levels = unique(id))))
  ) %>%
  mutate(id = id2) %>%
  mutate(start = 1) %>%
  select(id, chrom, start, stop)

write_tsv(both, ".test/bcf/test.bcf.scaff_groups.tsv")
I decided that it would be better to have all the filtering up front in bcftools, and then we can do thinning from there. Complete rewrite, but I think it will serve us better.
So, the idea here is that we have a single file, bcfs.tsv, that gives the columns ({bcf_id} among them). The {bcf_id} is what we use to name all these in the paths. So, an id should be something informative, like biallelic-SNPs-32-Yukon-Chinook-maf-0.05, or whatever. We don't rely on that id to give us all the information, because there will be an info directory that has that information in it. Then, thinned versions of this get made downstream.
So, the output directory structure will look something like this:
results/bcf # top level directory into which all of these bcf files will go
└── {bcf_id}
├── info
│ ├── info.txt # tsv of the different options that created this file like: tag:\titem
│ ├── bcf_stats.txt # the bcftools stats output from main
│ ├── positions.tsv.gz # gzipped CHROM\tPOS
│ ├── sample_subset.txt # The sample subset file that was applied
│ └── samples.txt # the actual samples in the files obtained by bcftools query -l
│ └── scaffold_groups.tsv # Copy the scaffold_groups file here so it is all self-contained for thinning later
├── main.bcf # The actual bcf file
├── main.bcf.csi # the index of the bcf file
├── sections # a temp() directory that is used to scatter-gather the whole thing
└── thinned # a directory to hold the thinned versions of the BCF
├── 10000_5000 # this means every 10000-th variant was retained, starting at variant number 5000
│ ├── info # same structure as info above
│ ├── main.bcf
│ ├── main.bcf.csi
│ └── sections
└── 1078_50
├── info
├── main.bcf
├── main.bcf.csi
└── sections
Some notes: the scattered section files go to results/bcf/{bcf_id}/sections/{scaff_group}, with extensions bcf, bcf_stats.txt, and positions.tsv.gz. The thinned stuff happens later, requesting the bcf_id specifically.

This file is now going to be updated to have these columns:
While working on this and debugging, here are some snakemake invocations I have been using:
# for bcftools_subsamp_and_filt_scatter
snakemake -np --cores 1 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/sections/scaff_grp_0001.scaff_members.tsv --configfile .test/config/config.yaml
# for bcftools_subsamp_and_filt_gather
snakemake -p --cores 8 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/main.bcf --configfile .test/config/config.yaml
More notes:
awk -v id='{wildcards.bcf_id}' -f workflow/scripts/tsv_row2yaml.awk {input.bcf_tsv} > {output.info}
That doesn't work great if the bcfs.tsv file has Windows line endings. Lame!
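One way around that is to normalize line endings before the awk script ever sees the file. A minimal sketch (the function name is mine, not part of the workflow):

```python
from pathlib import Path

def strip_crlf(path):
    """Rewrite a text file in place with Unix line endings.

    Works on bytes so Python's universal-newline translation
    doesn't hide what is actually in the file.
    """
    data = Path(path).read_bytes()
    Path(path).write_bytes(data.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))
```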
I am changing the paths so that they go like this:
results/bcf/{bcf_id}/thin_0_0/main.bcf
The reason is that this will make it easier to deal with cases with and without thinning. Because then we can request files like:
results/bcf/{bcf_id}/thin_0_0/main.bcf
# and
results/bcf/{bcf_id}/thin_10000_500/main.bcf
using the same syntax, etc.
So, now the resulting directory tree from scatter-gathering the filtering of bcfs looks like this:
results/
├── bcf
│ └── TEST-Biallelic-SNPS-5-fish-maf-0.1
│ └── thin_0_0
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
Of course, having done that, we will need to constrain the {thin_int} and {thin_start} wildcards to never be "0".
Indeed! Snakemake barked an error, so I now have a regex demanding that they be positive integers.
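A constraint of the kind described would look like this (the exact regex in the workflow may differ; this is an assumed form, checked here with plain `re`):

```python
import re

# {thin_int} and {thin_start} must be positive integers, so the thin_0_0
# rule never collides with the thinning rule.
POSITIVE_INT = re.compile(r"[1-9][0-9]*")

def is_positive_int(s):
    return POSITIVE_INT.fullmatch(s) is not None
```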
So, I was running it through various thinnings. Let's say I want to make a bcf with 5 indivs in it, and then I want to subsample the variants, taking every 6th or every 10th, and, for each of those, starting on variant 1 or 4. Then I could do this:
snakemake -np --cores 8 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{6,10}_{1,4}/main.bcf --configfile .test/config/config.yaml
which gives me this:
job count min threads max threads
--------------------------------- ------- ------------- -------------
bcftools_subsamp_and_filt_gather 1 1 1
bcftools_subsamp_and_filt_scatter 41 1 1
bcftools_thin_gather 4 1 1
bcftools_thin_scatter 164 1 1
total 210 1 1
When I run it, this is what results/bcf
looks like afterward:
results/bcf
└── TEST-Biallelic-SNPS-5-fish-maf-0.1
├── thin_0_0
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ ├── main.bcf.csi
│ └── sections
│ ├── scaff_grp_0001.positions.tsv.gz
│ ├── scaff_grp_0002.positions.tsv.gz
│ ├── scaff_grp_0003.positions.tsv.gz
│ ├── scaff_grp_0004.positions.tsv.gz
│ ├── scaff_grp_0005.positions.tsv.gz
│ ├── scaff_grp_0006.positions.tsv.gz
│ ├── scaff_grp_0007.positions.tsv.gz
│ ├── scaff_grp_0008.positions.tsv.gz
│ ├── scaff_grp_0009.positions.tsv.gz
│ ├── scaff_grp_0010.positions.tsv.gz
│ ├── scaff_grp_0011.positions.tsv.gz
│ ├── scaff_grp_0012.positions.tsv.gz
│ ├── scaff_grp_0013.positions.tsv.gz
│ ├── scaff_grp_0014.positions.tsv.gz
│ ├── scaff_grp_0015.positions.tsv.gz
│ ├── scaff_grp_0016.positions.tsv.gz
│ ├── scaff_grp_0017.positions.tsv.gz
│ ├── scaff_grp_0018.positions.tsv.gz
│ ├── scaff_grp_0019.positions.tsv.gz
│ ├── scaff_grp_0020.positions.tsv.gz
│ ├── scaff_grp_0021.positions.tsv.gz
│ ├── scaff_grp_0022.positions.tsv.gz
│ ├── scaff_grp_0023.positions.tsv.gz
│ ├── scaff_grp_0024.positions.tsv.gz
│ ├── scaff_grp_0025.positions.tsv.gz
│ ├── scaff_grp_0026.positions.tsv.gz
│ ├── scaff_grp_0027.positions.tsv.gz
│ ├── scaff_grp_0028.positions.tsv.gz
│ ├── scaff_grp_0029.positions.tsv.gz
│ ├── scaff_grp_0030.positions.tsv.gz
│ ├── scaff_grp_0031.positions.tsv.gz
│ ├── scaff_grp_0032.positions.tsv.gz
│ ├── scaff_grp_0033.positions.tsv.gz
│ ├── scaff_grp_0034.positions.tsv.gz
│ ├── scaff_grp_0035.positions.tsv.gz
│ ├── scaff_grp_0036.positions.tsv.gz
│ ├── scaff_grp_0037.positions.tsv.gz
│ ├── scaff_grp_0038.positions.tsv.gz
│ ├── scaff_grp_0039.positions.tsv.gz
│ ├── scaff_grp_0040.positions.tsv.gz
│ └── scaff_grp_0041.positions.tsv.gz
├── thin_10_1
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
├── thin_10_4
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
├── thin_6_1
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
└── thin_6_4
├── info
│ ├── bcf_stats.txt
│ ├── info.txt
│ ├── positions.tsv.gz
│ ├── sample_subset_file.txt
│ ├── samples.txt
│ └── scaffold_groups.tsv
├── main.bcf
└── main.bcf.csi
I can't have the positions files of the sections of the thin_0_0 directory be temporary because they will always be useful for future thinning. So, I just kept them there. That is pretty spot on!
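For reference, my reading of the thin_{k}_{s} semantics (every k-th variant retained, starting at variant number s, 1-based) can be sketched like so; this is an illustration, not the workflow's actual code:

```python
def thin(variants, k, s):
    """Keep variants s, s+k, s+2k, ... from an ordered list (1-based)."""
    return [v for i, v in enumerate(variants, start=1)
            if i >= s and (i - s) % k == 0]
```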
The scaffold-group membership files now go in results/bcf/{bcf_id}/scaff_members/.
Now the directory structure after our canonical run looks like this:
results/bcf
└── TEST-Biallelic-SNPS-5-fish-maf-0.1
├── scaff_members
│ ├── scaff_grp_0001.scaff_members.tsv
│ ├── scaff_grp_0002.scaff_members.tsv
│ ├── scaff_grp_0003.scaff_members.tsv
│ ├── scaff_grp_0004.scaff_members.tsv
│ ├── scaff_grp_0005.scaff_members.tsv
│ ├── scaff_grp_0006.scaff_members.tsv
│ ├── scaff_grp_0007.scaff_members.tsv
│ ├── scaff_grp_0008.scaff_members.tsv
│ ├── scaff_grp_0009.scaff_members.tsv
│ ├── scaff_grp_0010.scaff_members.tsv
│ ├── scaff_grp_0011.scaff_members.tsv
│ ├── scaff_grp_0012.scaff_members.tsv
│ ├── scaff_grp_0013.scaff_members.tsv
│ ├── scaff_grp_0014.scaff_members.tsv
│ ├── scaff_grp_0015.scaff_members.tsv
│ ├── scaff_grp_0016.scaff_members.tsv
│ ├── scaff_grp_0017.scaff_members.tsv
│ ├── scaff_grp_0018.scaff_members.tsv
│ ├── scaff_grp_0019.scaff_members.tsv
│ ├── scaff_grp_0020.scaff_members.tsv
│ ├── scaff_grp_0021.scaff_members.tsv
│ ├── scaff_grp_0022.scaff_members.tsv
│ ├── scaff_grp_0023.scaff_members.tsv
│ ├── scaff_grp_0024.scaff_members.tsv
│ ├── scaff_grp_0025.scaff_members.tsv
│ ├── scaff_grp_0026.scaff_members.tsv
│ ├── scaff_grp_0027.scaff_members.tsv
│ ├── scaff_grp_0028.scaff_members.tsv
│ ├── scaff_grp_0029.scaff_members.tsv
│ ├── scaff_grp_0030.scaff_members.tsv
│ ├── scaff_grp_0031.scaff_members.tsv
│ ├── scaff_grp_0032.scaff_members.tsv
│ ├── scaff_grp_0033.scaff_members.tsv
│ ├── scaff_grp_0034.scaff_members.tsv
│ ├── scaff_grp_0035.scaff_members.tsv
│ ├── scaff_grp_0036.scaff_members.tsv
│ ├── scaff_grp_0037.scaff_members.tsv
│ ├── scaff_grp_0038.scaff_members.tsv
│ ├── scaff_grp_0039.scaff_members.tsv
│ ├── scaff_grp_0040.scaff_members.tsv
│ └── scaff_grp_0041.scaff_members.tsv
├── thin_0_0
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ ├── main.bcf.csi
│ └── sections
│ ├── scaff_grp_0001.positions.tsv.gz
│ ├── scaff_grp_0002.positions.tsv.gz
│ ├── scaff_grp_0003.positions.tsv.gz
│ ├── scaff_grp_0004.positions.tsv.gz
│ ├── scaff_grp_0005.positions.tsv.gz
│ ├── scaff_grp_0006.positions.tsv.gz
│ ├── scaff_grp_0007.positions.tsv.gz
│ ├── scaff_grp_0008.positions.tsv.gz
│ ├── scaff_grp_0009.positions.tsv.gz
│ ├── scaff_grp_0010.positions.tsv.gz
│ ├── scaff_grp_0011.positions.tsv.gz
│ ├── scaff_grp_0012.positions.tsv.gz
│ ├── scaff_grp_0013.positions.tsv.gz
│ ├── scaff_grp_0014.positions.tsv.gz
│ ├── scaff_grp_0015.positions.tsv.gz
│ ├── scaff_grp_0016.positions.tsv.gz
│ ├── scaff_grp_0017.positions.tsv.gz
│ ├── scaff_grp_0018.positions.tsv.gz
│ ├── scaff_grp_0019.positions.tsv.gz
│ ├── scaff_grp_0020.positions.tsv.gz
│ ├── scaff_grp_0021.positions.tsv.gz
│ ├── scaff_grp_0022.positions.tsv.gz
│ ├── scaff_grp_0023.positions.tsv.gz
│ ├── scaff_grp_0024.positions.tsv.gz
│ ├── scaff_grp_0025.positions.tsv.gz
│ ├── scaff_grp_0026.positions.tsv.gz
│ ├── scaff_grp_0027.positions.tsv.gz
│ ├── scaff_grp_0028.positions.tsv.gz
│ ├── scaff_grp_0029.positions.tsv.gz
│ ├── scaff_grp_0030.positions.tsv.gz
│ ├── scaff_grp_0031.positions.tsv.gz
│ ├── scaff_grp_0032.positions.tsv.gz
│ ├── scaff_grp_0033.positions.tsv.gz
│ ├── scaff_grp_0034.positions.tsv.gz
│ ├── scaff_grp_0035.positions.tsv.gz
│ ├── scaff_grp_0036.positions.tsv.gz
│ ├── scaff_grp_0037.positions.tsv.gz
│ ├── scaff_grp_0038.positions.tsv.gz
│ ├── scaff_grp_0039.positions.tsv.gz
│ ├── scaff_grp_0040.positions.tsv.gz
│ └── scaff_grp_0041.positions.tsv.gz
├── thin_10_1
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
├── thin_10_4
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
├── thin_6_1
│ ├── info
│ │ ├── bcf_stats.txt
│ │ ├── info.txt
│ │ ├── positions.tsv.gz
│ │ ├── sample_subset_file.txt
│ │ ├── samples.txt
│ │ └── scaffold_groups.tsv
│ ├── main.bcf
│ └── main.bcf.csi
└── thin_6_4
├── info
│ ├── bcf_stats.txt
│ ├── info.txt
│ ├── positions.tsv.gz
│ ├── sample_subset_file.txt
│ ├── samples.txt
│ └── scaffold_groups.tsv
├── main.bcf
└── main.bcf.csi
I used the ruleorder directive to deal with the thin_0_0 cases. Yay! It seems to work all the way through the beagle-gl files.
Here is the command that does it:
snakemake -np --cores 8 --use-conda results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/beagle-gl.gz --configfile .test/config/config.yaml
Which gives this:
job count min threads max threads
--------------------------------- ------- ------------- -------------
bcftools_subsamp_and_filt_gather 1 1 1
bcftools_subsamp_and_filt_scatter 41 1 1
bcftools_thin_gather 2 1 1
bcftools_thin_scatter 82 1 1
beagle3_glikes_from_PL_scattered 123 1 1
gather_beagle3_glikes_from_PL 3 1 1
total 252 1 1
Gonna deal with these:
So, for that one, my plan is to have a rule that cats the top-row and the body of each beagle.gz file.
Here is a test invocation:
snakemake -np --cores 8 --use-conda \
results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/beagle-gl.gz \
results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/sections/scaff_grp_000{1..9}.beagle-gl.gz \
--configfile .test/config/config.yaml
pcangsd does not seem to come with the standard angsd conda distro. Rather it can be downloaded from github and installed. It seems to get installed into whatever conda environment one is using, so I can make a rule that installs it into our pcangsd conda env.
This only runs on Linux, so we have to test it on the cluster. I am going to go ahead and test it with the data set of 454 fish. I have a new line in the bcfs.tsv file for it with ID: TEST-Biallelic-SNPS-460-fish-maf-0.05.
I can generate the beagle.gz file for it like this on the cluster with a whole node checked out:
snakemake -p --cores 20 --use-conda results/beagle-gl/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/beagle-gl.gz --configfile .test/config/config.yaml
The resulting beagle.gz file has 3168 variants in it. This should be a decent data set for making sure I have pcangsd up and running well.
import numpy as np
import pandas as pd

pi = np.load("TEST-PCANGSD.pi.npy")
df = pd.DataFrame(pi)
piv = pi.flatten(order='F')  # turns it into a vector (column-major order)
I'll play with these and find a quick, memory-reasonable way of computing the genotype posteriors and writing them back out to a beagle.gz file.
There really is no --post_save in this new cythonized version of pcangsd. Perhaps because it is too hard to write it out?
Here is where the code for calling genotypes lives:
https://github.com/Rosemeis/pcangsd/blob/master/pcangsd/shared_cy.pyx#L100-L130
And it is pretty clear what one would have to do to make that output a big flat file of genotype posteriors, I think...
Having looked over the code in pcangsd, it is clear that they read the beagle geno likes directly into a numpy array and they toss the information about marker names and the alleles. So, if we are going to try to create a beagle file with posteriors, we won't have the locus information. Not to worry. We will write it out as a text file and then just make another layer in our workflow to reattach that locus information (accounting for any filtered sites...)
To do this, I have made a fork of the pcangsd repository. The plan is to put the --post_save option back in there. Here is what a test run on that looks like:
BEAG=/home/eanderson/Documents/git-repos/mega-post-bcf-exploratory-snakeflows/results/beagle-gl/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/beagle-gl.gz
pcangsd -b $BEAG --geno 0.7 --pi_save -t 20 --post_save --out quick-test --minMaf 0.0
I have that working now and pulled it into main on my fork. But the file writing is tremendously slow!! It is slow whether I write it with numpy.savetxt directly, or if I turn it into a Pandas data frame. It is only 254 rows with 3168 markers, or so, and it takes 16 seconds. WTF??? That would be like 4.4 hours with 3 million markers. That seems ridiculously slow. Is there a way we could just write it from Cython and have it be faster?
I updated this by using C's fprintf() function to write the array to disk. It is much faster now.
In a perfect world, I would figure out how to read the marker names and sample names in and also write that out as a proper BEAGLE file to a pipe that gzipped it. But, alas, that is too much for now.
So, instead I write it as uncompressed TSV and then I will paste the marker names and cat the header names to it.
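The paste-the-names-back-on step could be sketched like this (file layout and names assumed; it also assumes no sites were filtered between the two files, so rows line up one-to-one; the filtered-sites wrinkle noted earlier would need matching by position instead):

```python
import gzip

def reattach(beagle_gl_gz, gposts_tsv, out_gz):
    """Glue beagle marker/allele columns and header onto a posteriors TSV."""
    with gzip.open(beagle_gl_gz, "rt") as gl, \
         open(gposts_tsv) as gp, \
         gzip.open(out_gz, "wt") as out:
        out.write(gl.readline())  # reuse the beagle header row
        for gl_line, gp_line in zip(gl, gp):
            # first three columns are marker, allele1, allele2
            marker_cols = gl_line.split("\t", 3)[:3]
            out.write("\t".join(marker_cols) + "\t" + gp_line)
```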
Adding pcangsd stuff in there. At this point, here is one invocation for testing:
snakemake -np --cores 20 --use-conda results/pcangsd/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/out.gposts.tsv --configfile .test/config/config.yaml
This returns reasonable values for the posteriors, but the covariance matrix is all NaNs and Infs. WTF? I am wondering if pcangsd does not deal well with very small numbers in exponential notation. I've noticed that the Beagle files used as examples in various ANGSD tutorials, and in the angsd website are just floats with 6 decimal precision. (i.e. %.6f). So, I will want to try to change mine to that.
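The precision change is simple enough; a sketch of what a fixed-precision beagle row would look like (function name is mine):

```python
def beagle_row(marker, a1, a2, gls):
    """Format one beagle line with 6-decimal fixed-point genotype likelihoods
    instead of exponential notation."""
    return "\t".join([marker, str(a1), str(a2)] + [f"{g:.6f}" for g in gls])
```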
But I also should play around, first, with dropping sites with more than 50% missing data (not that that should make any difference... but I should do that anyway).
I did both of those things and also changed the alleles to 0,1,2,3 instead of 1,2,3,4, and still the cov matrix is all infs and nans.
snakemake -np --cores 20 --use-conda results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_0_0/out.gpost.tsv --configfile .test/config/config.yaml
Then I did a bunch of exploring around. If I filter out any site that has a geno likelihood of 0.3333 it works. However, filtering sites by F_MISSING doesn't really work.
Then I went and looked at the BCF files and there are lots of sites with 1 or 2 reads for, say, the reference allele, but the phred-scaled genotype likelihoods are still 0,0,0. What is up with that. Has the BQSR completely trashed this, or is it actually trying to use sites with really low base quality scores? Completely lame.
I also peered through the pcangsd code. The individual allele freqs are estimated separately and before the covariance matrix. So, it is conceivable that the indiv allele freqs are being calculated correctly, and then it is borking on the covariance matrix.
Working along this more, it appears that covariance_cy.updateDosages is a function that computes the posterior genotype probs but then uses them to write out the dosages. So that is no real help.
We still have to figure out why the covariance matrix is borked.
I output the mafs and genotype dosages to inspect. The estimated mafs actually have some MAFs of 1.0. That seems pretty messed up. I will go back to the BCF file and check it out.
Sure enough:
NC_056432.1 47853288 . A C 62394 PASS AC=48;AF=0.827586;AN=58;BaseQRankSum=0.431;DP=1622;ExcessHet=0;FS=0;InbreedingCoeff=0.3872;MLEAC=955;MLEAF=0.995;MQ=59.99;MQRankSum=0;NMISS=25;QD=27.72;ReadPosRankSum=0.967;SOR=0.23;NS=29;MAF=0.172414;AC_Het=0;AC_Hom=48;AC_Hemi=0;HWE=2.27589e-06;ExcHet=1 GT:AD:DP:GQ:PGT:PID:PL:PS 1/1:0,4:4:12:.:.:124,12,0:. 1/1:0,7:7:21:.:.:210,21,0:. 1/1:0,7:7:21:.:.:193,21,0:. 1/1:0,3:3:9:.:.:99,9,0:. 1/1:0,2:2:6:.:.:68,6,0:. 0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,6:6:18:.:.:183,18,0:. 1/1:0,3:3:9:.:.:92,9,0:. 1/1:0,9:9:27:.:.:295,27,0:. 1/1:0,4:4:12:.:.:135,12,0:. 1/1:0,2:2:6:.:.:69,6,0:. 1/1:0,2:2:6:.:.:68,6,0:. 1/1:0,4:4:12:.:.:117,12,0:. 1/1:0,2:2:6:.:.:68,6,0:. 1/1:0,2:2:6:.:.:67,6,0:. 1/1:0,2:2:6:.:.:68,6,0:. 1/1:0,5:5:15:.:.:160,15,0:. 1/1:0,3:3:9:.:.:83,9,0:. 0/0:1,0:1:0:.:.:0,0,0:. 0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,3:3:9:.:.:101,9,0:. 0/0:1,0:1:0:.:.:0,0,0:. 0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,4:4:12:.:.:133,12,0:. 1/1:0,5:5:15:.:.:145,15,0:. 1/1:0,3:3:9:.:.:82,9,0:. 1/1:0,8:8:24:.:.:228,24,0:. 1/1:0,3:3:9:.:.:96,9,0:. 1/1:0,4:4:12:.:.:135,12,0:.
That lovely locus is treated by bcftools as having a minor allele frequency of 0.1724, but all of those "minor" alleles are simply in reference homozygotes that have 0,0,0 for the genotype likelihoods. That is some messed up shit.
Clearly, if I am to filter these in the wgs workflow I should use MLEAF, but, sadly that is not computed on the fly by bcftools. What a disaster.
I think that perhaps the thing to do is sample selection with bcftools, and maybe some light filtering, but no filtering on MAFs. Then use angsd to compute the allele frequencies, and get a mafs.gz file from which I could then filter the beagle file on MAF just using awk or something. Of course, angsd treats the REF as the Major and Alt as the minor in that context, so I would want to filter on (AF > maf_cutoff & AF < 0.5) | (AF > 0.5 & AF < 1 - maf_cutoff). It seems then that I could send that off to pcangsd.
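The two-range condition above can be folded into minor-allele-frequency terms; a sketch (note the condition as written drops sites at exactly AF = 0.5, whereas this version keeps them, which is probably what's intended):

```python
def passes_maf(af, maf_cutoff):
    """Keep a site if its folded minor allele frequency exceeds the cutoff."""
    maf = min(af, 1.0 - af)
    return maf > maf_cutoff
```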
It turns out that ANGSD itself is a fricking SHIT-SHOW, because it uses underscores to separate chromosomes from positions, so that a chromosome or scaffold with a name like NC_03451.1 is read as a chromosome named NC followed by a position.
BUT, looking at the code for pcangsd, the MAF filtering is done correctly as a MAF filter:
maskMAF = (f >= args.minMaf) & (f <= (1 - args.minMaf))
So, we could do the MAF filtering in pcangsd and then reconstitute the marker names and beagle file as necessary with some awk and paste. Phew!!
So, it looks like in order to make this stuff work with angsd we can change underscores in the locus names to dashes. We need to do that in the .fai file as well.
For now, I will just change things in the awk script that makes the beagle.gl.gz file.
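The renaming itself is trivial; in Python terms (the workflow does it in awk), the idea is that once chromosome-name underscores become dashes, the only underscore left in a marker name is the one separating chromosome from position:

```python
def angsdify_chrom(chrom):
    """Replace underscores in a chromosome/scaffold name with dashes."""
    return chrom.replace("_", "-")

def angsd_marker(chrom, pos):
    """Build an angsd-safe beagle marker name."""
    return f"{angsdify_chrom(chrom)}_{pos}"
```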
I wrote code to bung together a beagle posteriors file from the pcangsd output and the header of the beagle-gl file.
The invocation I am working on for testing this is currently:
snakemake -np --cores 20 --use-conda results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_0_0/maf_0.04/beagle-post.gz --configfile .test/config/config.yaml
That is working now. Had to tweak the pipefail stuff because I was using head, etc. Lame.
I am now ready to test it on, let's say close to a million SNPs.
We have about 6.5 million snps in all-pass-maf-0.05 for the all-together yukon. So, I will just throw another line in the bcfs.tsv for them.
I am doing this in:
/home/eanderson/scratch/PROJECTS/mega-post-bcf-exploratory-snakeflows
test run:
snakemake -np --cores 20 --use-conda results/pcangsd/Mondo-454-fish-maf-0.05/thin_6_1/maf_0.04/beagle-post.gz --configfile .test/config/config.yaml
I ran that using the sedna slurm profile and it did great. Got to the pcangsd step, and then it botched---it couldn't find pcangsd. So I restarted it after removing the pcangsd install flag, and I ran it locally on a checked out core, and it had no further problems.
Starting to figure out the ANGSD asso stuff. Just playing around with a 10K marker set. Here is what I was doing (re-tracing steps that I did not record at the time, a week ago, before vacation...)
# this is working:
angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4
# currently in: /home/eanderson/scratch/PROJECTS/mega-post-bcf-exploratory-snakeflows/play
# here is what the files look like:
(angsd-0.937) [sedna: play]--% zcat beag-10k-post.gz | head | awk 'BEGIN {OFS="\t"} {print $1, $2, $3, $4, $5, $6, $7, $8, $9}'
marker allele1 allele2 T199967 T199967 T199967 T199968 T199968 T199968
NC-056429.1_523 2 3 1.5930e-03 7.6638e-02 9.2177e-01 0.0000e+00 2.6336e-03 9.9737e-01
NC-056429.1_3384 1 3 0.0000e+00 9.9571e-01 4.2881e-03 0.0000e+00 1.1130e-01 8.8870e-01
NC-056429.1_9249 3 2 0.0000e+00 1.0315e-02 9.8969e-01 8.0988e-03 1.6379e-01 8.2811e-01
NC-056429.1_16498 0 2 3.5444e-03 1.1198e-01 8.8448e-01 0.0000e+00 3.2561e-02 9.6744e-01
NC-056429.1_26537 0 1 7.5160e-01 2.4840e-01 0.0000e+00 4.7456e-05 9.9995e-01 4.7189e-10
NC-056429.1_29392 0 2 1.1095e-05 9.9999e-01 7.6503e-10 8.7806e-01 1.2192e-01 2.1214e-05
NC-056429.1_34425 2 0 2.2667e-02 2.5578e-01 7.2156e-01 0.0000e+00 1.5535e-01 8.4465e-01
NC-056429.1_44032 3 1 0.0000e+00 2.9467e-02 9.7053e-01 0.0000e+00 1.3273e-02 9.8673e-01
NC-056429.1_51167 1 3 0.0000e+00 1.9507e-02 9.8049e-01 0.0000e+00 6.9427e-02 9.3057e-01
# the fai file has to have the angsd-ified chrom names, too---underscores are turned to dashes
(angsd-0.937) [sedna: play]--% head genome.fasta.fai
NC-056429.1 95801903 117 80 81
NC-056430.1 75537397 96999661 80 81
NC-056431.1 81554755 173481393 80 81
NC-056432.1 75613410 256055700 80 81
NC-056433.1 90317969 332614395 80 81
NC-056434.1 80340224 424061456 80 81
NC-056435.1 77130101 505406050 80 81
NC-056436.1 91267893 583500395 80 81
NC-056437.1 86417572 675909254 80 81
NC-056438.1 84826039 763407163 80 81
# the samples file is in this format:
# it turns out that all the covariates, including discrete ones like sex have to be integers,
# it appears. It bombed when I used Ms and Fs.
(angsd-0.937) [sedna: play]--% head dot_samples2.tsv
ID age sex PC1 PC2 PC3 PC4
0 B D C C C C
T199967 1 1 0.000466737846601796 -0.0355754261982327 0.0410549697242467 -0.0956605172102598
T199968 1 0 -0.0485469374738537 0.00278541349388417 0.0693590254462809 0.0654561161382242
T199969 1 1 0.0369939568333824 -0.0710541710300985 -0.0175538262788173 0.0121487002814868
T199970 1 0 -0.0434925140427411 0.00378501130744979 0.130358439300395 0.0609703909488089
T199971 1 0 -0.0217766288507172 -0.0214415058343114 0.0763449795915192 -0.181235243692794
T199972 1 0 0.0450978210333021 -0.0309180469622904 -0.00769339107871081 0.0243643715339333
T199973 1 0 -0.0546464166951503 0.0121567360058559 -0.0774412068776019 -0.00794532933579045
T199974 1 0 -0.0491440899457988 0.0129302261721018 -0.0109094686470035 0.00193556110694826
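The sampleFile layout gleaned above can be sketched as follows (assumed: row 1 is column names with ID first; row 2 gives a type code per data column, B binary, D discrete, C continuous, with a leading 0 in the ID slot as in the example; covariates must be numeric, since Ms and Fs bombed):

```python
def samplefile_lines(colnames, coltypes, rows):
    """Build the lines of an angsd -sampleFile as tab-joined strings."""
    lines = ["\t".join(colnames),
             "\t".join(["0"] + list(coltypes))]  # type row, ID slot is "0"
    for row in rows:
        lines.append("\t".join(str(x) for x in row))
    return lines
```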
After that running successfully, I find myself wondering if the dot_samples file can:
We can experiment with those two things here and now.
First, rerun what we did before:
(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4
-> angsd version: 0.937 (htslib: 1.15.1) build(Mar 3 2022 00:51:08)
-> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4
-> No '-out' argument given, output files will be called 'angsdput'
-> Inputtype is beagle
-> Printing at chr: NC-056429.1 pos:34484806 chunknumber 200 contains 50 sitesDone reading beagle
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"angsdput.arg"
->"angsdput.lrt0.gz"
-> Sat Jul 30 08:43:20 2022
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 10000
-> Number of sites retained after filtering: 10000
[ALL done] cpu-time used = 140.11 sec
[ALL done] walltime used = 140.00 sec
Then we try it with just the first 30 individuals:
(angsd-0.937) [sedna: play]--% head -n 32 dot_samples2.tsv > first30_dot_samples.tsv
(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile first30_dot_samples.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out first30
-> angsd version: 0.937 (htslib: 1.15.1) build(Mar 3 2022 00:51:08)
-> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile first30_dot_samples.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out first30
-> Inputtype is beagle
The number of sequenced individuals (454) does not match the number of phenotypes (30)
OK, that is pretty definitive. The number of sequenced individuals has to match the number of phenotypes.
What about if they are in a different order? Will it be able to work that out? I doubt it, but we can try...
# put them in a different order than they are in the genotypes file:
(head -n 2 dot_samples2.tsv; awk 'BEGIN{OFS="\t"} NR>2 {print rand(),$0}' dot_samples2.tsv | sort -n -b -k 1 | cut -f 2-) > dot_scrambled.tsv
# then try running it:
(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_scrambled.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out scrambled
-> angsd version: 0.937 (htslib: 1.15.1) build(Mar 3 2022 00:51:08)
-> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_scrambled.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out scrambled
-> Inputtype is beagle
-> Printing at chr: NC-056429.1 pos:34484806 chunknumber 200 contains 50 sitesDone reading beagle
-> Done reading data waiting for calculations to finish
-> Done waiting for threads
-> Output filenames:
->"scrambled.arg"
->"scrambled.lrt0.gz"
-> Sat Jul 30 08:58:32 2022
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 10000
-> Number of sites retained after filtering: 10000
[ALL done] cpu-time used = 139.58 sec
[ALL done] walltime used = 140.00 sec
# So, that works. Let's verify that the results are the same:
(angsd-0.937) [sedna: play]--% zcat angsdput.lrt0.gz | head
Chromosome Position Major Minor Frequency N LRT beta SE high_WT/HE/HO emIter
NC-056429.1 523 G T 0.949794 454 2.013597 0.637792 0.402173 0/14/363 2
NC-056429.1 3384 C T 0.313106 454 0.134049 -0.085042 0.224399 84/42/5 1
NC-056429.1 9249 T G 0.930452 454 4.491319 0.877984 0.375863 0/22/308 3
NC-056429.1 16498 A G 0.939332 454 4.391099 0.869978 0.395756 0/15/310 4
NC-056429.1 26537 A C 0.131353 454 1.801782 0.370892 0.286734 234/38/0 2
NC-056429.1 29392 A G 0.169399 454 0.611648 0.177085 0.228759 215/30/3 1
NC-056429.1 34425 G A 0.880862 454 1.208272 0.341601 0.308311 0/40/291 2
NC-056429.1 44032 T C 0.943822 454 5.188045 0.963013 0.406805 0/16/327 2
NC-056429.1 51167 C T 0.922563 454 7.556225 0.981645 0.362819 0/27/319 3
(angsd-0.937) [sedna: play]--% zcat scrambled.lrt0.gz | head
Chromosome Position Major Minor Frequency N LRT beta SE high_WT/HE/HO emIter
NC-056429.1 523 G T 0.949794 454 0.001338 -0.016129 0.462097 0/14/363 1
NC-056429.1 3384 C T 0.313106 454 0.437107 -0.126122 0.192100 84/42/5 1
NC-056429.1 9249 T G 0.930452 454 0.203409 0.170942 0.380376 0/22/308 2
NC-056429.1 16498 A G 0.939332 454 0.092406 -0.122712 0.449788 0/15/310 1
NC-056429.1 26537 A C 0.131353 454 1.180838 0.303640 0.294147 234/38/0 2
NC-056429.1 29392 A G 0.169399 454 0.030809 -0.037865 0.208772 215/30/3 1
NC-056429.1 34425 G A 0.880862 454 0.693206 -0.221533 0.293014 0/40/291 1
NC-056429.1 44032 T C 0.943822 454 0.999333 -0.427150 0.430205 0/16/327 2
NC-056429.1 51167 C T 0.922563 454 1.679021 -0.473920 0.381925 0/27/319 2
So, those LRT results are nowhere near the same. I suspect that ANGSD does not check the order of the samples in the two different files. Good to know.
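Given that angsd evidently pairs sample rows with beagle columns by position, not by ID, a defensive sketch is to reorder the sample file to the beagle file's sample order before running -doAsso (function name is mine):

```python
def reorder_samples(sample_rows, beagle_sample_order):
    """Reorder sample-file rows (ID in column 0) to match the beagle
    file's sample order, so positional pairing is correct."""
    by_id = {row[0]: row for row in sample_rows}
    return [by_id[s] for s in beagle_sample_order]
```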
Here is the command for using my beagle-slicer gawk script:
(base) /mega-post-bcf-exploratory-snakeflows/--% (main) pwd
/Users/eriq/Documents/git-repos/mega-post-bcf-exploratory-snakeflows
(base) /mega-post-bcf-exploratory-snakeflows/--% (main) mkdir -p testy
gzcat fracto-beagle-post.gz | cat .test/bcf/test.bcf.scaff_groups.tsv - | gawk -v path=testy -f workflow/scripts/beagle-slicer.awk
That seems to be working....
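The gist of the slicer (a sketch of the approach only, not the actual workflow/scripts/beagle-slicer.awk) is: first read the scaff_groups TSV to build a chromosome-to-scaffold-group map, then route each beagle data line to a per-group output file based on the chromosome encoded in its marker name:

```python
import os

def slice_beagle(scaff_group_rows, beagle_lines, path):
    """Route beagle rows into one file per scaffold group.

    scaff_group_rows: (id, chrom, start, stop) tuples, as in the scaff_groups TSV.
    beagle_lines: data lines whose first field is a marker like 'NC-056429.1_523'.
    path: output directory; one '<scaff_group_id>.beagle' file is written per group.
    """
    # First pass over the scaff_groups table: chromosome -> scaffold group id.
    chrom_to_group = {chrom: sg_id for sg_id, chrom, _, _ in scaff_group_rows}

    os.makedirs(path, exist_ok=True)
    handles = {}
    for line in beagle_lines:
        # Marker names are chrom_pos; split from the right in case the
        # chromosome name itself contains underscores.
        chrom = line.split("\t")[0].rsplit("_", 1)[0]
        sg = chrom_to_group[chrom]
        if sg not in handles:
            handles[sg] = open(os.path.join(path, sg + ".beagle"), "w")
        handles[sg].write(line + "\n")
    for fh in handles.values():
        fh.close()
```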
Working on testing this by going for thinning by 10000. A typical file that I might want would be
# "results/pcangsd/{{bcf_id}}/thin_{{thin_int}}_{{thin_start}}/maf_{{min_maf}}/sections/{asg}-beagle-post.gz"
results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_10000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz
# so that this is a good way to dry run it:
snakemake -np --cores 20 --use-conda results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_10000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config/config.yaml
When thinning by 10,000, there are some scaffold groups that end up with no markers. And that causes the beagle slice rule to fail. So, I did it thinning by 1000.
Oh wait! The problem might be that I am using the test data set which actually might not have representatives from all scaffold groups, or something like that. Oops.
So, try this:
snakemake -p --cores 20 --use-conda results/pcangsd/Mondo-454-fish-maf-0.05/thin_5000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config/config.yaml
And that ran to completion.
Now, it looks like I just need a rule to run angsd doAsso on each of those.
I ran the whole thing through to the end of the asso with:
snakemake -p --cores 20 --use-conda --configfile .test/config/config.yaml
The required outputs were in rule all. That was for thinning by 6. But now I want to run it on the whole data set.
That will be chronicled in the notebook at:
/Users/eriq/Documents/projects/yukon-chinookomes/005-association-round-two.Rmd
I have grown completely disenchanted with configuring runs in a TSV file. There are some things that really work poorly for that!! So, I am going to:
I am starting this on the bcf_samps_and_filt_scatter rule, so our test target can be:
snakemake --use-conda -np results/bcf_testy/filt_snps05/yc30/thin_0_0/sections/scaff_grp_0001.bcf --configfile .test/config/config.yaml
And after that worked, we go for the gather rule after that.
snakemake --use-conda -np results/bcf_testy/filt_snps05/yc30/thin_0_0/main.bcf --configfile .test/config/config.yaml
And once that was working, I revamped the thinning rules so that we can now test with:
snakemake --use-conda -np results/bcf_testy/filt_snps05/{yc30,yc_males,yc_females}/thin_6_1/main.bcf --configfile .test/config/config.yaml
Then, as I started working on the stuff in formats.smk, I can use:
snakemake --use-conda -np results/bcf_testy/filt_snps05/{yc30,yc_males}/{thin_6_1,thin_0_0}/beagle-gl/beagle-gl.gz --configfile .test/config/config.yaml
Things are all done up through formats.smk at this point.
Now I just need to revamp the pcangsd stuff....
Working on the pcangsd stuff I have it through:
snakemake --use-conda -np results/bcf_testy/filt_snps05/yc30/thin_0_0/pcangsd/maf_0.05/beagle-post.gz --configfile .test/config/config.yaml
And I will check to see if I can get the sliced beagle file:
snakemake --use-conda -np results/bcf_testy/filt_snps05/yc30/thin_0_0/pcangsd/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config/config.yaml
I have taken this as far as I can go at the moment. I will pull it all into the main branch now.
I just need to add a few more things. Note that the bcf section of the config can look like this:
bcf:
  testy:
    path: ".test/bcf/test.bcf"
    genome_path: resources/test.bcf.genome.fasta
    scaff_group_path: .test/bcf/test.bcf.scaff_groups.tsv
    sample_subsets:
      yc30:
        path: .test/config/sample_subsets/yc30.txt # 30 of the Yukon Chinook
      yc454:
        path: .test/config/sample_subsets/yc454.txt # all 454 Yukon River Chinook
        dotsample:
      yc_males:
        path: .test/config/sample_subsets/yc_males.txt # just the males
      yc_females:
        path: .test/config/sample_subsets/yc_females.txt # just the females
The key point there is that a sample subset can have a dotsample for it. This should be the dotsample with all the phenotype information and the covariate information except for the principal components.
I think this is far enough now that I can make some PCAs of the rockfish data.
Now, some other major overhauls
Having overhauled the .test directory structure pretty extensively, we can now run/test this thing like this:
snakemake --use-conda -np results/bcf_testy/filt_snps05/{all,males,females}/thin_0_0/pcangsd/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config.yaml
I committed the changes and ran it on SEDNA and the above ran perfectly.
OK! I am now in a perfect position to start overhauling the do-asso stuff to make it more extensible from within the config file actions sections. This will take some thought. I also want to include plotting of manhattan plots and also the pca bi-plots, but maybe that is better left for interactive work....
At any rate, next up is:
This is in a rule called attach_PCs_to_dotsamples
Here is a target file for it:
results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv
So, we can try to give it a run:
snakemake --use-conda -np results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv --configfile .test/config.yaml
That checks out on the dry-run, but I will have to run it on the cluster, cuz I haven't yet been able to compile pcangsd up on the mac.
So, on SEDNA, we try:
snakemake -np --use-conda --use-envmodules results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv --configfile .test/config.yaml
That is all working. So, now it is on to the next step.
This turns out to be somewhat complicated, but I think I have something worked out now.
Here is the command that will get it going:
snakemake --use-conda -np results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/sections/scaff_grp_0001.lrt0.gz --configfile .test/config.yaml
And I ran it on SEDNA with this:
snakemake --cores 20 --use-conda --use-envmodules results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/sections/scaff_grp_0001.lrt0.gz --configfile .test/config.yaml
And it was successful.
I added that and now I can do this on SEDNA:
snakemake --cores 20 --use-conda --use-envmodules results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/all-scaff-groups.lrt0.gz --configfile .test/config.yaml
That is all working great.
I am going to add a directory of example configs that will be a good place for me to store my configs that I use for particular projects, and thus keep them synced up.
I have a function expand_targets() that reads the "targets" section of the config to request output files in rule all.
Here is how it is described:
#### PARAMETERS SECTION
# Here we can give some short names for different sets of parameters
# that we will use later for particular ultimate targets.
# These are all the standard parameters in the file paths.
main_params:
  standard:
    bcf: testy
    filt: snps05
    thin_spec: "0_0"
    maf: 0.05

# Here we provide different sets of parameters for particular
# analyses.
params:
  do_asso:
    age_cohort_4PCs:
      angsd_opts: " -doMaf 4 -doAsso 4 "
      whichPhe: " age "
      whichCov: " cohort,PC-1,PC-2,PC-3,PC-4 "

#### TARGETS SECTION
# Here we have a shorthand for describing target outputs.
# Each output is defined by a triple: [A, B, C] that lives under
# a heading which says which kind of analysis is desired, where
#  - A is a nickname from main_params
#  - B is a nickname for a subsample that is appropriate to the BCF in A
#  - C is a nickname for the parameters defined in params for the particular analysis
#
# The idea here is that each of these triples can be used to define a file name,
# or possibly multiple filenames.
targets:
  do_asso:
    - ["standard", "all", "age_cohort_4PCs"]
    - ["standard", "males", "age_cohort_4PCs"]
    - ["standard", "females", "age_cohort_4PCs"]
As of this writing, I can do a dry run on the test config like this:
snakemake -np --use-conda --configfile .test/config.yaml
And it figures out from rule all that the requested outputs are:
localrule all:
input: results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz, results/bcf_testy/filt_snps05/males/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz, results/bcf_testy/filt_snps05/females/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz
which is pretty rad. It then figures out that it has 348 jobs to do.
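The expansion behind that is straightforward. Here is a sketch of how the triples could turn into target paths (not the actual expand_targets(); the path template is just inferred from the rule all output above):

```python
def expand_targets(config):
    """Expand [main_params, subset, analysis_params] triples into targets.

    Sketch only: the path template is inferred from the rule-all output
    above; the real expand_targets() lives in the workflow.
    """
    out = []
    for analysis, triples in config["targets"].items():
        for main_nick, subset, params_nick in triples:
            mp = config["main_params"][main_nick]
            out.append(
                "results/bcf_{bcf}/filt_{filt}/{sub}/thin_{thin}/"
                "{an}/maf_{maf}/{pn}/all-scaff-groups.lrt0.gz".format(
                    bcf=mp["bcf"], filt=mp["filt"], sub=subset,
                    thin=mp["thin_spec"], an=analysis,
                    maf=mp["maf"], pn=params_nick,
                )
            )
    return out

config = {
    "main_params": {"standard": {"bcf": "testy", "filt": "snps05",
                                 "thin_spec": "0_0", "maf": 0.05}},
    "targets": {"do_asso": [["standard", "all", "age_cohort_4PCs"],
                            ["standard", "males", "age_cohort_4PCs"],
                            ["standard", "females", "age_cohort_4PCs"]]},
}
for t in expand_targets(config):
    print(t)
```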
It might be better to just have the workflow read another yaml file called, say, "targets.yaml" whose location is defined in the config. With the new versions of Snakemake, any change in the config file might trigger re-runs of everything...
I am just starting to put this together. The basic idea is that when exploring NGS data that comes to you in a BCF file, there are lots of ways you might want to tweak it in the process. For example:
On top of that, once you start analyzing those data you will want to try different things:
So, these are all just the things I am thinking about as I start noodling around with a few things here.
I am thinking about a wildcarding/directory structure that looks something like this for the basic BCF maneuvers:
But, now that I think about it, I'd say the thinning factors should be part of the filtering criteria. So, instead, the basic structure would be:
My idea for these wildcards is, then:

- {bcf_file} is the tag given in a TSV file to a path to a BCF.
- {thin_int} and {thin_start}: These are there for doing simple deterministic thinning of the file, starting at variant thin_start and then taking every thin_int-th marker after that. If either of them is 0, there is no thinning done.
- {sample_subset} is the name given to a subset of samples in the BCF file. The actual samples themselves are given in a file, typically config/sample_subsets/{sample_subset}.txt, that can be read easily by bcftools or vcftools. The exact location of the sample_subsets directory will be given in config.yaml.
- {filter_crit} will match the ID line in a TSV file that has other columns, bcftools and vcftools, in which the relevant filtering option values are given. Replicates of filtering can be given with IDs like "FILT-1", "FILT-2", and so forth.

Well, that is enough theory for now. We should probably start building some things up.
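The deterministic thinning described for {thin_int} and {thin_start} can be sketched like this (a sketch of the intent only, not the workflow's implementation):

```python
def thin_markers(markers, thin_int, thin_start):
    """Keep marker thin_start (1-based), then every thin_int-th marker after it.

    If either parameter is 0, no thinning is done.
    """
    if thin_int == 0 or thin_start == 0:
        return list(markers)
    return [m for i, m in enumerate(markers, start=1)
            if i >= thin_start and (i - thin_start) % thin_int == 0]

# thin_6_1, as in the test runs above: keep markers 1, 7, 13, ...
print(thin_markers(list(range(1, 20)), 6, 1))  # [1, 7, 13, 19]
# thin_0_0: no thinning at all
print(len(thin_markers(list(range(1, 20)), 0, 0)))  # 19
```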
Hey! For parallelizing over genomic regions I am just going to have a single scaffold_groups.tsv file that includes both chromosomes and scaffolds, i.e., some scaffold groups might include just a single chromosome.

News Flash! The path to that file is going to be a column in bcfs.tsv, because we might have different BCFs that are mapped to different genomes, etc. Also, there should be a column that gives the path of the indexed genome, for some ANGSD applications I do believe.

Required files

- scaffold_groups.tsv: two columns, id and chrom, in that order. Each row is a chromosome from the reference genome, in the order that they appear in the genome.