eriqande / mega-post-bcf-exploratory-snakeflows

1 stars 15 forks source link

Development Narrative #1

Open eriqande opened 2 years ago

eriqande commented 2 years ago

I am just starting to put this together. The basic idea is that when exploring NGS data that comes to you in a BCF file, there are lots of ways you might want to tweak it in the process. For example:

Use different BCF files
Use different subsets of individuals
Use different filtering criteria
Thin the number of markers to a different level
Thin the number of markers to the same level, but do different randomly sampled replicates of that.

On top of that, once you start analyzing those data you will want to try different things:

Different values of K with ngsAdmix
Different replicate runs of ngsAdmix
Different parameter settings of other packages
Some you will want to break over chromosomes to parallelize
Others, you can't break over chromosomes, etc.

So, these are all just the things I am thinking about as I start noodling around with a few things here.

I am thinking about a wildcarding/directory structure that looks something like this for the basic BCF maneuvers:

"bcf_{bcf_file}/samp-sub_{sample_subset}/filt_{filter_crit}/tf_{thinning_factor}/tf-seed_{thin_factor_seed}/"

But, now that I think about it, I'd say the thinning factors should be part of the filtering criteria. So, instead, the basic structure would be:

bcf_{bcf_file}/thin_{thin_int}_{thin_start}/samp-sub_{sample_subset}/filt_{filter_crit}/

My idea for these wildcards is, then:

{bcf_file} is the tag given in a TSV file to a path to a BCF
{thin_int} and {thin_start} These are there for doing simple deterministic thinning of the file, starting at variant thin_start and then taking every thin_int-th marker after that. If either of them is 0, there is no thinning done.
{sample_subset} is the name given to a subset of samples in the BCF file. The actual samples themselves are given in a file, typically config/sample_subsets/{sample_subset}.txt, that can be read easily by bcftools or vcftools. The exact location of the sample_subsets directory will be given in config.yaml.
{filter_crit} will match the ID line in a TSV file that has other columns, bcftools and vcftools, in which the relevant filtering option values are given. Replicates of thinning can be given with IDs like "FILT-1", "FILT-2", and so forth.

Well, that is enough theory for now. We should probably start building some things up.

Hey! For parallelizing over genomic regions I am just going to have a single scaffold_groups.tsv file that includes both chromosomes and scaffolds. i.e. some scaffold groups might include just a single chromosome.

News Flash! The path to that file is going to be a column in the bcfs.tsv, because we might have different BCFs that are mapped to different genomes, etc. Also, there should be a column that gives the path of the indexed genome, for some ANGSD applications I do believe.

Required files

scaffold_groups.tsv: two columns, id and chrom, in that order. Each row is a chromosome from the reference genome, in the order that they appear in the genome.

eriqande commented 2 years ago

Development

Wildcards in play at the moment

bcf_file
scaff_grp
sample_subset
bcftools_opts

Test data set

We could use a decent test data set for this. I have 480-something chinook, filtered to maf 0.01 that is about 31 Gb, and has around 14 million variants in it. If we sampled at a rate of 0.0068 we would end up with about 100,000 markers, but that would still be about 210 Mb. So, let's cut it down to 10,000 markers and around 22 Mb.

mamba create -n vcflib-1.0.3 -c bioconda vcflib
conda activate vcflib-1.0.3
module load bio/bcftools

bcftools view pass-maf-0.01.bcf | vcfrandomsample -r 0.00068 | bcftools view -Ob > /tmp/test.bcf

Ha! That is abominably slow! Granted it is streaming through the whole file to pick out less than 1 in a 1000.

So, while waiting for that, I try it a different way:

# print every 1000-th marker to a file:
 bcftools query -f '%CHROM\t%POS\n' pass-maf-0.01.bcf | awk 'NR%1000==0' > /tmp/get-these.txt
 # that should get us about 14000 markers.

# then, I will take only half of those (get us down to about 7000 markers)
awk 'NR%2==0' get-these.txt > markers.txt

# and then we can access them via the index with the -R option
bcftools view -Ob  -R /tmp/markers.txt  pass-maf-0.01.bcf  > ~/test.bcf

That got done by the time the vcflib streaming approach was still on marker

OK...,let's here it for log2N binary search of indexed BCF files.

The resulting BCF file is 19 Mb. Perhaps still too big for a real .test data set, but I can start playing with it now on my laptop, and on SEDNA.

scaff groups for it

I am going to make these from the chromosomes and scaff groups.

library(tidyverse)

chrom <- read_tsv("~/Documents/projects/yukon-chinookomes/chromosomes.tsv") %>%
  mutate(id = as.character(1:n(), .before = chrom)
) %>%
  rename(stop = num_bases)
sg <- read_tsv("~/Documents/projects/yukon-chinookomes/scaffold_groups.tsv") %>%
  select(id, chrom, len) %>%
  rename(stop = len)

both <- bind_rows(chrom, sg) %>%
  mutate(
    id2 = sprintf("scaff_grp_%04d", as.integer(factor(id, levels = unique(id))))
  ) %>%
  mutate(id = id2) %>%
  mutate(start = 1) %>%
  select(id, chrom, start, stop)

write_tsv(both, ".test/bcf/test.bcf.scaff_groups.tsv")

eriqande commented 2 years ago

Revamping this all

I decided that it would be better to have all the filtering up front in bcftools, and then we can do thinning from there. Complete rewrite, but I think it will serve us better.

So, the idea here is that we have a single file, bcfs.tsv that gives the columns

id: tag/name for the resulting filtered vcf. becomes wildcard {bcf_id}
sample_subset: path to the sample subset file. No longer becomes a wildcard. All that info should be in the id.
bcftools_opts: No longer becomes a wildcard
parent_bcf_path: the path to the parent bcf used. Does not become a wildcard.

So, the idea is that the {bcf_id} is what we use to name all these in the paths. So, an id should be something informative, like biallelic-SNPs-32-Yukon-Chinook-maf-0.05 or whatever. We don't rely on that id to give us all the information, because there will be an info directory that has that information in it.

Then, thinned versions of this

So, the output directory structure will look something like this:

results/bcf    # top level directory into which all of these bcf files will go
└── {bcf_id}
    ├── info
    │   ├── info.txt   # tsv of the different options that created this file like:   tag:\titem
    │   ├── bcf_stats.txt  # the bcftools stats output from main
    │   ├── positions.tsv.gz  # gzipped CHROM\tPOS
    │   ├── sample_subset.txt  #  The sample subset file that was applied
    │   └── samples.txt  # the actual samples in the files obtained by bcftools query -l
    │   └── scaffold_groups.tsv  # Copy the scaffold_groups file here so it is all self-contained for thinning later
    ├── main.bcf     # The actual bcf file
    ├── main.bcf.csi  # the index of the bcf file
    ├── sections   # a temp() directory that is used to scatter-gather the whole thing
    └── thinned  # a directory to hold the thinned versions of the BCF
        ├── 10000_5000   # this means every 10000-th variant was retained, starting at variant number 5000
        │   ├── info  # same structure as info above
        │   ├── main.bcf
        │   ├── main.bcf.csi
        │   └── sections
        └── 1078_50
            ├── info
            ├── main.bcf
            ├── main.bcf.csi
            └── sections

Some notes:

We have to start with the scattering step. The paths in that will be like:
results/bcf/{bcf_id}/sections/{scaff_group}. with extensions: bcf, bcf_stats.txt, positions.tsv.gz
The thinned stuff happens later, requesting the bcf_id specifically.
The filtering happens after sample subsetting. So, the sample subset is extracted in a bcftools command that pipes uncompressed bcf to the filtering step.
The scatter-gather is set up to use the scaffold_groups file, and is designed so that the only time that we end up traversing the entire, full bcf file (either the parent or the output) is to make the index. As a consequence, withing the scatters in sections we will do
- bcftools stats
- writing positions out to positions.tsv.gz and then cat-ing all those.

bcfs.tsv

This file is now going to be updated to have these columns:

id
parent_bcf_path
sample_subset_path
bcftools_opts
scaff_group_path
genome_path

While working on this and debugging, here are some snakemake invocations I have been using:

# for bcftools_subsamp_and_filt_scatter
snakemake -np --cores 1 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/sections/scaff_grp_0001.scaff_members.tsv --configfile .test/config/config.yaml

# for bcftools_subsamp_and_filt_gather
 snakemake -p --cores 8 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/main.bcf  --configfile .test/config/config.yaml

More notes:

awk -v id='{wildcards.bcf_id}' -f workflow/scripts/tsv_row2yaml.awk {input.bcf_tsv} > {output.info}

doesn't work great if the bcfs.tsv file has Windows line endings. Lame!

Minor revamp

I am changing the paths so that they go like this:

results/bcf/{bcf_id}/thin_0_0/main.bcf

The reason is that this will make it easier to deal with cases with and without thinning. Because then we can request files like:

results/bcf/{bcf_id}/thin_0_0/main.bcf
# and
results/bcf/{bcf_id}/thin_10000_500/main.bcf

using the same syntax, etc.

So, now the resulting directory tree from scatter-gathering the filtering of bcfs looks like this:

results/
├── bcf
│   └── TEST-Biallelic-SNPS-5-fish-maf-0.1
│       └── thin_0_0
│           ├── info
│           │   ├── bcf_stats.txt
│           │   ├── info.txt
│           │   ├── positions.tsv.gz
│           │   ├── sample_subset_file.txt
│           │   ├── samples.txt
│           │   └── scaffold_groups.tsv
│           ├── main.bcf
│           └── main.bcf.csi

Of course, having done that, we will need to constrain the {thin_int} and {thin_start} wildcards to never be "0".

Indeed! Snakemake barked an error so I now have a regex demanding that they by positive integers.

Gotta Mention Somewhere

In the bcftools stats runs that I do, I set the af-tag to show MAF, which is more informative than reference allele freq, it seems to me.
thin_start must be <= thin_int (it throws an error if not).

Where I got to by the end of today

So, I was running it through various thinnings. Let's say I want to make a bcf with 5 indivs in it and then I want to subsample the variants, taking every 6th, or 10, and for each of those, starting on 1 or 4. Then I could do this:

snakemake -np --cores 8 --use-conda results/bcf/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{6,10}_{1,4}/main.bcf  --configfile .test/config/config.yaml

which gives me this:

job                                  count    min threads    max threads
---------------------------------  -------  -------------  -------------
bcftools_subsamp_and_filt_gather         1              1              1
bcftools_subsamp_and_filt_scatter       41              1              1
bcftools_thin_gather                     4              1              1
bcftools_thin_scatter                  164              1              1
total                                  210              1              1

When I run it, this is what results/bcf looks like afterward:

results/bcf
└── TEST-Biallelic-SNPS-5-fish-maf-0.1
    ├── thin_0_0
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   ├── main.bcf.csi
    │   └── sections
    │       ├── scaff_grp_0001.positions.tsv.gz
    │       ├── scaff_grp_0002.positions.tsv.gz
    │       ├── scaff_grp_0003.positions.tsv.gz
    │       ├── scaff_grp_0004.positions.tsv.gz
    │       ├── scaff_grp_0005.positions.tsv.gz
    │       ├── scaff_grp_0006.positions.tsv.gz
    │       ├── scaff_grp_0007.positions.tsv.gz
    │       ├── scaff_grp_0008.positions.tsv.gz
    │       ├── scaff_grp_0009.positions.tsv.gz
    │       ├── scaff_grp_0010.positions.tsv.gz
    │       ├── scaff_grp_0011.positions.tsv.gz
    │       ├── scaff_grp_0012.positions.tsv.gz
    │       ├── scaff_grp_0013.positions.tsv.gz
    │       ├── scaff_grp_0014.positions.tsv.gz
    │       ├── scaff_grp_0015.positions.tsv.gz
    │       ├── scaff_grp_0016.positions.tsv.gz
    │       ├── scaff_grp_0017.positions.tsv.gz
    │       ├── scaff_grp_0018.positions.tsv.gz
    │       ├── scaff_grp_0019.positions.tsv.gz
    │       ├── scaff_grp_0020.positions.tsv.gz
    │       ├── scaff_grp_0021.positions.tsv.gz
    │       ├── scaff_grp_0022.positions.tsv.gz
    │       ├── scaff_grp_0023.positions.tsv.gz
    │       ├── scaff_grp_0024.positions.tsv.gz
    │       ├── scaff_grp_0025.positions.tsv.gz
    │       ├── scaff_grp_0026.positions.tsv.gz
    │       ├── scaff_grp_0027.positions.tsv.gz
    │       ├── scaff_grp_0028.positions.tsv.gz
    │       ├── scaff_grp_0029.positions.tsv.gz
    │       ├── scaff_grp_0030.positions.tsv.gz
    │       ├── scaff_grp_0031.positions.tsv.gz
    │       ├── scaff_grp_0032.positions.tsv.gz
    │       ├── scaff_grp_0033.positions.tsv.gz
    │       ├── scaff_grp_0034.positions.tsv.gz
    │       ├── scaff_grp_0035.positions.tsv.gz
    │       ├── scaff_grp_0036.positions.tsv.gz
    │       ├── scaff_grp_0037.positions.tsv.gz
    │       ├── scaff_grp_0038.positions.tsv.gz
    │       ├── scaff_grp_0039.positions.tsv.gz
    │       ├── scaff_grp_0040.positions.tsv.gz
    │       └── scaff_grp_0041.positions.tsv.gz
    ├── thin_10_1
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    ├── thin_10_4
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    ├── thin_6_1
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    └── thin_6_4
        ├── info
        │   ├── bcf_stats.txt
        │   ├── info.txt
        │   ├── positions.tsv.gz
        │   ├── sample_subset_file.txt
        │   ├── samples.txt
        │   └── scaffold_groups.tsv
        ├── main.bcf
        └── main.bcf.csi

I can't have the positions files of the sections of the thin_0_0 directory be temporary because they will always be useful for future thinning. So, I just kept them there. That is pretty spot on!

eriqande commented 2 years ago

Continuing with some development

I realized that the separate files of scaffold members could get reused a lot, so I put those in a dedicated directory: results/bcf/{bcf_id}/scaff_members/

Now the directory structure after our canonical run looks like this:

results/bcf
└── TEST-Biallelic-SNPS-5-fish-maf-0.1
    ├── scaff_members
    │   ├── scaff_grp_0001.scaff_members.tsv
    │   ├── scaff_grp_0002.scaff_members.tsv
    │   ├── scaff_grp_0003.scaff_members.tsv
    │   ├── scaff_grp_0004.scaff_members.tsv
    │   ├── scaff_grp_0005.scaff_members.tsv
    │   ├── scaff_grp_0006.scaff_members.tsv
    │   ├── scaff_grp_0007.scaff_members.tsv
    │   ├── scaff_grp_0008.scaff_members.tsv
    │   ├── scaff_grp_0009.scaff_members.tsv
    │   ├── scaff_grp_0010.scaff_members.tsv
    │   ├── scaff_grp_0011.scaff_members.tsv
    │   ├── scaff_grp_0012.scaff_members.tsv
    │   ├── scaff_grp_0013.scaff_members.tsv
    │   ├── scaff_grp_0014.scaff_members.tsv
    │   ├── scaff_grp_0015.scaff_members.tsv
    │   ├── scaff_grp_0016.scaff_members.tsv
    │   ├── scaff_grp_0017.scaff_members.tsv
    │   ├── scaff_grp_0018.scaff_members.tsv
    │   ├── scaff_grp_0019.scaff_members.tsv
    │   ├── scaff_grp_0020.scaff_members.tsv
    │   ├── scaff_grp_0021.scaff_members.tsv
    │   ├── scaff_grp_0022.scaff_members.tsv
    │   ├── scaff_grp_0023.scaff_members.tsv
    │   ├── scaff_grp_0024.scaff_members.tsv
    │   ├── scaff_grp_0025.scaff_members.tsv
    │   ├── scaff_grp_0026.scaff_members.tsv
    │   ├── scaff_grp_0027.scaff_members.tsv
    │   ├── scaff_grp_0028.scaff_members.tsv
    │   ├── scaff_grp_0029.scaff_members.tsv
    │   ├── scaff_grp_0030.scaff_members.tsv
    │   ├── scaff_grp_0031.scaff_members.tsv
    │   ├── scaff_grp_0032.scaff_members.tsv
    │   ├── scaff_grp_0033.scaff_members.tsv
    │   ├── scaff_grp_0034.scaff_members.tsv
    │   ├── scaff_grp_0035.scaff_members.tsv
    │   ├── scaff_grp_0036.scaff_members.tsv
    │   ├── scaff_grp_0037.scaff_members.tsv
    │   ├── scaff_grp_0038.scaff_members.tsv
    │   ├── scaff_grp_0039.scaff_members.tsv
    │   ├── scaff_grp_0040.scaff_members.tsv
    │   └── scaff_grp_0041.scaff_members.tsv
    ├── thin_0_0
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   ├── main.bcf.csi
    │   └── sections
    │       ├── scaff_grp_0001.positions.tsv.gz
    │       ├── scaff_grp_0002.positions.tsv.gz
    │       ├── scaff_grp_0003.positions.tsv.gz
    │       ├── scaff_grp_0004.positions.tsv.gz
    │       ├── scaff_grp_0005.positions.tsv.gz
    │       ├── scaff_grp_0006.positions.tsv.gz
    │       ├── scaff_grp_0007.positions.tsv.gz
    │       ├── scaff_grp_0008.positions.tsv.gz
    │       ├── scaff_grp_0009.positions.tsv.gz
    │       ├── scaff_grp_0010.positions.tsv.gz
    │       ├── scaff_grp_0011.positions.tsv.gz
    │       ├── scaff_grp_0012.positions.tsv.gz
    │       ├── scaff_grp_0013.positions.tsv.gz
    │       ├── scaff_grp_0014.positions.tsv.gz
    │       ├── scaff_grp_0015.positions.tsv.gz
    │       ├── scaff_grp_0016.positions.tsv.gz
    │       ├── scaff_grp_0017.positions.tsv.gz
    │       ├── scaff_grp_0018.positions.tsv.gz
    │       ├── scaff_grp_0019.positions.tsv.gz
    │       ├── scaff_grp_0020.positions.tsv.gz
    │       ├── scaff_grp_0021.positions.tsv.gz
    │       ├── scaff_grp_0022.positions.tsv.gz
    │       ├── scaff_grp_0023.positions.tsv.gz
    │       ├── scaff_grp_0024.positions.tsv.gz
    │       ├── scaff_grp_0025.positions.tsv.gz
    │       ├── scaff_grp_0026.positions.tsv.gz
    │       ├── scaff_grp_0027.positions.tsv.gz
    │       ├── scaff_grp_0028.positions.tsv.gz
    │       ├── scaff_grp_0029.positions.tsv.gz
    │       ├── scaff_grp_0030.positions.tsv.gz
    │       ├── scaff_grp_0031.positions.tsv.gz
    │       ├── scaff_grp_0032.positions.tsv.gz
    │       ├── scaff_grp_0033.positions.tsv.gz
    │       ├── scaff_grp_0034.positions.tsv.gz
    │       ├── scaff_grp_0035.positions.tsv.gz
    │       ├── scaff_grp_0036.positions.tsv.gz
    │       ├── scaff_grp_0037.positions.tsv.gz
    │       ├── scaff_grp_0038.positions.tsv.gz
    │       ├── scaff_grp_0039.positions.tsv.gz
    │       ├── scaff_grp_0040.positions.tsv.gz
    │       └── scaff_grp_0041.positions.tsv.gz
    ├── thin_10_1
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    ├── thin_10_4
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    ├── thin_6_1
    │   ├── info
    │   │   ├── bcf_stats.txt
    │   │   ├── info.txt
    │   │   ├── positions.tsv.gz
    │   │   ├── sample_subset_file.txt
    │   │   ├── samples.txt
    │   │   └── scaffold_groups.tsv
    │   ├── main.bcf
    │   └── main.bcf.csi
    └── thin_6_4
        ├── info
        │   ├── bcf_stats.txt
        │   ├── info.txt
        │   ├── positions.tsv.gz
        │   ├── sample_subset_file.txt
        │   ├── samples.txt
        │   └── scaffold_groups.tsv
        ├── main.bcf
        └── main.bcf.csi

eriqande commented 2 years ago

More minor changes

I used the ruleorder directive to deal with the thin_0_0 cases. Yay! It seems to work all the way through the beagle-gl files. Yay!

Here is the command that does it:

snakemake -np --cores 8 --use-conda results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/beagle-gl.gz  --configfile .test/config/config.yaml

Which gives this:

job                                  count    min threads    max threads
---------------------------------  -------  -------------  -------------
bcftools_subsamp_and_filt_gather         1              1              1
bcftools_subsamp_and_filt_scatter       41              1              1
bcftools_thin_gather                     2              1              1
bcftools_thin_scatter                   82              1              1
beagle3_glikes_from_PL_scattered       123              1              1
gather_beagle3_glikes_from_PL            3              1              1
total                                  252              1              1

I need to allow some targets that are the individuals scaff groups beagle-gls, because they are marked temp and get deleted afterward. But we might want to use them later!
also, I should change ACGT to 1234 in the beagle output

eriqande commented 2 years ago

New Day

Gonna deal with these:

Scaff-specific beagle-gl's

So, for that one, my plan is to have a rule that cats the top-row and the body of each beagle.gz file.

here, is a test invocation:

snakemake -np --cores 8 --use-conda \
     results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/beagle-gl.gz  \
    results/beagle-gl/TEST-Biallelic-SNPS-5-fish-maf-0.1/thin_{0_0,10_5,6_3}/sections/scaff_grp_000{1..9}.beagle-gl.gz  \
     --configfile .test/config/config.yaml

[x] also, I should change ACGT to 1234 in the beagle output

eriqande commented 2 years ago

Running some pcangsd

pcangsd does not seem to come with the standard angsd conda distro. Rather it can be downloaded from github and installed. It seems to get installed into whatever conda environment one is using, so I can make a rule that installs it into our pcangsd conda env.

This only runs on the linux, so we have to test it on the cluster. I am going to go ahead and test it with the data set of 454 fish. I have a new line in the bcfs.tsv file for it with ID: TEST-Biallelic-SNPS-460-fish-maf-0.05.

I can generate the beagle.gz file for it like this on the cluster with a whole node checked out:

snakemake -p --cores 20 --use-conda      results/beagle-gl/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/beagle-gl.gz --configfile .test/config/config.yaml

The resulting beagle.gz file has 3168 variants in it. This should be a decent data set for making sure I have pcangsd up and running well.

When you run pcangsd on that here is what we learn:

by default it doesn't use all the markers, so it must be computing maf and cutting off at 0.05 by default. So, we will see if we can not do that.
There is not longer a --post-save option, so if you want to get genotype posteriors, it looks like you have to get the individual allele frequencies for each marker and each individual and use those to compute expected genotype freqs and use those as priors.
The last one involves reading in some numpy arrays. the R package the recommend for doing that threw a seg fault. What a load of BS! I think I will just flatten the goddam thing in python to a single vector, write it out, and then read it into R with scan(). Some code:
```
import numpy as np
pi=np.load("TEST-PCANGSD.pi.npy")
```

to see it as a pandas df you can do:

import pandas as pd df = pd.DataFrame(pi)

to flatten into something that has all the markers for sample 1, then all the markers

for sample 2, and so on, in a single vector, you can do:

piv=pi.flatten(order='F') piv=pi.flatten(order='F') # turns it into a vector


I'll play with these are find a quick, memory-reasonable way of computing the genotype posteriors and writing them
back, out to a beagle.gz file.

eriqande commented 2 years ago

There really is no --post_save in this new cythonized version of pcangsd. Perhaps because it is too hard to write it out?

Here is where the code for calling genotypes lives:

https://github.com/Rosemeis/pcangsd/blob/master/pcangsd/shared_cy.pyx#L100-L130

And it is pretty clear what one would have to do to make that output a big flat file of genotype posteriors, I think...

eriqande commented 2 years ago

Write out Genotype Posteriors

Having looked over the code in pcangsd, it is clear that they read the beagle geno likes directly into a numpy array and they toss the information about marker names and the alleles. So, if we are going to try to create a beagle file with posteriors, we won't have the locus information. Not to worry. We will write it out as a text file and then just make another layer in our workflow to reattach that locus information (accounting for any filtered sites...)

To do this, I have made a fork of the pcangsd repository. The plan is:

add the --post_save option back in there.
add a cython function to make a numpy array of the genotype posteriors.
in the python pcangsd.py script, make a Pandas data frame out of the numpy array
write that out as a tsv, preferable compressed.

Here is what a test run on that looks like:

BEAG=/home/eanderson/Documents/git-repos/mega-post-bcf-exploratory-snakeflows/results/beagle-gl/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/beagle-gl.gz

pcangsd -b $BEAG --geno 0.7  --pi_save -t 20 --post_save --out quick-test --minMaf 0.0

I have that working now and pulled it into main on my fork. But the file writing is tremendously slow!! It is slow whether I write it with numpy.savetxt directly, or if I turn it into a Pandas data frame. It is only 254 rows with 3168 markers, or so, and it takes 16 seconds. WTF??? That would be like 4.4 hours with 3 million markers. That seems ridiculously slow. Is there a way we could just write it from cython and have it be faster.

eriqande commented 2 years ago

I updated this by using C's fprintf() function to write the array to disk. It is much faster now.

In a perfect world, I would figure out how to read the marker names and sample names in and also write that out as a proper BEAGLE file to a pipe that gzipped it. But, alas, that is too much for now.

So, instead I write it as uncompressed TSV and then I will paste the marker names and cat the header names to it.

eriqande commented 2 years ago

Now, back to the mega-post-bcf-exploratory-snakeflows

Adding pcangsd stuff in there. At this point, here is one invocation for testing:

snakemake -np --cores 20 --use-conda      results/pcangsd/TEST-Biallelic-SNPS-460-fish-maf-0.05/thin_0_0/out.gposts.tsv  --configfile .test/config/config.yaml

WTF?

This returns reasonable values for the posteriors, but the covariance matrix is all NaNs and Infs. WTF? I am wondering if pcangsd does not deal well with very small numbers in exponential notation. I've noticed that the Beagle files used as examples in various ANGSD tutorials, and in the angsd website are just floats with 6 decimal precision. (i.e. %.6f). So, I will want to try to change mine to that.

But I also should play around, first, with dropping sites with more than 50% missing data (not that that should make any difference...but I should do that anyway.

I did both of those things and also changed the alleles to 0,1,2,3 instead of 1,2,3,4, and still the cov matrix is all infs and nans.

snakemake -np --cores 20 --use-conda      results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_0_0/out.gpost.tsv  --configfile .test/config/config.yaml

Then I did a bunch of exploring around. If I filter out any site that has a geno likelihood of 0.3333 it works. However, filtering sites by F_MISSING doesn't really work.

Then I went and looked at the BCF files and there are lots of sites with 1 or 2 reads for, say, the reference allele, but the phred-scaled genotype likelihoods are still 0,0,0. What is up with that. Has the BQSR completely trashed this, or is it actually trying to use sites with really low base quality scores? Completely lame.

eriqande commented 2 years ago

I also peered through the pcangsd code. The individual allele freqs are estimated separately and before the covariance matrix. So, it is conceivable that the indiv allele freqs are being calculated correctly, and then it is borking on the covariance matrix.

I can try calculating the covariance matrix just from the ngsTools method that is implemented (iter=0) to see if that is less problematic.
I should also try checking up on the estiamated MAFs, because that might be where something is going wrong (they are used in the calculation for the cov (and it is hard to see how one could get a Nan or in Inf in that calc...

eriqande commented 2 years ago

Working along this more, it appears that covariance_cy.updateDosages isa function that computes the posterior genotype probs but then uses them to write out the dosages. So that is no real help.

We still have to figure out why the covariance matrix is borked.

I output the mafs and genotype dosages to inspect. The estimated mafs actually have some MAFs of 1.0. That seems pretty messed up. I will go back to the BCF file and check it out.

Sure enough:

NC_056432.1 47853288    .   A   C   62394   PASS    AC=48;AF=0.827586;AN=58;BaseQRankSum=0.431;DP=1622;ExcessHet=0;FS=0;InbreedingCoeff=0.3872;MLEAC=955;MLEAF=0.995;MQ=59.99;MQRankSum=0;NMISS=25;QD=27.72;ReadPosRankSum=0.967;SOR=0.23;NS=29;MAF=0.172414;AC_Het=0;AC_Hom=48;AC_Hemi=0;HWE=2.27589e-06;ExcHet=1  GT:AD:DP:GQ:PGT:PID:PL:PS   1/1:0,4:4:12:.:.:124,12,0:. 1/1:0,7:7:21:.:.:210,21,0:. 1/1:0,7:7:21:.:.:193,21,0:. 1/1:0,3:3:9:.:.:99,9,0:.    1/1:0,2:2:6:.:.:68,6,0:.    0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,6:6:18:.:.:183,18,0:. 1/1:0,3:3:9:.:.:92,9,0:.    1/1:0,9:9:27:.:.:295,27,0:. 1/1:0,4:4:12:.:.:135,12,0:. 1/1:0,2:2:6:.:.:69,6,0:.    1/1:0,2:2:6:.:.:68,6,0:.    1/1:0,4:4:12:.:.:117,12,0:. 1/1:0,2:2:6:.:.:68,6,0:.    1/1:0,2:2:6:.:.:67,6,0:.    1/1:0,2:2:6:.:.:68,6,0:.    1/1:0,5:5:15:.:.:160,15,0:. 1/1:0,3:3:9:.:.:83,9,0:.    0/0:1,0:1:0:.:.:0,0,0:. 0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,3:3:9:.:.:101,9,0:.   0/0:1,0:1:0:.:.:0,0,0:. 0/0:1,0:1:0:.:.:0,0,0:. 1/1:0,4:4:12:.:.:133,12,0:. 1/1:0,5:5:15:.:.:145,15,0:. 1/1:0,3:3:9:.:.:82,9,0:.    1/1:0,8:8:24:.:.:228,24,0:. 1/1:0,3:3:9:.:.:96,9,0:.    1/1:0,4:4:12:.:.:135,12,0:.

That lovely locus is treated by bcftools as having a minor allele frequency of 0.1724, but all of those "minor" alleles are simply in reference homozygotes that have 0,0,0 for the genotype likelihoods. That is some messed up shit.

Clearly, if I am to filter these in the wgs workflow I should use MLEAF, but, sadly that is not computed on the fly by bcftools. What a disaster.

I think that perhaps the thing to do is sample selection with bcftools, and maybe some light filtering, but no filtering on MAFs. Then use angsd to compute the allele frequencies, and get a mafs.gz file from which I could then filter the beagle file on MAF just using awk or something. Of course, angsd treats the REF as the Major and Alt as the minor in that context, so I would want to filter on (AF > maf_cutoff & AF < 0.5) | (AF > 0.5 & AF < 1 - maf_cutoff). It seems then that I could send that off to pcangsd.

eriqande commented 2 years ago

It turns out that ANGSD itself is fricking SHIT-SHOW because it uses underscores to separate chromosomes from positions, so that a chromosome or scaffold with a name like NC_03451.1 is read as chromosome of NC followed by a position.

BUT, looking at the code for pcangsd, the MAF filtering is done correctly as a MAF filter:

maskMAF = (f >= args.minMaf) & (f <= (1 - args.minMaf))

So, we could do the MAF filtering in pcangsd and then reconstitute the marker names and beagle file as necessary with some awk and paste. Phew!!

eriqande commented 2 years ago

So, it looks like in order to make this stuff work with angsd we can change underscores in the locus names to dashes. We need to do that in the .fai file as well.

For now, I will just change things in the awk script that makes the beagle.gl.gz file.

eriqande commented 2 years ago

I wrote code to bung together a beagle posteriors file from the pcangsd output and the header of the beagle-gl file.

The invocation I am working on for testing this is currently:

snakemake -np --cores 20 --use-conda      results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_0_0/maf_0.04/beagle-post.gz  --configfile .test/config/config.yaml

That is working now. Had to tweak the pipefail stuff because I was using head, etc. Lame.

eriqande commented 2 years ago

PCA on a big bunch o' fish

I am now ready to test it on, let's say close to a million SNPs.

We have about 6.5 million snps in all-pass-maf-0.05 for the all-together yukon. So, I will just throw another line in the bcfs.tsv for them.

I am doing this in:

/home/eanderson/scratch/PROJECTS/mega-post-bcf-exploratory-snakeflows

test run:

snakemake -np --cores 20 --use-conda      results/pcangsd/Mondo-454-fish-maf-0.05/thin_6_1/maf_0.04/beagle-post.gz  --configfile .test/config/config.yaml

I ran that using the sedna slurm profile and it did great. Got to the pcangsd step, and then it botched---it couldn't find pcangsd. So I restarted it after removing the pcangsd install flag, and I ran it locally on a checked out core, and it had no further problems.

eriqande commented 2 years ago

doAsso stuff

Starting to figure out the ANGSD asso stuff. Just playing around with a 10K marker set. Here is what I was doing (re-tracing steps that I did not record at the time, a week ago, before vacation...)

# this is working:
angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv  -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4

# currently in: /home/eanderson/scratch/PROJECTS/mega-post-bcf-exploratory-snakeflows/play

# here is what the files look like:
(angsd-0.937) [sedna: play]--% zcat beag-10k-post.gz | head | awk 'BEGIN {OFS="\t"} {print $1, $2, $3, $4, $5, $6, $7, $8, $9}'
marker  allele1 allele2 T199967 T199967 T199967 T199968 T199968 T199968
NC-056429.1_523 2   3   1.5930e-03  7.6638e-02  9.2177e-01  0.0000e+00  2.6336e-03  9.9737e-01
NC-056429.1_3384    1   3   0.0000e+00  9.9571e-01  4.2881e-03  0.0000e+00  1.1130e-01  8.8870e-01
NC-056429.1_9249    3   2   0.0000e+00  1.0315e-02  9.8969e-01  8.0988e-03  1.6379e-01  8.2811e-01
NC-056429.1_16498   0   2   3.5444e-03  1.1198e-01  8.8448e-01  0.0000e+00  3.2561e-02  9.6744e-01
NC-056429.1_26537   0   1   7.5160e-01  2.4840e-01  0.0000e+00  4.7456e-05  9.9995e-01  4.7189e-10
NC-056429.1_29392   0   2   1.1095e-05  9.9999e-01  7.6503e-10  8.7806e-01  1.2192e-01  2.1214e-05
NC-056429.1_34425   2   0   2.2667e-02  2.5578e-01  7.2156e-01  0.0000e+00  1.5535e-01  8.4465e-01
NC-056429.1_44032   3   1   0.0000e+00  2.9467e-02  9.7053e-01  0.0000e+00  1.3273e-02  9.8673e-01
NC-056429.1_51167   1   3   0.0000e+00  1.9507e-02  9.8049e-01  0.0000e+00  6.9427e-02  9.3057e-01

# the fai file has to have the angsd-ified chrom names, too---underscores are turned to dashes
(angsd-0.937) [sedna: play]--% head genome.fasta.fai
NC-056429.1 95801903    117 80  81
NC-056430.1 75537397    96999661    80  81
NC-056431.1 81554755    173481393   80  81
NC-056432.1 75613410    256055700   80  81
NC-056433.1 90317969    332614395   80  81
NC-056434.1 80340224    424061456   80  81
NC-056435.1 77130101    505406050   80  81
NC-056436.1 91267893    583500395   80  81
NC-056437.1 86417572    675909254   80  81
NC-056438.1 84826039    763407163   80  81

# the samples file is in this format:
# it turns out that all the covariates, including discrete ones like sex have to be integers,
# it appears.  It bombed when I used Ms and Fs.
(angsd-0.937) [sedna: play]--% head dot_samples2.tsv
ID  age sex PC1 PC2 PC3 PC4
0   B   D   C   C   C   C
T199967 1   1   0.000466737846601796    -0.0355754261982327 0.0410549697242467  -0.0956605172102598
T199968 1   0   -0.0485469374738537 0.00278541349388417 0.0693590254462809  0.0654561161382242
T199969 1   1   0.0369939568333824  -0.0710541710300985 -0.0175538262788173 0.0121487002814868
T199970 1   0   -0.0434925140427411 0.00378501130744979 0.130358439300395   0.0609703909488089
T199971 1   0   -0.0217766288507172 -0.0214415058343114 0.0763449795915192  -0.181235243692794
T199972 1   0   0.0450978210333021  -0.0309180469622904 -0.00769339107871081    0.0243643715339333
T199973 1   0   -0.0546464166951503 0.0121567360058559  -0.0774412068776019 -0.00794532933579045
T199974 1   0   -0.0491440899457988 0.0129302261721018  -0.0109094686470035 0.00193556110694826

Further questions!

After that running successfully, I find myself wondering if the dot_samples file can:

Be used to include just a subset of the samples in the whole genotype file
Have samples in a different order than in the genotypes file.

We can experiment with those two things here and now.

First, rerun what we did before:

(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv  -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4
    -> angsd version: 0.937 (htslib: 1.15.1) build(Mar  3 2022 00:51:08)
    -> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_samples2.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4
    -> No '-out' argument given, output files will be called 'angsdput'
    -> Inputtype is beagle
    -> Printing at chr: NC-056429.1 pos:34484806 chunknumber 200 contains 50 sitesDone reading beagle
    -> Done reading data waiting for calculations to finish
    -> Done waiting for threads
    -> Output filenames:
        ->"angsdput.arg"
        ->"angsdput.lrt0.gz"
    -> Sat Jul 30 08:43:20 2022
    -> Arguments and parameters for all analysis are located in .arg file
    -> Total number of sites analyzed: 10000
    -> Number of sites retained after filtering: 10000
    [ALL done] cpu-time used =  140.11 sec
    [ALL done] walltime used =  140.00 sec

Then we try it with just the first 30 individuals:

(angsd-0.937) [sedna: play]--% head -n 32 dot_samples2.tsv > first30_dot_samples.tsv
(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile first30_dot_samples.tsv  -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out first30
    -> angsd version: 0.937 (htslib: 1.15.1) build(Mar  3 2022 00:51:08)
    -> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile first30_dot_samples.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out first30
    -> Inputtype is beagle
The number of sequenced individuals (454) does not match the number of phenotypes (30)

OK, that is pretty definitive. The number of seqenced individuals has to match the number of phenotypes.

What about if they are in a different order? Will it be able to work that out? I doubt it, but we can try...

# put them in a different order than they are in the genotypes file:
(head -n 2 dot_samples2.tsv; awk 'BEGIN{OFS="\t"} NR>2 {print rand(),$0}' dot_samples2.tsv  | sort -n -b -k 1 | cut -f 2-) > dot_scrambled.tsv

# then try running it:
(angsd-0.937) [sedna: play]--% angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_scrambled.tsv  -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out scrambled
    -> angsd version: 0.937 (htslib: 1.15.1) build(Mar  3 2022 00:51:08)
    -> angsd -doMaf 4 -doAsso 4 -beagle beag-10k-post.gz -fai genome.fasta.fai -sampleFile dot_scrambled.tsv -whichPhe age -whichCov sex,PC1,PC2,PC3,PC4 -out scrambled
    -> Inputtype is beagle
    -> Printing at chr: NC-056429.1 pos:34484806 chunknumber 200 contains 50 sitesDone reading beagle
    -> Done reading data waiting for calculations to finish
    -> Done waiting for threads
    -> Output filenames:
        ->"scrambled.arg"
        ->"scrambled.lrt0.gz"
    -> Sat Jul 30 08:58:32 2022
    -> Arguments and parameters for all analysis are located in .arg file
    -> Total number of sites analyzed: 10000
    -> Number of sites retained after filtering: 10000
    [ALL done] cpu-time used =  139.58 sec
    [ALL done] walltime used =  140.00 sec

# So, that works.  Let's verify that the results are the same:
(angsd-0.937) [sedna: play]--% zcat angsdput.lrt0.gz | head
Chromosome  Position    Major   Minor   Frequency   N   LRT beta    SE  high_WT/HE/HO   emIter
NC-056429.1 523 G   T   0.949794    454 2.013597    0.637792    0.402173    0/14/363    2
NC-056429.1 3384    C   T   0.313106    454 0.134049    -0.085042   0.224399    84/42/5 1
NC-056429.1 9249    T   G   0.930452    454 4.491319    0.877984    0.375863    0/22/308    3
NC-056429.1 16498   A   G   0.939332    454 4.391099    0.869978    0.395756    0/15/310    4
NC-056429.1 26537   A   C   0.131353    454 1.801782    0.370892    0.286734    234/38/0    2
NC-056429.1 29392   A   G   0.169399    454 0.611648    0.177085    0.228759    215/30/3    1
NC-056429.1 34425   G   A   0.880862    454 1.208272    0.341601    0.308311    0/40/291    2
NC-056429.1 44032   T   C   0.943822    454 5.188045    0.963013    0.406805    0/16/327    2
NC-056429.1 51167   C   T   0.922563    454 7.556225    0.981645    0.362819    0/27/319    3
(angsd-0.937) [sedna: play]--% zcat scrambled.lrt0.gz | head
Chromosome  Position    Major   Minor   Frequency   N   LRT beta    SE  high_WT/HE/HO   emIter
NC-056429.1 523 G   T   0.949794    454 0.001338    -0.016129   0.462097    0/14/363    1
NC-056429.1 3384    C   T   0.313106    454 0.437107    -0.126122   0.192100    84/42/5 1
NC-056429.1 9249    T   G   0.930452    454 0.203409    0.170942    0.380376    0/22/308    2
NC-056429.1 16498   A   G   0.939332    454 0.092406    -0.122712   0.449788    0/15/310    1
NC-056429.1 26537   A   C   0.131353    454 1.180838    0.303640    0.294147    234/38/0    2
NC-056429.1 29392   A   G   0.169399    454 0.030809    -0.037865   0.208772    215/30/3    1
NC-056429.1 34425   G   A   0.880862    454 0.693206    -0.221533   0.293014    0/40/291    1
NC-056429.1 44032   T   C   0.943822    454 0.999333    -0.427150   0.430205    0/16/327    2
NC-056429.1 51167   C   T   0.922563    454 1.679021    -0.473920   0.381925    0/27/319    2

So, those LRT results are nowhere near the same. I suspect that ANGSD does not check the order of the samples in the two different files. Good to know.

eriqande commented 2 years ago

Beagle Slicer

Here is the command for using my beagle-slicer gawk script:

(base) /mega-post-bcf-exploratory-snakeflows/--% (main) pwd
/Users/eriq/Documents/git-repos/mega-post-bcf-exploratory-snakeflows
(base) /mega-post-bcf-exploratory-snakeflows/--% (main) mkdir -p testy

gzcat fracto-beagle-post.gz | cat .test/bcf/test.bcf.scaff_groups.tsv - | gawk -v path=testy -f workflow/scripts/beagle-slicer.awk

That seems to be working....

Incorporating Beagle Slicer into the snakeflow

Working on testing this by going for thinning by 10000. A typical file that I might want would be

# "results/pcangsd/{{bcf_id}}/thin_{{thin_int}}_{{thin_start}}/maf_{{min_maf}}/sections/{asg}-beagle-post.gz"
results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_10000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz

# so that this is a good way to dry run it:
snakemake -np --cores 20 --use-conda  results/pcangsd/TEST-Biallelic-SNPS-30-fish-maf-0.05/thin_10000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz   --configfile .test/config/config.yaml

When thinning by 10,000, there are some scaffold groups that end up with no markers. And that causes the beagle slice rule to fail. So, I did it thinning by 1000.

Oh wait! The problem might be that I am using the test data set which actually might not have representatives from all scaffold groups, or something like that. Oops.

So, try this:

 snakemake -p  --cores 20 --use-conda  results/pcangsd/Mondo-454-fish-maf-0.05/thin_5000_1/maf_0.05/sections/scaff_grp_0001-beagle-post.gz   --configfile .test/config/config.yaml

And that ran to completion.

Now, it looks like I just need a rule to run angsd doAsso on each of those.

eriqande commented 2 years ago

ANGSD doAsso

I ran the whole thing through to the end of the asso with:

snakemake -p --cores 20  --use-conda --configfile .test/config/config.yaml

The required outputs were in rule all. That was for thinning by 6. But now I want to run it on the whole data set.

That will be chronicled in the notebook at:

/Users/eriq/Documents/projects/yukon-chinookomes/005-association-round-two.Rmd

eriqande commented 2 years ago

Total Config Overhaul

I have grown completely disenchanted with configuring runs in a TSV file. There are some things that really work poorly for that!! So, I am going to:

Configure it all from a YAML
Start that YAML with a definitions section in which we define BCFS, SampleSubsets, and bcftools filtering options.
Then have an "Actions" section in the YAML in which we say what we want done and how. It uses the nicknames/keys from the definitions, and effectively will just define target output files that we want.

I am starting this on the bcf_samps_and_filt_scatter rule, so our test target can be:

snakemake --use-conda -np results/bcf_testy/filt_snps05/yc30/thin_0_0/sections/scaff_grp_0001.bcf --configfile .test/config/config.yaml

And after that worked, we go for the gather rule after that.

snakemake --use-conda -np  results/bcf_testy/filt_snps05/yc30/thin_0_0/main.bcf --configfile .test/config/config.yaml

And once that was working, I revamped the thinning rules so that we can now test with:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/{yc30,yc_males,yc_females}/thin_6_1/main.bcf --configfile .test/config/config.yaml

Then, as I started working on the stuff in formats.smk, I can use:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/{yc30,yc_males}/{thin_6_1,thin_0_0}/beagle-gl/beagle-gl.gz --configfile .test/config/config.yaml

Things are all done up through formats.smk at this point.

Now I just need to revamp the pcangsd stuff....

Working on the pcangsd stuff I have it through:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/yc30/thin_0_0/pcangsd/maf_0.05/beagle-post.gz --configfile .test/config/config.yaml

And I will check to see if I can get the sliced beagle file:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/yc30/thin_0_0/pcangsd/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config/config.yaml

I have taken this as far as I can go at the moment. I will pull it all into the main branch now.

WHAT REMAINS

I just need to add a few more things. Note that the bcf section of the config can look like this:

bcf:
  testy:
    path: ".test/bcf/test.bcf"
    genome_path: resources/test.bcf.genome.fasta
    scaff_group_path: .test/bcf/test.bcf.scaff_groups.tsv
    sample_subsets:
      yc30:
        path: .test/config/sample_subsets/yc30.txt     # 30 of the Yukon Chinook
      yc454:
        path: .test/config/sample_subsets/yc454.txt   # all 454 Yukon River Chinook
        dotsample: 
      yc_males:
        path: .test/config/sample_subsets/yc_males.txt  # just the males
      yc_females:
        path: .test/config/sample_subsets/yc_females.txt  # just the females

The key point there is that we can have a dotsample for it. This should be the dotsample with all the phenotype information and the covariate information except for the principal components.

My plan is to have a rule that will attach a certain number of PCs to that file, which can then be used in the association step.
Additionally, we need a rule that creates the angsd-specific fai file. (Replacing underscores with dashes in the chromosome names).
Also, a rule to make the PCA plots would be nice.

I think this is far enough now that I can make some PCAs of the rockfish data.

eriqande commented 2 years ago

Continuing with the Overhaul

So, I am going to require that, although multiple BCF files can be specified, they all need to apply to a single genome (fai) and thus, also, a single scaffold group file.
For now, I remove the genome_path, because really all we need at this point is the fai path.
I made a rule to make an ANGSD version of the fai file.

Now, some other major overhauls

Whittling the bcf file down to 100 individuals so it is only about 5 Mb.
Making a dot-sample file for ALL, females, and males. This will not include the PCs.
Then I will make a rule to attach those PCs.

Having overhauled the .test directory structure pretty extensively, we can now run/test this thing like this:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/{all,males,females}/thin_0_0/pcangsd/maf_0.05/sections/scaff_grp_0001-beagle-post.gz --configfile .test/config.yaml

I committed the changes and ran it on SEDNA and the above ran perfectly.

OK! I am now in a perfect position to start overhauling the do-asso stuff to make it more extensible from within the config file actions sections. This will take some thought. I also want to include plotting of manhattan plots and also the pca bi-plots, but maybe that is better left for interactive work....

At any rate, next up is:

make a rule that will attach a certain number of PCs to that file, which can then be used in the association step.

eriqande commented 2 years ago

Attaching PCs

This is in a rule called attach_PCs_to_dotsamples

Here is a target file for it:

results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv

So, we can try to give it a run:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv --configfile .test/config.yaml

That checks out on the dry-run, but I will have to run it on the cluster, cuz I haven't yet been able to compile pcangsd up on the mac.

So, on SEDNA, we try:

snakemake  -np  --use-conda  --use-envmodules results/bcf_testy/filt_snps05/all/thin_0_0/pcangsd/maf_0.05/dotsample_PCs_12.tsv --configfile .test/config.yaml

That is all working. So, now it is on to the next step.

A Config Action Setting for Association with ANGSD

This turns out to be somewhat complicated, but I think I have something worked out now.

Here is the command that will get it going:

snakemake --use-conda -np  results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/sections/scaff_grp_0001.lrt0.gz  --configfile .test/config.yaml

And I ran it on SEDNA with this:

snakemake --cores 20  --use-conda --use-envmodules  results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/sections/scaff_grp_0001.lrt0.gz  --configfile .test/config.yaml

And it was successful.

Adding the gather step for do_asso

I added that and now I can do this on SEDNA:

snakemake --cores 20  --use-conda --use-envmodules  results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/agephe_cohort_and_four_pcs_as_covariates/all-scaff-groups.lrt0.gz  --configfile .test/config.yaml

That is all working great.

Adding example configs

I am going to add a directory of example configs that will be a good place for me to store my configs that I use for particular projects, and thus keep them synced up.

Target Section of the Config

I have a function expand_targets() that reads the "targets" section of the config to request output files in rule all.

Here is how it is described:

#### PARAMETERS SECTION
# Here we can give some short names for different sets of parameters
# that we will use later for particular ultimate targets

# these are all the standard parameters in the file paths
main_params:
  standard:
    bcf: testy
    filt: snps05
    thin_spec: "0_0"
    maf: 0.05

# Here we provide different sets of parameters for particular
# analyses.
params:
  do_asso:
    age_cohort_4PCs:
      angsd_opts: " -doMaf 4 -doAsso 4 "
      whichPhe: " age "
      whichCov: " cohort,PC-1,PC-2,PC-3,PC-4 "

#### TARGETS SECTION

# here we have a shorthand for describing target outputs.
# Each output is defined by a triple: [A, B, C] that lives under
# a heading which says which kind of analysis is desired.
# where 
#   - A is a nickname from main_params
#   - B is a nickname for a subsample that is appropriate to the BCF in A
#   - C is a nickname for the parameters defined in params for the particular analysis
#
# The idea here is that each of these triples can be used to define a file name
# or possibly multiple filenames.

targets:
  do_asso:
    - ["standard", "all", "age_cohort_4PCs"]
    - ["standard", "males", "age_cohort_4PCs"]
    - ["standard", "females", "age_cohort_4PCs"]

As of this writing, I can do a dry run on the test config like this:

snakemake -np --use-conda --configfile .test/config.yaml

And it figures out from rule all that the requested outputs are:

localrule all:
    input: results/bcf_testy/filt_snps05/all/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz, results/bcf_testy/filt_snps05/males/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz, results/bcf_testy/filt_snps05/females/thin_0_0/do_asso/maf_0.05/age_cohort_4PCs/all-scaff-groups.lrt0.gz

which is pretty rad. It then figures out that it has 348 jobs to do.

Should that be in the config?

It might be better to just have the workflow read another yaml file called, say, "targets.yaml" whose location is defined in the config. With the new versions of Snakemake, any change in the config file might trigger re-runs of everything...