Phylo pipeline example not working

Hi,

Can you attach your '000_makefile.yaml' file? The error you're getting suggests that you have edited/entered the information in the wrong place, but it is impossible for me to tell what exactly has gone wrong from the error message itself.

Best regards, Mikkel

On Fri, Nov 30, 2018 at 3:47 PM teepean notifications@github.com wrote:

Hello,

I have been experimenting with PALEOMIX and have had success with the BAM pipeline.

I decided to run the phylogenetic pipeline, but at the step `$ phylo_pipeline genotype+msa+phylogeny 000_makefile.yaml`

I get an error message:

Reading makefile(s):

Error reading makefiles:

MakefileError:

Makefile requirement not met at 'root:chimpanzee':

Expected value(s): key in 'Project', 'PhylogeneticInference', 'Genotyping', 'PAML', or 'MultipleSequenceAlignment'

Observed value(s): 'chimpanzee'

Observed type: str
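The `root:chimpanzee` path in this error suggests that `chimpanzee` was read as a top-level key, which usually points to lost indentation. As a rough way to locate such keys, the sketch below scans a makefile for unindented keys that are not among the five names listed in the error message. It is only a heuristic written for illustration, not PALEOMIX's own validation; the function name `find_unexpected_root_keys` and the example text are hypothetical and are not taken from the makefile in this thread.

```python
# Top-level keys named in the error message above.
EXPECTED_ROOT_KEYS = {
    "Project",
    "PhylogeneticInference",
    "Genotyping",
    "PAML",
    "MultipleSequenceAlignment",
}


def find_unexpected_root_keys(makefile_text):
    """Return (line_number, key) pairs for unindented keys that are not expected.

    Heuristic only: looks at raw indentation and does not parse YAML.
    """
    unexpected = []
    for line_number, line in enumerate(makefile_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines, comments, list items, and anything indented.
        if not stripped or stripped.startswith(("#", "-")) or line[0] in " \t":
            continue
        key, separator, _ = stripped.partition(":")
        if separator and key not in EXPECTED_ROOT_KEYS:
            unexpected.append((line_number, key))
    return unexpected


# Hypothetical example in which 'chimpanzee' has lost its indentation.
example = """\
Project:
  Title: ExampleProject
  Samples:
    bonobo:
      Sex: NA
chimpanzee:
  Sex: NA
"""

print(find_unexpected_root_keys(example))  # [(6, 'chimpanzee')]
```

Running the same function on the contents of `000_makefile.yaml` would list the line numbers where a sample or option ended up at the root level.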

Has something changed since the example was made or what might be causing the issue?

Here's the produced Makefile:

`# -*- mode: Yaml; -*-

Project:

Title: ExampleProject

List of samples to be included in the analytical steps, which may be grouped using any arbitrary number of (levels of) groups. (Sub)groups are not required, but may be used instead of listing individual samples in 'ExcludeSamples' and 'FilterSingletons'.

Samples:

:

bonobo:

Sex: NA

chimpanzee:

Sex: NA

gorilla:

Sex: NA

rCRS:

Sex: NA

GenotypingMethod: Reference Sequence

sumatran_orangutan:

Sex: NA

white_handed_gibbon:

Sex: NA

Specifies a set of regions of interest, each representing one or more named regions in a reference sequence (e.g. genes) in BED format.

RegionsOfInterest: protein_coding.CDS:

Name of the prefix; is expected to correspond to the filename

of the FASTA file without the extension / the name of the

prefix used in the BAM pipeline.

Prefix: rCRS

If true, BAM files are expected to have the postfix ".realigned";

allows easier interoperability with the BAM pipeline.

Realigned: yes

Specifies whether or not the sequences are protein coding; if true

indels are only included in the final sequence if the length is

divisible by 3.

ProteinCoding: yes

Do not include indels in final sequence; note that indels are still

called, and used to filter SNPs. Requires that the option

'MultipleSequenceAlignment' is enabled

IncludeIndels: yes

List of contigs for which heterozygous SNPs should be filtered

(site set to 'N') based on sex; All sexes used in the 'Samples'

section must be listed:

HomozygousContigs:

NA:

NC_012920_1

Filter sites in a sample, replacing any nucleotide not observed in the specified list of samples or groups with 'N'.

FilterSingletons:

NAME_OF_SAMPLE:

- - NAME_OF_SAMPLE

Genotyping:

Default settings for all regions of interest

Defaults:

Regions of interest are expanded by this number of bases when calling

SNPs, in order to ensure that adjacent indels can be used during

filtering

(VCF_filter --min-distance-to-indels and --min-distance-between-indels).

The final sequences do not include the padding.

Padding: 10

By default, each set of regions of interest is genotyped separately,

even if these overlap. By setting this option to true, the entire prefix

is genotyped once, and all regions of interest are extracted from this.

This can only be done for prefixes that only use genotyping defaults.

GenotypeEntirePrefix: no

Settings for genotyping by random sampling of nucleotides at each site

Random:

Min distance of variants to indels

--min-distance-to-indels: 2

MPileup:

-E: # extended BAQ for higher sensitivity but lower specificity

-A: # count anomalous read pairs

BCFTools:

-g: # Call genotypes at variant sites

VCF_Filter:

Maximum coverage acceptable for genotyping calls; if set to zero, the

default vcf_filter value is used; if set to 'auto', the MaxDepth value

will be read from the depth histograms generated by the BAM pipeline.

MaxReadDepth: auto

Minimum coverage acceptable for genotyping calls

--min-read-depth: 6

Min RMS mapping quality

--min-mapping-quality: 10

Min QUAL score (Phred) for genotyping calls

--min-quality: 30

Min distance of variants to indels

--min-distance-to-indels: 2

Min distance between indels

--min-distance-between-indels: 10

Min P-value for strand bias (given PV4)

--min-strand-bias: 1.0e-4

Min P-value for baseQ bias (given PV4)

--min-baseq-bias: 1.0e-4

Min P-value for mapQ bias (given PV4)

--min-mapq-bias: 1.0e-4

Min P-value for end distance bias (given PV4)

--min-end-distance-bias: 1.0e-4

Max frequency of the major allele at heterozygous sites

--min-allele-frequency: 0.2

Minimum number of alternative bases observed for variants

--min-num-alt-bases: 2

Add / overwrite default settings for a set of regions

NAME_OF_REGIONS: ...

MultipleSequenceAlignment:

Default settings for all regions of interest

Defaults:

Enabled: yes

Multiple sequence alignment using MAFFT

MAFFT:

Select alignment algorithm; valid values are 'mafft', 'auto', 'fft-ns-1',

'fft-ns-2', 'fft-ns-i', 'nw-ns-i', 'l-ins-i', 'e-ins-i', and 'g-ins-i'.

Algorithm: G-INS-i

Parameters for mafft algorithm; see above for example of how to specify

--maxiterate: 1000

Add / overwrite default settings for a set of regions

NAME_OF_REGIONS: ...

PhylogeneticInference:

ProteinCodingGenes:

Exclude (groups of) samples from this analytical step

ExcludeSamples: - - NAME_OF_SAMPLE

Root the final tree(s) on one or more samples; if no samples

are specified, the tree(s) will be rooted on the midpoint(s)

RootTreesOn: - - NAME_OF_SAMPLE

If 'yes', a tree is generated per named sequence in the areas of

interest; otherwise a super-matrix is created from the combined set

of regions specified below.

PerGeneTrees: no

Which Regions Of Interest to build the phylogeny from.

RegionsOfInterest: protein_coding.CDS:

Partitioning scheme for sequences: Numbers specify which group a
 # position belongs to, while 'X' excludes the position from the final
 # partitioned sequence; thus "123" splits sequences by codon-positions,
 # while "111" produces a single partition per gene. If set to 'no',
 # a single partition is used for the entire set of regions.
 Partitions: "112"
 # Limit analysis to a subset of a RegionOfInterest; subsets are expected to be
 # located at <genome root>/<prefix>.<region name>.<subset name>.names, and
 # contain single name (corresponding to column 4 in the BED file) per line.
SubsetRegions: SUBSET_NAME

ExaML:

Number of times to perform full phylogenetic inference

Replicates: 1

Number of bootstraps to compute

Bootstraps: 100

Model of rate heterogeneity (GAMMA or PSR)

Model: GAMMA

`
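To illustrate the `Partitions` option described in the makefile comments above, here is a small sketch of how a scheme such as "112" or "123" could assign the positions of a coding sequence to groups, with 'X' dropping a position. It follows only the wording of the comment; it is not PALEOMIX's implementation, and the function name `partition_sequence` is hypothetical.

```python
def partition_sequence(sequence, scheme):
    """Split a sequence into groups according to a repeating partition scheme.

    Each character of `scheme` names the group for the matching position in
    the repeating unit (e.g. a codon); 'X' excludes that position entirely.
    """
    groups = {}
    for index, nucleotide in enumerate(sequence):
        label = scheme[index % len(scheme)]
        if label.upper() == "X":
            continue  # position excluded from the partitioned sequence
        groups.setdefault(label, []).append(nucleotide)
    return {label: "".join(chars) for label, chars in groups.items()}


# "112": codon positions 1 and 2 share a partition, position 3 gets its own.
print(partition_sequence("ATGGCC", "112"))  # {'1': 'ATGC', '2': 'GC'}

# "123": one partition per codon position.
print(partition_sequence("ATGGCC", "123"))  # {'1': 'AG', '2': 'TC', '3': 'GC'}

# "11X": third codon positions are excluded.
print(partition_sequence("ATGGCC", "11X"))  # {'1': 'ATGC'}
```

With `Partitions: "112"`, as in the makefile above, the first two codon positions would be modelled together and the third separately.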


MikkelSchubert / paleomix

Phylo pipeline example not working #20
