Jerrythafast / FDSTools

Data analysis tools for Massively Parallel Sequencing of forensic DNA markers, including tools for characterisation and filtering of PCR stutter artefacts and other systemic noise, and for automatic detection of the alleles in a sample.
GNU General Public License v3.0

How to configure the settings file for the DYS389II locus? #3

Open siyao opened 2 weeks ago

siyao commented 2 weeks ago

Hi Jerry,

I am using FDSTools v2.1.0. While analyzing the DYS389II locus, I set up the flank and repeat parameters, but the DYS389II results are mixed up with those of DYS389I, which makes it impossible to identify the true genotype. Additionally, the N48 in the DYS389 sequence structure [TAGA]a [CAGA]b N48 [TAGA]c [CAGA]d was not recognized. My library file, input.ini and the result files are attached: file.zip

I am using the command fdstools pipeline input.ini.

Could you offer some advice on how to best configure the settings for the DYS389II locus? Thank you.

Siyao

Jerrythafast commented 2 weeks ago

Hi Siyao,

Short answer: the solution is to remove the [repeat] definition for DYS389II from your library file. FDSTools will then use ATAG and ACAG to shorten the repeats, so I recommend also adjusting the [genome_position] ranges to target these repeats. In fact, this is everything you need in your library file (the other sections are unnecessary):

[genome_position]
DYS389I  = Y,12500447,12500494
DYS389II = Y,12500447,12500610
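
To sanity-check ranges like these before running the pipeline, a small script along the following lines may help. It is only a sketch and not part of FDSTools: it reads the [genome_position] section with Python's standard configparser and prints the length of each range. The helper names (read_genome_positions, span_length) are made up for illustration, and the length calculation assumes that both coordinates are inclusive.

```python
import configparser

# The two entries from the library file above.
LIBRARY_TEXT = """
[genome_position]
DYS389I  = Y,12500447,12500494
DYS389II = Y,12500447,12500610
"""


def read_genome_positions(text):
    """Return {marker: (chromosome, start, end)} from a [genome_position] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep marker names case-sensitive
    parser.read_string(text)
    positions = {}
    for marker, value in parser["genome_position"].items():
        chrom, start, end = (part.strip() for part in value.split(","))
        positions[marker] = (chrom, int(start), int(end))
    return positions


def span_length(position):
    """Length of a range, assuming start and end are both inclusive."""
    _, start, end = position
    return end - start + 1


positions = read_genome_positions(LIBRARY_TEXT)
for marker, position in positions.items():
    print(marker, position, span_length(position))
```

Under the inclusive-coordinate assumption, this reports 48 positions for DYS389I and 164 for DYS389II. Both ranges start at the same coordinate, so the DYS389I range lies entirely within the DYS389II range.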

Long answer and explanation: For almost all STR targets in the human genome (including DYS389), FDSTools (technically, STRNaming) has a built-in repeat definition, so you usually don't need the [repeat] section in your library file. This built-in bracketing mechanism is recommended by the DNA commission of the ISFG in this paper. The [repeat] section is meant for providing a repeat definition for a non-human target, or for those rare instances where no built-in definition exists yet. Its only current limitation is that it does not recognize repeat interruptions of more than 20 nucleotides (like the N48 in DYS389II). The [flanks] and [block_length] sections exist for similar reasons.

Kind regards, Jerry

siyao commented 2 weeks ago

Okay, I see. Another question: if I provide the flanking sequence, can FDSTools directly detect SNPs in the flanking sequence?

Jerrythafast commented 2 weeks ago

Yes, that's also perfectly possible. For example, the built-in library for the ForenSeq DNA Signature Prep kit contains these somewhat larger ranges so that it can also detect and report sequence variation outside of the main ATAG[n]ACAG[m] repeat:

[genome_position]
DYS389I     =  Y,  12500387,  12500513
DYS389II    =  Y,  12500436,  12500633

It will apply bracketing to some additional (secondary) repeats in this locus, but any sequence variation beyond that will be reported as SNP calls that look like +12A>G.
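
To see how much extra sequence the ForenSeq ranges cover compared with the minimal ranges from my earlier reply, the difference can be worked out with a few lines of Python. This is just illustrative arithmetic on the coordinates quoted in this thread, not something FDSTools does or requires; the names MINIMAL, FORENSEQ and extra_flank are made up for the example.

```python
# Ranges quoted in this thread, as (start, end) on chromosome Y.
MINIMAL = {
    "DYS389I": (12500447, 12500494),
    "DYS389II": (12500447, 12500610),
}
FORENSEQ = {
    "DYS389I": (12500387, 12500513),
    "DYS389II": (12500436, 12500633),
}


def extra_flank(minimal, extended):
    """Return how many positions the extended range adds (before, after) the minimal one."""
    before = minimal[0] - extended[0]
    after = extended[1] - minimal[1]
    return before, after


for marker in MINIMAL:
    before, after = extra_flank(MINIMAL[marker], FORENSEQ[marker])
    print(f"{marker}: {before} extra positions before the range, {after} after")
```

For DYS389I the wider range adds 60 positions at the lower-coordinate side and 19 at the higher-coordinate side; for DYS389II it adds 11 and 23. Variation in those extra positions is what can show up as bracketed secondary repeats or as SNP calls.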

Jerry

siyao commented 2 weeks ago

Jerry, thank you for your answer.