dblhlx / PyBSASeq

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method
GNU General Public License v3.0
30 stars 12 forks source link

does the order of parents/bulks matter? #11

Closed adriludwig closed 6 months ago

adriludwig commented 8 months ago

Hi! It's me again! I finally have the variant call against the genome of one of the parents. If I understood well I need to add a parameter --parent parent_genome.fasta, right? Then my question is, does it matter the order of samples in the tsv file? Should the high/low order be the same in parent and bulks tsv? Thanks so much!

dblhlx commented 8 months ago

I used the dataset from the link https://www.ebi.ac.uk/ena/browser/view/PRJEB27629 for code testing. The genomes of both parents were sequenced in this study. I first performed variant calling using GATK4 and the two parental sequencing files, and then convert the resulting vcf file to a tsv file. I normally name this file as parents.tsv, but you can use whatever name you like. Then I do the same for the two bulk sequencing files and generate the bulks.tsv file. Theoretically, the SVs (structural variants like SNPs and InDels) in parents.tsv should be the same as those in bulks.tsv, but there always is some difference between the two tsv files. I use the common SV dataset for BSA-Seq data analysis to filter out some noises. The pyBSASeq.py script can generate the common SV dataset if two input tsv files are provided.

If you sequenced both of the parental lines, please do the same as described above. Then use the command below to perform BSA-Seq data analysis: python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha The file parents.tsv should be placed before bulks.tsv.

If you sequenced the genomes of only the bulks, please use the command below to perform BSA-Seq data analysis: python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha

If you sequenced the genomes of both of the bulks and one of the parents, please use the command below to perform BSA-Seq data analysis: python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha Code modification of pyBSASeq.py is needed to take advantage of the single parental sequencing data in this case.

The --parent option is for a very special situation. You can use this option only if the conditions below are satisfied:

  1. For a specific species, the sequence of one of its line is used as the reference genome sequence for sequence alignment in variant calling.
  2. Such a line is one of the parent of the bulks.

You can find an example from this link https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068433#s2. So this option is not useful for most people. The usage is as below if the above conditions are satisfied: --parent ref

adriludwig commented 8 months ago

Hi! Thanks for all the explanations! I have parents and bulks, so I'm using the first command with the --parent option since one of the parents has the genome assembled and was used as a reference in the read mapping and variant calling (meaning that all alleles from this parent will be REF after filtering?). But I'm wondering if the order in parents and bulk appears in the tsv will influence the results (I mean the direction of the picks). I'm asking because I got the same topology running this command or running the data that was mapped and called against a third ref genome. For example, I have parent_high parent_low in the parents.tsv and bulk_low bulk_high in the bulk.tsv. Should I have a specific order in these tables? I'm having difficulties interpreting what is the meaning of the picks over 0.1 or below -0.1 in the deltaAF. Does this mean that alleles in bulk2 are associated with one parent in one region and another parent in another region? I apologize if these questions seem naive. Thank you for your attention and patience.

dblhlx commented 8 months ago

The order of the parents in the parents.tsv doesn't affect the data analysis results. The deltaAF is calculated as below: deltaAF = AF_bulk2 - AF_bulk1 So you will get the mirror image for the deltaAF curve (peaks become valleys and valleys turn into peaks) if you change the order of the bulks in the bulks.tsv file. However, this will not affect the BSA-Seq data analysis results.

adriludwig commented 8 months ago

Thanks. I did some tests and also tried to go through your script, so if I understood: parent 1 (if not provided) will be the first one in the tsv and is considered as REF; if for a certain SNP parent1 doesn't have the REF allele, the SNP REF/ALT info will be swapped in the bulk tsv. As I have the "high" parent as the first, "high" alleles are REF, "low" alleles are ALT. So, I'll expect for a correct association of geno/pheno that the "high" bulk (that is my bulk 1) is enriched with REF alleles while "low" bulk is enriched with ALT alleles. So, with the order of parents/bulk I have, the deltaAF I need are the positive ones.

dblhlx commented 8 months ago

When answer your questions, I assume that you did the followings:

  1. You have high-quality sequencing data for one of the parents.
  2. When performing BSA-Seq, you sequenced the genomes of both parents and both bulks.
  3. You performed variant calling for the parents using the sequencing data of parent 1 and parent 2 in (2) and using the sequence in (1) as the reference sequence. You got a file named parents.tsv (any other name is fine) at the end of this step.
  4. You performed variant calling for the bulks using the sequencing data of bulk 1 and bulk 2 in (2) and using the sequence in (1) as the reference sequence. You got a file named bulks.tsv (any other name is fine) at the end of this step.

If the above assumptions are true, the command below should be used for your data analysis. python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha

The file parents.tsv servers two purpose here:

  1. For identifying the common variant set between the parents and the bulks. Doing so would improve the reliability of variant calling and remove some noises in the plots.
  2. For performing variant swap when necessary. The script PyBSASeq.py uses the first parent in parents.tsv as the standard for variant swap.

It is not needed to use the argument --parent ref in this case. The results should be the same as above if you use the command below: python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha --parent ref

The argument --parent ref is necessary when you sequenced only the bulks in (2) and you used the genome sequence of one of the parent as the reference sequence when performing variant calling. In this case, the command below should be used. python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha --parent ref The argument --parent ref just tells the script that no variant swapping is needed because the genome sequence of one of the parents was used as the reference sequence in variant calling. You can compare the results of this command to those of the previous commands. They should be very similar.

If the genome sequence of any other line is used as the reference sequence in variant calling and only the bulks, not the parents, are sequenced in BSA-Seq, the command below should be used. python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha In this case, the absolute deltaAF of each variant is used to calculate the deltaAF value of a genomic region when plotting the deltaAF curve.