Closed adriludwig closed 6 months ago
I used the dataset from the link https://www.ebi.ac.uk/ena/browser/view/PRJEB27629 for code testing. The genomes of both parents were sequenced in this study. I first performed variant calling using GATK4 and the two parental sequencing files, and then convert the resulting vcf file to a tsv file. I normally name this file as parents.tsv, but you can use whatever name you like. Then I do the same for the two bulk sequencing files and generate the bulks.tsv file. Theoretically, the SVs (structural variants like SNPs and InDels) in parents.tsv should be the same as those in bulks.tsv, but there always is some difference between the two tsv files. I use the common SV dataset for BSA-Seq data analysis to filter out some noises. The pyBSASeq.py script can generate the common SV dataset if two input tsv files are provided.
If you sequenced both of the parental lines, please do the same as described above. Then use the command below to perform BSA-Seq data analysis:
python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha
The file parents.tsv should be placed before bulks.tsv.
If you sequenced the genomes of only the bulks, please use the command below to perform BSA-Seq data analysis:
python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha
If you sequenced the genomes of both of the bulks and one of the parents, please use the command below to perform BSA-Seq data analysis:
python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha
Code modification of pyBSASeq.py is needed to take advantage of the single parental sequencing data in this case.
The --parent
option is for a very special situation. You can use this option only if the conditions below are satisfied:
You can find an example from this link https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0068433#s2. So this option is not useful for most people. The usage is as below if the above conditions are satisfied:
--parent ref
Hi! Thanks for all the explanations! I have parents and bulks, so I'm using the first command with the --parent option since one of the parents has the genome assembled and was used as a reference in the read mapping and variant calling (meaning that all alleles from this parent will be REF after filtering?). But I'm wondering if the order in parents and bulk appears in the tsv will influence the results (I mean the direction of the picks). I'm asking because I got the same topology running this command or running the data that was mapped and called against a third ref genome. For example, I have parent_high parent_low in the parents.tsv and bulk_low bulk_high in the bulk.tsv. Should I have a specific order in these tables? I'm having difficulties interpreting what is the meaning of the picks over 0.1 or below -0.1 in the deltaAF. Does this mean that alleles in bulk2 are associated with one parent in one region and another parent in another region? I apologize if these questions seem naive. Thank you for your attention and patience.
The order of the parents in the parents.tsv doesn't affect the data analysis results. The deltaAF is calculated as below:
deltaAF = AF_bulk2 - AF_bulk1
So you will get the mirror image for the deltaAF curve (peaks become valleys and valleys turn into peaks) if you change the order of the bulks in the bulks.tsv file. However, this will not affect the BSA-Seq data analysis results.
Thanks. I did some tests and also tried to go through your script, so if I understood: parent 1 (if not provided) will be the first one in the tsv and is considered as REF; if for a certain SNP parent1 doesn't have the REF allele, the SNP REF/ALT info will be swapped in the bulk tsv. As I have the "high" parent as the first, "high" alleles are REF, "low" alleles are ALT. So, I'll expect for a correct association of geno/pheno that the "high" bulk (that is my bulk 1) is enriched with REF alleles while "low" bulk is enriched with ALT alleles. So, with the order of parents/bulk I have, the deltaAF I need are the positive ones.
When answer your questions, I assume that you did the followings:
If the above assumptions are true, the command below should be used for your data analysis.
python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha
The file parents.tsv servers two purpose here:
It is not needed to use the argument --parent ref
in this case. The results should be the same as above if you use the command below:
python pyBSASeq.py -i parents.tsv,bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha --parent ref
The argument --parent ref
is necessary when you sequenced only the bulks in (2) and you used the genome sequence of one of the parent as the reference sequence when performing variant calling. In this case, the command below should be used.
python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha --parent ref
The argument --parent ref
just tells the script that no variant swapping is needed because the genome sequence of one of the parents was used as the reference sequence in variant calling. You can compare the results of this command to those of the previous commands. They should be very similar.
If the genome sequence of any other line is used as the reference sequence in variant calling and only the bulks, not the parents, are sequenced in BSA-Seq, the command below should be used.
python pyBSASeq.py -i bulks.tsv -p popstrct -b fbsize,sbsize -v alpha,smalpha
In this case, the absolute deltaAF of each variant is used to calculate the deltaAF value of a genomic region when plotting the deltaAF curve.
Hi! It's me again! I finally have the variant call against the genome of one of the parents. If I understood well I need to add a parameter --parent parent_genome.fasta, right? Then my question is, does it matter the order of samples in the tsv file? Should the high/low order be the same in parent and bulks tsv? Thanks so much!