dblhlx / PyBSASeq

A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method
GNU General Public License v3.0
30 stars 12 forks source link

Plot explanation #10

Closed RNieuwenhuis closed 6 months ago

RNieuwenhuis commented 9 months ago

Dear,

I ran PyBSASeq using default settings and was wondering about the output, especially the plots. Please find attached the png file. PyBSASeq

I have several questions:

  1. I see in the (a) row many of my raw variants (blue) get lost in filtering (black), is this as expected? The filtering effect does look similar to Figure 1a of the PyBSASeq publication.
  2. In the (b) row I see some very large peaks that mostly emphasize the signal of the variant rich regions after filtering from the (a) row, but the red threshold line does not seem to make a useful distinction. This is different compared to Figure 1b of the PyBSASeq publication where the red line seems to distinguish large peaks from small ones.
  3. The G-statistic panel (c) shows some very interesting peaks, but I do not understand the meaning of the pink and red lines. The red line seems to be near 0 and not useful. The pink one looks like a usable threshold to me.
  4. The delta Allele frequency panel (d) shows a red line that seems to be a threshold but for the pink line it is unclear what it means. A colleague of mine used PyBSASeq a few years ago and got a rather different plot more like Figure 3 of the publication that had also negative delta. Did something change in the software over time? I could not find a commit pointing to such a thing, only that there seemed to be a WoP program previously.

Furthermore, based on the different methods shown in the plot, the significant SNP method yield many more and also quite broad peak compared to the G-statistic. The delta AF method results in many more QTL as well. What would you advise be in this case? Focus on chr3 and ch6 or should all the other peaks also have the same priority?

Looking forward to your expert advise.

Kind regards,

Ronald

dblhlx commented 9 months ago

Hi Ronald,

Thanks for using PyBSASeq.

In panel a, the blue curve indicates total SNPs while the black curve indicates the significant SNPs. If a genomic region contains a gene that participate in controlling the trait, we will see SNP enrichment (more sSNPs) in this region. Otherwise no SNP enrichment will be observed, no matter how many total SNPs this region has.

The red lines in panels b, c, and d indicate confidence intervals/threshold at the genomic region level while the magenta curves in panels c and d indicate confidence intervals/thresholds at the SNP levels. The sSNP/totalSNP ratio is a parameter at the genomic region level, so its thresholds cannot be estimated at the SNP level. It is proper to estimate confidence intervals/thresholds at the genomic region level, because we try to identify genomic regions, not a few SNP, that are associated the trait. But please ignore the red lines in panel c and d for the time being, it is not published yet. You can find the details in my new manuscript on BioRXiv (https://www.biorxiv.org/content/10.1101/2023.03.12.532308v2). The difference of panel b between your attached picture and my previous publication is that different p-values are used: alpha=0.01 was used to identify sSNPs in the real dataset while alpha=0.1 was used for the same purpose in the simulated dataset for increasing threshold in my previous publications. But alpha=0.01 was used for both the real and simulated datasets in your case. You can change the alpha value when run the script.

A peak above or beyond the confidence intervals/thresholds represents a gene that participates in controlling the phenotype. Genes contributing more to the phenotype will have higher peaks while genes contributing less to the phenotype will have lower peaks. Any phenotype is the results of interactions among many genes, so it is OK that tens/hundreds peaks are identified in a BSA-Seq study.

Please let me know if you have any question.

Best,

Jianbo