Rare Disease Analysis - Githubissues

mincej commented 1 year ago

Hello Dr. Strobes, very impressive method! I have a couple of questions regarding the Watershed tool. As I understand from the associated publication, the Watershed model was trained on GTEx data and evaluated using hold-out samples and N2 pairs from GTEx. The variants prioritized by GTEx-trained posterior probability thresholds were then analyzed for replication in an independent dataset (ASMAD).

My question is: how does this model handle variants that it has not "seen before", such as those not previously observed in GTEx?

The original publication and a follow-up analysis seem to investigate complex traits, but I'm interested in whether this tool has been evaluated in the extreme case where an individual presents with a rare disease and has potential VUS that are unannotated or lack sufficient annotation.

Thanks for any help!

BennyStrobes commented 1 year ago

Hi,

The answer depends on whether your question is:

1. How does Watershed handle variants it has not seen before in a data set of paired WGS and transcriptomic data?
2. How does Watershed handle variants it has not seen before in a data set of just WGS?

If your question is 1 (which is the case in the ASMAD cohort), this is simple: you input into a trained Watershed model the genomic annotations describing the rare variant and the outlier status of the nearby genes in the individual carrying the rare variant. See the scripts in predict_watershed.R for more details.

If your question is 2 (which I believe it is), we have not evaluated how Watershed performs in this setting relative to existing methods (such as CADD). Watershed was not developed for this setting; it was developed for the case where you have paired transcriptomic and WGS data. However, it would theoretically be possible: you could train a Watershed model on GTEx individuals and predict P(Z|G) instead of the standard posterior P(Z|G,E), where Z is the latent variable, G is the vector of genomic annotations, and E is the outlier status.
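
The difference between P(Z|G) and P(Z|G,E) described above can be illustrated with a small Python sketch. This is a toy model under loud assumptions, not the Watershed implementation: it uses a single binary outlier signal E, a logistic prior on Z, and made-up parameter values. The names `prior_z`, `posterior_z`, `beta`, `intercept`, and `p_e_given_z` are hypothetical and do not come from the Watershed repository.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def prior_z(g, beta, intercept):
    """P(Z=1 | G): uses genomic annotations only (the WGS-only setting)."""
    return sigmoid(intercept + sum(b * x for b, x in zip(beta, g)))


def posterior_z(g, e, beta, intercept, p_e_given_z):
    """P(Z=1 | G, E): Bayes' rule with one binary outlier signal E.

    p_e_given_z[z] is the assumed probability of observing an outlier
    (E=1) when the latent variable takes the value z.
    """
    p1 = prior_z(g, beta, intercept)
    like1 = p_e_given_z[1] if e == 1 else 1.0 - p_e_given_z[1]
    like0 = p_e_given_z[0] if e == 1 else 1.0 - p_e_given_z[0]
    numerator = like1 * p1
    return numerator / (numerator + like0 * (1.0 - p1))


# Hypothetical parameters, chosen only for illustration (not fitted to GTEx).
beta = [1.2, 0.8]
intercept = -3.0
p_e_given_z = {0: 0.01, 1: 0.6}

# Hypothetical annotation vector for one rare variant.
g = [1.0, 0.5]

wgs_only = prior_z(g, beta, intercept)
paired_outlier = posterior_z(g, 1, beta, intercept, p_e_given_z)
paired_non_outlier = posterior_z(g, 0, beta, intercept, p_e_given_z)

print("P(Z=1 | G)      =", round(wgs_only, 4))
print("P(Z=1 | G, E=1) =", round(paired_outlier, 4))
print("P(Z=1 | G, E=0) =", round(paired_non_outlier, 4))
```

In this toy, the WGS-only score is just the prior from the annotations, while the paired score moves up or down from that prior depending on whether an outlier was observed.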

Happy to discuss more.

mincej commented 1 year ago

Hello Dr. Strobes, thank you for the reply! I think my question was closer to scenario 1, but I feel that I did not provide sufficient detail.

In particular, we're looking at applying paired genomic and RNA-seq data to the clinical assessment of a single rare disease case, as opposed to a cohort-based analysis. I think a more pointed question would be: is the performance of Watershed dependent on the quality and breadth of the genomic annotations available for the variant of interest?

Take the example genomic input from Supplementary Table 3 of the publication: if our variant is deep-intronic and lacks adequate annotation from many of the metrics tested (VEP, presence in GTEx), how is the confidence of the Watershed prediction affected? As an unlikely and extreme case, what would be expected given RNA-seq outlier data but minimal genomic annotation for a variant of unknown significance with an unknown functional consequence?

Thanks for any clarification!

BennyStrobes / Watershed

Rare Disease Analysis #6