RCollins13 / WGD

A suite of tools to evaluate dosage in whole-genome sequencing libraries
MIT License

Request on the 6F matrix (full path to 6F metadata matrix for that chromosome) and input question for cnMOPS_workflow.sh #7

Closed qinqian closed 4 years ago

qinqian commented 5 years ago

Thanks for releasing such a good repository. I am trying to develop a structural variation calling and annotation pipeline following your paper (An open resource of structural variation for medical and population genetics) using WDL and our PBS cluster. However, I am stuck on some details in the SV calling part, specifically the CNV part.

First, the 6F metadata matrices for multiCorrection.R are not available, and this R script needs four columns instead of three. Second, cnMOPS_workflow.sh is not mentioned in the README. Is the input binCov matrix for this script the raw output of binCov.py, the output of multiCorrection.R, or the later output from estimatePloidy.R?

As I understand it (sorry if this is incorrect), the ploidy estimation and dosage scoring steps help us predict the gender and PCR status used for sample batching. If our data already has gender and sequencing batch information, can we skip these steps and only run cnMOPS_workflow.sh?

RCollins13 commented 5 years ago

Hi,

Thanks for your interest in using the codebase for your own projects!

In response to your questions:

  1. The documentation for the WGD repo is a bit out of date: the 6F correction is no longer a component of our best-practices pipeline. It may be reintroduced at a later time, but currently is not recommended.

  2. The input binCov matrix to cnMOPS_workflow.sh is the combined output of binCov.py across all samples in your cohort or batch. The raw binCov outputs can be combined into a single matrix using makeMatrix.sh in the WGD repo.

  3. The dosage scoring and ploidy estimation steps are indeed used primarily to infer sample sex and PCR status. Depending on the size of your cohort (say, fewer than ~500 samples, although this is an approximate recommendation), batching may not be necessary, in which case you could proceed directly to cnMOPS_workflow.sh as you suggest. However, if your cohort is larger than ~500 samples, batching is recommended, and the batching procedure we used in the gnomAD callset does require dosage scoring and ploidy estimation.
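To make point 2 concrete, here is a rough Python sketch of what "combining binCov outputs into a single matrix" means. It is not makeMatrix.sh itself. It assumes each per-sample binCov output is a tab-delimited table with the columns chr, start, end, coverage, and the names `merge_bincov` and `needs_batching` are hypothetical helpers made up for this illustration. The 500-sample cutoff is only the approximate guideline from point 3.

```python
def merge_bincov(sample_tables):
    """Merge per-sample coverage tables into one bins-by-samples matrix.

    sample_tables: dict mapping sample ID -> iterable of text lines,
    each assumed to look like "chr<TAB>start<TAB>end<TAB>coverage".
    Returns (header, rows) with one coverage column per sample.
    """
    samples = list(sample_tables)
    merged = {}   # (chr, start, end) -> {sample ID: coverage}
    order = []    # bins in the order they are first seen
    for sample in samples:
        for line in sample_tables[sample]:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and header/comment lines
            chrom, start, end, cov = line.split("\t")[:4]
            key = (chrom, int(start), int(end))
            if key not in merged:
                merged[key] = {}
                order.append(key)
            merged[key][sample] = float(cov)

    header = ["chr", "start", "end"] + samples
    rows = []
    for key in order:
        covs = merged[key]
        if len(covs) != len(samples):
            # Every sample must cover the same set of bins.
            raise ValueError(f"bin {key} is missing in at least one sample")
        rows.append(list(key) + [covs[s] for s in samples])
    return header, rows


def needs_batching(n_samples, threshold=500):
    """Approximate rule of thumb: batch only when the cohort exceeds ~500."""
    return n_samples > threshold


if __name__ == "__main__":
    # Toy input: two samples, two 100 bp bins each (made-up numbers).
    tables = {
        "sampleA": ["1\t0\t100\t30", "1\t100\t200\t28"],
        "sampleB": ["1\t0\t100\t15", "1\t100\t200\t14"],
    }
    header, rows = merge_bincov(tables)
    print("\t".join(header))
    for row in rows:
        print("\t".join(str(value) for value in row))
    print("batching needed:", needs_batching(len(tables)))
```

In a real run you would use makeMatrix.sh for this step; the sketch only shows the shape of the matrix that cnMOPS_workflow.sh expects as input (one row per bin, one column per sample).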

Thanks, Ryan


qinqian commented 5 years ago

Hi Ryan,

Thank you for your detailed answers. Our cohort consists of fewer than 400 samples, so we'll try calling without the batching procedure. We are looking forward to your later updates to the best-practices pipeline.

Best, Qian

zhouwzfw commented 4 years ago

Hi Ryan,

Thank you for your great work on structural variation identification in whole-genome sequencing data. I am also trying to control for dosage bias following your paper (An open resource of structural variation for medical and population genetics). In the paper, you developed the WGD model to quantify dosage bias and provided 3,202 bins as a public resource. However, I can only find WGD_scoring_mask.6F_adjusted.100bp.hg19.bed and WGD_scoring_mask.rawCov.100bp.hg19.bed in the refs directory, and neither contains 3,202 bins. If 6F coverage correction is not currently recommended, can I use the bins and weights of WGD_scoring_mask.6F_adjusted.100bp.hg19.bed to calculate the dosage score from the raw bin coverage? Thank you very much in advance.

Best,

Weizhen

RCollins13 commented 4 years ago

Hi @zhouwzfw,

Good question, and my apologies that this isn't documented more clearly. The 556 bins in WGD_scoring_mask.rawCov.100bp.hg19.bed are a subset of the 3,202, and they have been load-balanced so that the average score should be approximately 0. In practice, we use this mask for dosage scoring on hg19/GRCh37, so I would recommend using these bins and weights for your samples.
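For intuition about how bins and weights turn into a per-sample score, here is a minimal Python sketch. It is not the repository's scoring code, and the formula is an assumption made for illustration: coverage is normalized by the sample's median over the mask bins, and the score is the mean of weight × (normalized coverage − 1). The function name `dosage_score` and the bin labels are hypothetical.

```python
from statistics import median


def dosage_score(bin_cov, bin_weights):
    """Illustrative weighted dosage score for a single sample.

    bin_cov:     dict mapping bin ID -> raw coverage for this sample
    bin_weights: dict mapping bin ID -> weight taken from a scoring mask
    ASSUMPTION: score = mean over mask bins of weight * (cov / median - 1).
    """
    shared = [b for b in bin_weights if b in bin_cov]
    if not shared:
        raise ValueError("no mask bins found in the coverage data")
    med = median(bin_cov[b] for b in shared)
    if med == 0:
        raise ValueError("median coverage over mask bins is zero")
    total = sum(bin_weights[b] * (bin_cov[b] / med - 1.0) for b in shared)
    return total / len(shared)


if __name__ == "__main__":
    # Made-up weights and coverage values for four bins.
    weights = {"bin1": 1.0, "bin2": 1.0, "bin3": -1.0, "bin4": -1.0}
    flat = {"bin1": 30, "bin2": 30, "bin3": 30, "bin4": 30}
    skewed = {"bin1": 36, "bin2": 30, "bin3": 30, "bin4": 24}
    print("flat sample:  ", dosage_score(flat, weights))
    print("skewed sample:", dosage_score(skewed, weights))
```

Under this assumed formula, a sample with uniform coverage across the mask bins scores 0, and a sample whose coverage rises in positively weighted bins and falls in negatively weighted bins scores above 0. That is the sense in which a balanced mask keeps the average score near 0.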

Thanks, Ryan