Good to hear that process_smrtcells and process_samples are running well for you! The cohort.yaml file is only used by process_cohort, so no worries about changing it.
With regard to cohort size, batch effect is not a concern for this workflow, so you should select cohorts based on your research goal. Most often we run singletons, trios, duos, and quads. Multi-sample cohorts are beneficial if you want to jointly call variants, which really only matters when you're examining inheritance or when you want population-level statistics for variants, like allele frequency. If you're interested in running big cohorts for population statistics, note that joint calling in pbsv sometimes fails with large sample sizes (100-200+ samples).
In short, I'd recommend running your samples based on their relationship to other samples you've sequenced, such as singletons, trios, duos, etc.
Ok! To clarify my experimental design: I have ~12 singletons and one trio. Right now those are being run as two cohorts, one with the singletons and one with the trio.
You would not recommend that I make one single cohort, specifying parental relations for the trio?
Thank you!
Personally, I would specify 13 cohorts here (12 singletons with no relationship info and 1 trio with parental relations specified).
The only situation that I would combine them all into one cohort is if you were studying some gene or set of genes that contribute to a phenotype that is shared among all "affected" samples in the cohort.
I see - well they all have the same syndrome, but sample size is too small to make any definite statements. Will runtime (or memory needs) be drastically impacted if I run all of the singletons together?
Yes, runtime and memory use will probably be higher but not excessively. Runtime isn't very long on process_cohort unless you have trio assembly running. Since this cohort would be all singletons, I don't expect computational requirements to be an issue at all.
great! Then that is what I will do. Thank you, Juniper!
Having issues running:
[truncated]/workflow/process_cohort.local.sh TRIO
Errors:
Config file workflow/reference.yaml is extended by additional config specified via the command line.
Config file workflow/config.yaml is extended by additional config specified via the command line.
FileNotFoundError in line 15 of [truncated]/workflow/process_cohort.smk:
[Errno 2] No such file or directory: 'cohort.yaml'
File "[truncated]/workflow/process_cohort.smk", line 55, in <module>
File "[truncated]/workflow/process_cohort.smk", line 15, in get_samples
Processing cohort TRIO with reference GRCh38.
It appears that the yaml files are being read given the "Processing..." part. Thank you!
The "Processing..." part is just a print statement and doesn't indicate that yaml files are being read correctly. It just indicates that the name of the cohort ("TRIO" in this case) was read from the command line. The problem here is that your cohort.yaml
file isn't in the correct place. It should be in your working directory. The location of this file can be configured in workflow/config.yaml
under the entry cohort_yaml
.
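A quick way to confirm where the workflow will look (a sketch; the path value shown in the comment is just an example) is to grep the entry:

grep cohort_yaml workflow/config.yaml
# e.g. cohort_yaml: 'cohort.yaml'  # resolved relative to the working directory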
I see - I edited that path in workflow/config.yaml and am now able to proceed.
However, I am hitting another error:
Config file workflow/reference.yaml is extended by additional config specified via the command line.
Config file workflow/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
MissingInputException in line 4 of /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/pacbio_wgs/121422/workflow/rules/cohort_svpack.smk:
Missing input files for rule svpack_filter_annotated:
resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz
Processing cohort PROBAND_COHORT with reference GRCh38.
Samples in cohort: ['S1', 'S2',...etc].
No trios found in cohort.
(I put the trio and singletons in two different cohorts)
How can we fix this rule issue? Thank you, Juniper!
The error message tells you which file is missing: resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz. Please check that the resources folder is in the correct location and contains all necessary files. If necessary, re-download the resources folder using the FTP2 information that was originally provided to you by Billy.
It's not in the folder that I have access to (see screenshot)
I don't think it's there? Here are the vcf files in resources/hprc
hprc.GRCh38.deepvariant.vcf.gz
hprc.GRCh38.deepvariant.vcf.gz.tbi
hprc.GRCh38.pbsv.vcf.gz
hprc.GRCh38.pbsv.vcf.gz.tbi
README
It's hprc.GRCh38.pbsv.vcf.gz.
Thanks @williamrowell
In workflow/reference.yaml, change
hprc_pbsv_vcf: 'resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz'
to
hprc_pbsv_vcf: 'resources/hprc/hprc.GRCh38.pbsv.vcf.gz'
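If you'd rather script the edit, something like this should work (a sketch; assumes GNU sed and that the entry appears exactly once in the file):

# confirm the bundled pbsv VCF is actually present
ls -lh resources/hprc/hprc.GRCh38.pbsv.vcf.gz
# swap the stale filename for the one on disk
sed -i 's|hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz|hprc.GRCh38.pbsv.vcf.gz|' workflow/reference.yaml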
It made it to the downloading stage and is still running!
Request for the future: Is it possible to separate the scripts to download and install environments vs. the process_XX? It would be great to just download and build everything we need once, at the beginning, then be good to go!
Rationale: When I submit jobs to the cluster, the nodes can't connect to the internet, so I have to run each process_XX on the head node so everything can be downloaded, then cancel it when it starts doing something (so I can do that in a job). This cost me a non-trivial amount of time, because I had to watch this script to ensure that it didn't get too far (vs. set and forget).
Thank you for all of your hard work, and for fielding my questions very quickly!
There is a snakemake option called --conda-create-envs-only that can be appended to the snakemake commands in the bash launch scripts ending in .sh. This option should do what you're trying to do automatically.
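As a sketch of the idea (the actual launch scripts pass more options than this; --use-conda, --conda-create-envs-only, -s, and --configfile are all standard snakemake flags):

# run once on the head node, where internet access is available,
# to build every conda environment without executing any rules
snakemake -s workflow/process_cohort.smk \
  --configfile workflow/reference.yaml workflow/config.yaml \
  --use-conda --conda-create-envs-only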
oh yeah! Billy told me about that in a previous thread, and I forgot I used that flag. Won't forget again!
Great news - process_cohort finished for my cohort of only probands without any obvious errors in the main logfile!
But I'm going through the results files now and see that a large number of phased vcf files in whatshap/regions are empty, as shown below:
filesize basename
0 COHORT_proband_only.GRCh38.chr10.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr11.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr12.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr1.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr2.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr3.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr4.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr5.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr6.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr7.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr8.deepvariant.glnexus.phased.vcf.gz
0 COHORT_proband_only.GRCh38.chr9.deepvariant.glnexus.phased.vcf.gz
4.8M COHORT_proband_only.GRCh38.chrY.deepvariant.glnexus.vcf.gz
4.9M COHORT_proband_only.GRCh38.chrY.deepvariant.glnexus.phased.vcf.gz
24M COHORT_proband_only.GRCh38.chr22.deepvariant.glnexus.vcf.gz
26M COHORT_proband_only.GRCh38.chr21.deepvariant.glnexus.vcf.gz
26M COHORT_proband_only.GRCh38.chr22.deepvariant.glnexus.phased.vcf.gz
27M COHORT_proband_only.GRCh38.chr21.deepvariant.glnexus.phased.vcf.gz
32K COHORT_proband_only.GRCh38.chrM.deepvariant.glnexus.phased.vcf.gz
32K COHORT_proband_only.GRCh38.chrM.deepvariant.glnexus.vcf.gz
34M COHORT_proband_only.GRCh38.chr19.deepvariant.glnexus.vcf.gz
36M COHORT_proband_only.GRCh38.chr19.deepvariant.glnexus.phased.vcf.gz
39M COHORT_proband_only.GRCh38.chr20.deepvariant.glnexus.vcf.gz
41M COHORT_proband_only.GRCh38.chr20.deepvariant.glnexus.phased.vcf.gz
46M COHORT_proband_only.GRCh38.chr14.deepvariant.glnexus.vcf.gz
47M COHORT_proband_only.GRCh38.chr15.deepvariant.glnexus.vcf.gz
49M COHORT_proband_only.GRCh38.chr14.deepvariant.glnexus.phased.vcf.gz
49M COHORT_proband_only.GRCh38.chr16.deepvariant.glnexus.vcf.gz
49M COHORT_proband_only.GRCh38.chr17.deepvariant.glnexus.vcf.gz
50M COHORT_proband_only.GRCh38.chr15.deepvariant.glnexus.phased.vcf.gz
51M COHORT_proband_only.GRCh38.chr16.deepvariant.glnexus.phased.vcf.gz
51M COHORT_proband_only.GRCh38.chr17.deepvariant.glnexus.phased.vcf.gz
53M COHORT_proband_only.GRCh38.chrX.deepvariant.glnexus.vcf.gz
54M COHORT_proband_only.GRCh38.chr18.deepvariant.glnexus.vcf.gz
54M COHORT_proband_only.GRCh38.chrX.deepvariant.glnexus.phased.vcf.gz
56M COHORT_proband_only.GRCh38.chr18.deepvariant.glnexus.phased.vcf.gz
58M COHORT_proband_only.GRCh38.chr13.deepvariant.glnexus.vcf.gz
61M COHORT_proband_only.GRCh38.chr13.deepvariant.glnexus.phased.vcf.gz
73M COHORT_proband_only.GRCh38.chr9.deepvariant.glnexus.vcf.gz
75M COHORT_proband_only.GRCh38.chr8.deepvariant.glnexus.vcf.gz
76M COHORT_proband_only.GRCh38.chr12.deepvariant.glnexus.vcf.gz
77M COHORT_proband_only.GRCh38.chr10.deepvariant.glnexus.vcf.gz
82M COHORT_proband_only.GRCh38.chr11.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr5.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr6.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr7.deepvariant.glnexus.vcf.gz
99M COHORT_proband_only.GRCh38.chr3.deepvariant.glnexus.vcf.gz
100M COHORT_proband_only.GRCh38.chr4.deepvariant.glnexus.vcf.gz
117M COHORT_proband_only.GRCh38.chr2.deepvariant.glnexus.vcf.gz
127M COHORT_proband_only.GRCh38.chr1.deepvariant.glnexus.vcf.gz
Is this common?
That is unexpected. I would expect every *.phased.vcf.gz to be nearly exactly the same size (minus differences due to compression) as the corresponding *.vcf.gz. Take a look at logs like cohorts/COHORT_proband_only/logs/whatshap/phase/COHORT_proband_only.GRCh38.chr10.log to see if anything looks off.
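For a quick triage pass (a sketch; the regions directory path is inferred from your message above and may differ in your run):

# list every zero-byte phased VCF in the cohort
find cohorts/COHORT_proband_only/whatshap/regions -name '*.phased.vcf.gz' -size 0
# then check the tail of the matching whatshap log for errors
tail -n 20 cohorts/COHORT_proband_only/logs/whatshap/phase/COHORT_proband_only.GRCh38.chr10.log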
Also, sorry to just be catching up, but is this cohort a bunch of unrelated samples? The cohort.ref.deepvariant.glnexus.phased.vcf.gz would probably be fine, but the filtered/annotated datasets would probably be useless, since the expectation is that all affected samples in a cohort share the same variant.
Don't worry - Juniper was very helpful and got me through the final step. And I will run this again when we get the subreads down from the archives so I can get 5mC calls.
RE structure: This is the same data freeze that I was starting when we spoke at ASHG 2022 in LA. In this freeze is 1 complete trio and ~10 probands. For now, I ran the probands together and the trio together, in two separate cohorts.
Your statement that the "filtered/annotated datasets would probably be useless, since the expectation is that all affected samples in a cohort share the same variant" answers my question that I just submitted RE phrank scores. Now I see why Juniper suggested running each sample as a separate cohort in an earlier thread.
I decided to run everything together because each proband does have an autism diagnosis, so I thought dataset-level frequency stats might be helpful. However, there is no expectation that the underlying variant, or gene, would be identical among samples. If anything, the expectation would be that the variant(s) and gene(s) would be different across individuals.
Next Step: I will run each proband as an individual cohort, and check that all whatshap files are complete. This should fix both issues (i.e., phasing and annotation).
For the two cohorts that I submitted yesterday, the phrank files created in both cohorts have the same md5sums, which makes me think that either I didn't make the cohort file correctly or something is being merged somewhere.
In cohorts/TRIO_COHORT/svpack:
cut -f3 TRIO_COHORT.GRCh38.pbsv.svpack.tsv | sort | uniq -c
1823 Sample1
1774 Sample2
1806 Sample3
1 sample_id
These are the right samples. The samples in the corresponding file for PROBAND_COHORT are also correct.
FYI this is my cohort file with all of the things after the colon redacted
- id: PROBAND_COHORT
  phenotypes:
    - HP:
  affecteds:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
    - id:
- id: TRIO_COHORT
  phenotypes:
    - HP:
  affecteds:
    - id:
      parents:
        -
        -
      sex:
  unaffecteds:
    - id:
      sex:
    - id:
      sex:
Thank you!
Were the HPO terms the same for both cohorts? If so, the phrank.tsv files would be identical.
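A quick way to check (a sketch; the exact phrank output paths and filenames in your run may differ):

# byte-identical digests here just mean identical HPO term lists,
# not that the cohorts were merged
md5sum cohorts/PROBAND_COHORT/*phrank*.tsv cohorts/TRIO_COHORT/*phrank*.tsv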
Aaah, that would be it. I don't know why I visualized the variants interacting with the phrank score when re-reading your other explanation. The variants are annotated with the gene-level phrank score. Thank you, Billy!
It looks like everything finished successfully! glnexus wasn't run when I put each proband in an individual cohort, so those vcfs that were problematic are now non-issues.
For the future, would it be worth adding svpack and slivar to process_samples to get sample-level filtering? That way, for non-trio runs, you don't have to run the extra step, and it saves some active time.
As-is, for output from process_samples, I would imagine that everyone would want the post-processing to guide analysis, because so much data is returned. Thank you again, Billy!
Very happy everything finished successfully! Thank you for the feedback on the workflow, it's always helpful to know what our customers want/need from our tools.
Hi Billy -
Before I started running the pipeline, I made my cohort.yaml file in the format of the tutorial, and [now] successfully ran process_smrtcells and process_samples (yay - many thanks!)

Now that I am going to run process_cohort, I see that I want to re-tool the layout of my cohort.yaml file, as I separated the trio from the rest of my non-trio probands.

It looks like I can change my cohort.yaml file before running step 3, because I don't think it has been used yet, but wanted to confirm! Is this OK?

Also, would you recommend running all of my samples in the same cohort, or running cohorts delineated by batches? I.e., we had samples returned to us 4/batch (and have ~4 batches at this point). Given the accuracy of this technology, I wasn't concerned with batch effects here, and think there would be more value in creating one cohort, but would appreciate your opinion at this step.
Thank you again!