PacificBiosciences / pb-human-wgs-workflow-snakemake

DEPRECATED - Workflow for the comprehensive detection and prioritization of variants in human genomes with PacBio HiFi reads
BSD 3-Clause Clear License

cohort.yaml question #149

Closed lauragails closed 1 year ago

lauragails commented 1 year ago

Hi Billy -
Before I started running the pipeline, I made my cohort.yaml file in the format of the tutorial, and now I've successfully run process_smrtcells and process_samples (yay - many thanks!)

Now that I am about to run process_cohort, I'd like to re-tool the layout of my cohort.yaml file, since I separated the trio from the rest of my (non-trio) probands.

It looks like I can change my cohort.yaml file before running step 3, because I don't think it has been used yet, but I wanted to confirm: is this OK?

Also, would you recommend running all of my samples in the same cohort, or running cohorts delineated by batch? (We had samples returned to us 4 per batch, and have ~4 batches at this point.) Given the accuracy of this technology, I'm not concerned with batch effects here, and I think there would be more value in creating one cohort, but I would appreciate your opinion at this step.

Thank you again!

juniper-lake commented 1 year ago

Good to hear that process_smrtcells and process_samples are running well for you!

The cohort.yaml file is only used by process_cohort, so no worries about changing it.

With regard to cohort size, batch effect is not a concern for this workflow, so you should select cohorts based on your research goal. Most often we run singletons, trios, duos, and quads. Multi-sample cohorts are beneficial if you want to jointly call variants, which really only matters when you're examining inheritance or when you want population statistics for variants, such as allele frequency. If you're interested in running big cohorts for population statistics, note that joint calling in pbsv sometimes fails with large sample sizes (100-200+ samples).

In short, I'd recommend running your samples based on their relationship to other samples you've sequenced, such as singletons, trios, duos, etc.

lauragails commented 1 year ago

OK! To clarify my experimental design: I have ~12 singletons and one trio. Right now I have them running as two cohorts, one with the singletons and one with the trio.

You would not recommend making one single cohort while specifying the parental relations for the trio?

Thank you!

juniper-lake commented 1 year ago

Personally, I would specify 13 cohorts here (12 singletons with no relationship info and 1 trio with parental relations specified).

The only situation in which I would combine them all into one cohort is if you were studying some gene or set of genes contributing to a phenotype that is shared among all "affected" samples in the cohort.
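For illustration, a rough sketch of what those cohort.yaml entries could look like is below. The sample IDs, HPO term, and sexes are placeholders (not from this thread), and the exact schema is defined by the tutorial, so treat this as an outline only:

- id: singleton01
  phenotypes:
  - HP:0000729        # placeholder HPO term
  affecteds:
  - id: SAMPLE_A      # placeholder sample name
    sex: MALE

- id: trio01
  phenotypes:
  - HP:0000729        # placeholder HPO term
  affecteds:
  - id: CHILD         # placeholder sample names throughout
    parents:
    - FATHER
    - MOTHER
    sex: FEMALE
  unaffecteds:
  - id: FATHER
    sex: MALE
  - id: MOTHER
    sex: FEMALE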

lauragails commented 1 year ago

I see - well, they all have the same syndrome, but the sample size is too small to make any definitive statements. Will runtime (or memory needs) be drastically impacted if I run all of the singletons together?


juniper-lake commented 1 year ago

Yes, runtime and memory use will probably be higher but not excessively. Runtime isn't very long on process_cohort unless you have trio assembly running. Since this cohort would be all singletons, I don't expect computational requirements to be an issue at all.

lauragails commented 1 year ago

Great! Then that is what I will do. Thank you, Juniper!

lauragails commented 1 year ago

Having issues running:

[truncated]/workflow/process_cohort.local.sh TRIO

Errors

Config file workflow/reference.yaml is extended by additional config specified via the command line.
Config file workflow/config.yaml is extended by additional config specified via the command line.
FileNotFoundError in line 15 of [truncated]/workflow/process_cohort.smk:
[Errno 2] No such file or directory: 'cohort.yaml'
  File "[truncated]/workflow/process_cohort.smk", line 55, in <module>
  File "[truncated]/workflow/process_cohort.smk", line 15, in get_samples

Processing cohort TRIO with reference GRCh38.

It appears that the yaml files are being read, given the "Processing..." output. Thank you!

juniper-lake commented 1 year ago

The "Processing..." part is just a print statement and doesn't indicate that yaml files are being read correctly. It just indicates that the name of the cohort ("TRIO" in this case) was read from the command line. The problem here is that your cohort.yaml file isn't in the correct place. It should be in your working directory. The location of this file can be configured in workflow/config.yaml under the entry cohort_yaml.

lauragails commented 1 year ago

I see - I edited that path to read "workflow/config.yaml" and am now able to proceed.

However I am hitting another error

Config file workflow/reference.yaml is extended by additional config specified via the command line.
Config file workflow/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
MissingInputException in line 4 of /sc/arion/projects/buxbaj01a/sloofl01/pac_bio/results/pacbio_wgs/121422/workflow/rules/cohort_svpack.smk:
Missing input files for rule svpack_filter_annotated:
resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz
Processing cohort PROBAND_COHORT with reference GRCh38.
Samples in cohort: ['S1', 'S2',...etc]. 

No trios found in cohort.

(I put the trio and singletons in two different cohorts)

How can we fix this rule issue? Thank you, Juniper!

juniper-lake commented 1 year ago

The error message tells you which file is missing: resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz. Please check that the resources folder is in the correct location and contains all necessary files. If necessary, re-download the resources folder using the FTP2 information that was originally provided to you by Billy.

lauragails commented 1 year ago

It's not in the folder that I have access to. Here are the VCF files in resources/hprc as of ~5 minutes ago:

hprc.GRCh38.deepvariant.vcf.gz
hprc.GRCh38.deepvariant.vcf.gz.tbi
hprc.GRCh38.pbsv.vcf.gz
hprc.GRCh38.pbsv.vcf.gz.tbi
README
williamrowell commented 1 year ago

It's hprc.GRCh38.pbsv.vcf.gz.

juniper-lake commented 1 year ago

Thanks @williamrowell

In workflow/reference.yaml, change hprc_pbsv_vcf: 'resources/hprc/hprc.GRCh38.pbsv.v.2.6.0-20210417.vcf.gz' to hprc_pbsv_vcf: 'resources/hprc/hprc.GRCh38.pbsv.vcf.gz'.
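In other words, the entry should end up as the excerpt below (only this line changes; the rest of the file stays as-is):

# workflow/reference.yaml (excerpt)
hprc_pbsv_vcf: 'resources/hprc/hprc.GRCh38.pbsv.vcf.gz'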

lauragails commented 1 year ago

It made it to the downloading stage and is still running!

Request for the future: would it be possible to separate the scripts that download resources and install environments from the process_XX steps? It would be great to download and build everything we need once, at the beginning, and then be good to go.

Rationale: when I submit jobs to the cluster, the compute nodes can't connect to the internet, so I have to run each process_XX on the head node so everything can be downloaded, then cancel it once it starts doing real work (so I can resubmit it as a job). This cost me a non-trivial amount of time, because I had to watch the script to make sure it didn't get too far (vs. set and forget).

Thank you for all of your hard work, and for fielding my questions very quickly!

juniper-lake commented 1 year ago

There is a Snakemake option called --conda-create-envs-only that can be appended to the snakemake commands in the bash launch scripts ending in .sh. This option should do what you're describing automatically.
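As a sketch (the launch scripts' existing flags aren't shown in this thread, so the exact command will differ), the idea is to run the same snakemake command once on a node with internet access with the extra flag, then submit the normal jobs afterwards:

# run once on the head node to download/build the conda environments only
snakemake --use-conda --conda-create-envs-only -s workflow/process_cohort.smk ...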

lauragails commented 1 year ago

Oh yeah! Billy told me about that in a previous thread, and I forgot that I'd used that flag. I won't forget again!

lauragails commented 1 year ago

Great news - process_cohort finished for my cohort of only probands without any obvious errors in the main logfile!

But I'm going through the results files now and see that a large number of the phased VCF files in whatshap/regions are empty, as shown below:

filesize    basename
0   COHORT_proband_only.GRCh38.chr10.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr11.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr12.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr1.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr2.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr3.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr4.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr5.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr6.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr7.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr8.deepvariant.glnexus.phased.vcf.gz
0   COHORT_proband_only.GRCh38.chr9.deepvariant.glnexus.phased.vcf.gz
4.8M    COHORT_proband_only.GRCh38.chrY.deepvariant.glnexus.vcf.gz
4.9M    COHORT_proband_only.GRCh38.chrY.deepvariant.glnexus.phased.vcf.gz
24M COHORT_proband_only.GRCh38.chr22.deepvariant.glnexus.vcf.gz
26M COHORT_proband_only.GRCh38.chr21.deepvariant.glnexus.vcf.gz
26M COHORT_proband_only.GRCh38.chr22.deepvariant.glnexus.phased.vcf.gz
27M COHORT_proband_only.GRCh38.chr21.deepvariant.glnexus.phased.vcf.gz
32K COHORT_proband_only.GRCh38.chrM.deepvariant.glnexus.phased.vcf.gz
32K COHORT_proband_only.GRCh38.chrM.deepvariant.glnexus.vcf.gz
34M COHORT_proband_only.GRCh38.chr19.deepvariant.glnexus.vcf.gz
36M COHORT_proband_only.GRCh38.chr19.deepvariant.glnexus.phased.vcf.gz
39M COHORT_proband_only.GRCh38.chr20.deepvariant.glnexus.vcf.gz
41M COHORT_proband_only.GRCh38.chr20.deepvariant.glnexus.phased.vcf.gz
46M COHORT_proband_only.GRCh38.chr14.deepvariant.glnexus.vcf.gz
47M COHORT_proband_only.GRCh38.chr15.deepvariant.glnexus.vcf.gz
49M COHORT_proband_only.GRCh38.chr14.deepvariant.glnexus.phased.vcf.gz
49M COHORT_proband_only.GRCh38.chr16.deepvariant.glnexus.vcf.gz
49M COHORT_proband_only.GRCh38.chr17.deepvariant.glnexus.vcf.gz
50M COHORT_proband_only.GRCh38.chr15.deepvariant.glnexus.phased.vcf.gz
51M COHORT_proband_only.GRCh38.chr16.deepvariant.glnexus.phased.vcf.gz
51M COHORT_proband_only.GRCh38.chr17.deepvariant.glnexus.phased.vcf.gz
53M COHORT_proband_only.GRCh38.chrX.deepvariant.glnexus.vcf.gz
54M COHORT_proband_only.GRCh38.chr18.deepvariant.glnexus.vcf.gz
54M COHORT_proband_only.GRCh38.chrX.deepvariant.glnexus.phased.vcf.gz
56M COHORT_proband_only.GRCh38.chr18.deepvariant.glnexus.phased.vcf.gz
58M COHORT_proband_only.GRCh38.chr13.deepvariant.glnexus.vcf.gz
61M COHORT_proband_only.GRCh38.chr13.deepvariant.glnexus.phased.vcf.gz
73M COHORT_proband_only.GRCh38.chr9.deepvariant.glnexus.vcf.gz
75M COHORT_proband_only.GRCh38.chr8.deepvariant.glnexus.vcf.gz
76M COHORT_proband_only.GRCh38.chr12.deepvariant.glnexus.vcf.gz
77M COHORT_proband_only.GRCh38.chr10.deepvariant.glnexus.vcf.gz
82M COHORT_proband_only.GRCh38.chr11.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr5.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr6.deepvariant.glnexus.vcf.gz
89M COHORT_proband_only.GRCh38.chr7.deepvariant.glnexus.vcf.gz
99M COHORT_proband_only.GRCh38.chr3.deepvariant.glnexus.vcf.gz
100M    COHORT_proband_only.GRCh38.chr4.deepvariant.glnexus.vcf.gz
117M    COHORT_proband_only.GRCh38.chr2.deepvariant.glnexus.vcf.gz
127M    COHORT_proband_only.GRCh38.chr1.deepvariant.glnexus.vcf.gz

Is this common?

williamrowell commented 1 year ago

That is unexpected. I would expect each *.phased.vcf.gz to be nearly exactly the same size (minus differences due to compression) as the corresponding *.vcf.gz. Take a look at logs like cohorts/COHORT_proband_only/logs/whatshap/phase/COHORT_proband_only.GRCh38.chr10.log to see if anything looks off.
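A quick way to list the empty outputs and pull up a matching log could look like the commands below (paths are based on the cohort name and log path mentioned above; adjust for your layout):

# list zero-byte phased VCFs for this cohort
find cohorts/COHORT_proband_only -name '*.phased.vcf.gz' -size 0

# then inspect the whatshap phase log for one of the affected chromosomes
less cohorts/COHORT_proband_only/logs/whatshap/phase/COHORT_proband_only.GRCh38.chr10.log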

Also, sorry to just be catching up, but is this cohort a bunch of unrelated samples? The cohort.ref.deepvariant.glnexus.phased.vcf.gz would probably be fine, but the filtered/annotated datasets would probably be useless, since the expectation is that all affected samples in a cohort share the same variant.

lauragails commented 1 year ago

Don't worry - Juniper was very helpful and got me through the final step. And I will run this again when we get the subreads down from the archives so I can get 5mC calls.

RE structure: this is the same data freeze that I was starting when we spoke at ASHG 2022 in LA. This freeze contains 1 complete trio and ~10 probands. For now, I ran the probands together and the trio together, in two separate cohorts.

Your statement that the "filtered/annotated datasets would probably be useless, since the expectation is that all affected samples in a cohort share the same variant" answers the question I just submitted about phrank scores. Now I see why Juniper suggested running each sample as a separate cohort in an earlier thread.

I decided to run everything together because each proband has an autism diagnosis, so I thought dataset-level frequency stats might be helpful. However, there is no expectation that the underlying variant, or gene, would be identical among samples; if anything, the expectation is that the variant(s) and gene(s) would differ across individuals.

Next step: I will run each proband as an individual cohort and check that all whatshap files are complete. This should fix both issues (i.e., phasing and annotation).

lauragails commented 1 year ago

For the two cohorts that I submitted yesterday, the phrank files created in both cohorts have the same md5sums, which makes me think that either I didn't make the cohort file correctly or something is being merged somewhere.

In cohorts/TRIO_COHORT/svpack:

cut -f3 TRIO_COHORT.GRCh38.pbsv.svpack.tsv | sort | uniq -c
   1823 Sample1
   1774 Sample2
   1806 Sample3
      1 sample_id

These are the right samples. The samples in the corresponding file for PROBAND_COHORT are also correct.

FYI, this is my cohort file with everything after the colons redacted:

- id: PROBAND_COHORT
  phenotypes:
  - HP:
  affecteds:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:
  - id:

- id: TRIO_COHORT
  phenotypes:
  - HP:
  affecteds:
  - id:
    parents:
    -
    -
    sex:
  unaffecteds:
  - id:
    sex:
  - id:
    sex:

Thank you!

williamrowell commented 1 year ago

Were the HPO terms the same for both cohorts? If so, the phrank.tsv files would be identical.

lauragails commented 1 year ago

Ah, that would be it. I don't know why I imagined the variants interacting with the phrank score when re-reading your other explanation. The variants are annotated with the gene-level phrank score. Thank you, Billy!

lauragails commented 1 year ago

It looks like everything finished successfully! glnexus wasn't run when I put each proband in an individual cohort, so the VCFs that were problematic are now non-issues.

For the future, would it be worth adding svpack and slivar to process_samples to get sample-level filtering? That way, for non-trio runs, you wouldn't have to run the extra step, and it would save some active time.

As-is, for output from process_samples, I would imagine that everyone would want the post-processing to guide analysis, because so much data is returned. Thank you again, Billy!

juniper-lake commented 1 year ago

Very happy everything finished successfully! Thank you for the feedback on the workflow, it's always helpful to know what our customers want/need from our tools.