U54Bioinformatics / PhylinSic_Project

MIT License

5 stars 1 forks source link

Application to Smart-seq data #1

Open jmfa opened 1 year ago

jmfa commented 1 year ago

Hi, Thanks for this very useful package! I'm interesting in running the tool on some smart-seq2 datasets. How should I set up the files so that I can keep using the same snakefile?

Thanks in advance, Joao

jefftc commented 1 year ago

Hi Joao,

To do this, you will need to skip the early part of the pipeline where it deconvolutes the CellRanger results and provide from your SmartSeq data one BAM for each cell.

If you look in the Snakefile, the output files created in each step of the workflow are documented. We would be skipping step 02_demux, and starting with the files in step 03_preproc.

One complication is that the pipeline processes batches of cells together so that it runs faster. However, since you are using SmartSeq, you won't be working with as many cells as 10X, so it will not be computationally prohibitive to run all the cells into one batch. You can simplify the pipeline for yourself by just processing all the cells for each sample in a single batch. To do this, set in the Snakefile, the following parameter to a very large number (larger than the total number of cells): CELLS_PER_BATCH_PREPROCESSING = 999999

Then, try starting with: data/ cells.txt genome.fa known_sites1.vcf.gz known_sites1.vcf.gz.tbi known_sites2.vcf.gz known_sites2.vcf.gz.tbi known_sites3.vcf.gz known_sites3.vcf.gz.tbi output/ 03_preproc/{sample}.0.cells.bam/ {cell}.bam

The data/ directory is set up the same way, except no cell ranger results are needed.

The output/03_preproc/{sample}.0.cells.bam/ directory (for {sample} and batch 0, the only batch) contains the BAM files for each cell from {sample}. The BAM files should be named {cell}.bam where {cell} is the name of the cell (it can be a barcode, or the plate location, etc).

I haven't tested this, and it's possible there's some header information in the BAM files that may need to be fixed. In any case, please run this and let me know if you run into problems.

Thanks, Jeff

On Aug 30, 2023, at 5:18 AM, Joao MF Alves @.***> wrote:

Hi, Thanks for this very useful package! I'm interesting in running the tool on some smart-seq2 datasets. How should I set up the files so that I can keep using the same snakefile? Thanks in advance, Joao — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you are subscribed to this thread.Message ID: @.***>

jmfa commented 1 year ago

Hi Jeff,

Thanks for your help. I did manage to modify the snakefile to get the tool running. However, I got this error during the parse_vcf_files step:

[Mon Sep 18 11:07:08 2023]
rule parse_vcf_files:
    input: output/05_call/variants.0.snp_only.vcf
    output: output/05_call/variants.table.txt
    jobid: 34

Traceback (most recent call last):
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 72, in <module>
    main()
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 27, in main
    assert len(cols) == 10
AssertionError
[Mon Sep 18 11:07:09 2023]
Error in rule parse_vcf_files:
    jobid: 34
    output: output/05_call/variants.table.txt

RuleException:
CalledProcessError in line 1066 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED:
Command 'set -euo pipefail;  /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py' returned non-zero exit status 1.
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2347, in run_wrapper
  File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED", line 1066, in __rule_parse_vcf_files
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 568, in _callback
  File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 554, in cached_or_run
  File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2359, in run_wrapper
Removing output files of failed job parse_vcf_files since they might be corrupted:
output/05_call/variants.table.txt
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-09-18T110707.810852.snakemake.log

The input file is correctly placed under "output/05_call/variants.0.snp_only.vcf". Any idea of what might be causing this problem?

Thanks again, J

jefftc commented 1 year ago

Hi Joao,

It looks like there's a problem parsing the VCF file. Can you send me the contents of the directory: output/05_call/

I'd like to start by taking a look at those VCF files.

Thanks, Jeff

On Sep 18, 2023, at 4:13 AM, Joao MF Alves @.***> wrote:

Hi Jeff, Thanks for your help. I did manage to modify the snakefile to get the tool running. However, I got this error during the parse_vcf_files step: [Mon Sep 18 11:07:08 2023] rule parse_vcf_files: input: output/05_call/variants.0.snp_only.vcf output: output/05_call/variants.table.txt jobid: 34 Traceback (most recent call last): File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 72, in main() File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py", line 27, in main assert len(cols) == 10 AssertionError [Mon Sep 18 11:07:09 2023] Error in rule parse_vcf_files: jobid: 34 output: output/05_call/variants.table.txt RuleException: CalledProcessError in line 1066 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED: Command 'set -euo pipefail; /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpx196dkvi.parse_vcf_files.py' returned non-zero exit status 1. File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2347, in run_wrapper File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_ED", line 1066, in __rule_parse_vcf_files File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 568, in _callback File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 554, in cached_or_run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2359, in run_wrapper Removing output files of failed job parse_vcf_files since they might be corrupted: output/05_call/variants.table.txt Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-09-18T110707.810852.snakemake.log The input file is correctly placed under "output/05_call/variants.0.snp_only.vcf". Any idea of what might be causing this problem? Thanks again, J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jmfa commented 1 year ago

05_call.zip Hi Jeff, I'm dropping here the files in the 05_call folder. Thanks for your help, J

jefftc commented 1 year ago

Thanks for the files--they were helpful.

It looks like the @RG lines in the BAM files had different sample names for each sample. After merging into a pseudobulk sample, the variant caller was generating calls for each sample separately, which was confusing the VCF parser because it expected only a single set of calls for the pseudobulk.

I have made a small change that changes the sample names in all the BAM files to "pseudobulk". I'm running some tests here, but if you would like to try it out in the meantime, please update your code and try re-running the pipeline. You will need to start from the beginning--delete any output files that have been generated so far.

Let me know if this does not fix the issue, or if there are further problems.

Thanks, Jeff

On Sep 19, 2023, at 3:33 AM, Joao MF Alves @.***> wrote:

05_call.zip Hi Jeff, I'm dropping here the files in the 05_call folder. Thanks for your help, J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jmfa commented 1 year ago

Hi Jeff,

I updated the code and this issue seems to be solved. However, I got a different error at a later stage of the pipeline:

` [Thu Oct 19 07:04:21 2023] Finished job 64. 36 of 55 steps (65%) done Select jobs to execute...

[Thu Oct 19 07:04:21 2023] rule list_sites_with_mixed_genotypes: input: output/07_filter/genotype.count.txt output: output/07_filter/mixed_genotypes.txt jobid: 63

[Thu Oct 19 07:04:21 2023] Error in rule list_sites_with_mixed_genotypes: jobid: 63 output: output/07_filter/mixed_genotypes.txt

RuleException: CalledProcessError in line 1479 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2: Command 'set -euo pipefail; /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpvo_o2ml8.list_sites_with_mixed_genotypes.py' returned non-zero exit status 1. File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2347, in run_wrapper File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2", line 1479, in rule_list_sites_with_mixed_genotypes File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 568, in _callback File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 554, in cached_or_run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init__.py", line 2359, in run_wrapper Removing output files of failed job list_sites_with_mixed_genotypes since they might be corrupted: output/07_filter/mixed_genotypes.txt Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-10-18T192121.168608.snakemake.log `

The input file (output/07_filter/genotype.count.txt) looks fine so I'm not sure what's causing this error. Any idea?

Thanks J

jefftc commented 1 year ago

Hmmm... There's not an obvious error message here, so not sure where this is failing.

Can you send me the input file: output/07_filter/genotype.count.txt

The value of "KEEP_SITES_WITH_MIXED_GENOTYPES" in your Snakefile.

Also the log file: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-10-18T192121.168608.snakemake.log

Thanks, Jeff

On Oct 19, 2023, at 2:58 AM, Joao MF Alves @.***> wrote:

Hi Jeff, I updated the code and this issue seems to be solved. However, I got a different error at a later stage of the pipeline: [Thu Oct 19 07:04:21 2023] Finished job 64. 36 of 55 steps (65%) done Select jobs to execute... [Thu Oct 19 07:04:21 2023] rule list_sites_with_mixed_genotypes: input: output/07_filter/genotype.count.txt output: output/07_filter/mixed_genotypes.txt jobid: 63 [Thu Oct 19 07:04:21 2023] Error in rule list_sites_with_mixed_genotypes: jobid: 63 output: output/07_filter/mixed_genotypes.txt RuleException: CalledProcessError in line 1479 of /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2: Command 'set -euo pipefail; /mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/bin/python3.7 /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/scripts/tmpvo_o2ml8.list_sites_with_mixed_genotypes.py' returned non-zero exit status 1. File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2347, in run_wrapper File "/mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/Snakefile_v2", line 1479, in __rule_list_sites_with_mixed_genotypes File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 568, in _callback File "/mnt/netapp1/Optcesga_FT2_RHEL7/2020/gentoo/22072020/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 554, in cached_or_run File "/opt/cesga/2020/software/Core/snakemake/6.0.5/lib/python3.7/site-packages/snakemake/executors/init.py", line 2359, in run_wrapper Removing output files of failed job list_sites_with_mixed_genotypes since they might be corrupted: output/07_filter/mixed_genotypes.txt Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /mnt/lustre/hsm/nlsas/notape/home/uvi/be/posadalab/jmf.alves/SC-RNA/PhylinSic/PhylinSic_Project/.snakemake/log/2023-10-18T192121.168608.snakemake.log The input file (output/07_filter/genotype.count.txt) looks fine so I'm not sure what's causing this error. Any idea? Thanks J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jmfa commented 1 year ago

Hi Jeff,

I'm attaching the files you requested. Also, this is what I have in the snakefile: "KEEP_SITES_WITH_MIXED_GENOTYPES = 20"

Thanks, J 2023-10-18T192121.168608.snakemake.log genotype.count.txt

jefftc commented 1 year ago

Thanks!

The issue was that some of the chromosome names contain underscores, e.g.: chr11_KI270831v1_alt

The code was not expecting that, and was throwing an exception. I'm not sure why the error message did not show up in the logs. In any case, I found two places where this can cause issues and have fixed them. Let me know if there are any other places that I've missed.

Please update the code from github and try again.

Thanks for helping me to track down and fix these issues!

Jeff

On Oct 24, 2023, at 3:11 AM, Joao MF Alves @.***> wrote:

Hi Jeff, I'm attaching the files you requested. Also, this is what I have in the snakefile: "KEEP_SITES_WITH_MIXED_GENOTYPES = 20" Thanks, J 2023-10-18T192121.168608.snakemake.log genotype.count.txt — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jmfa commented 1 year ago

Hi Jeff, Ok so I think we're pretty close but now there's an error with the beast2 section of the pipeline. This is the error I get:

[1] "Start with 46 sequences and 1716 positions." [1] "Final data has 46 sequences and 1716 positions, 0.0% sparse." [1] "Fitting tree over 100,000 iterations." Error instop_vctrs()`: ! Input must be a vector, not an environment. Backtrace: ▆

├─global run.beast(...)
│ └─beautier::create_tracelog(trace.log, log_every = sample.interval)
│ └─beautier::check_tracelog(tracelog)
│ └─beautier::check_tracelog_values(tracelog)
│ └─beautier::check_filename(tracelog$filename, allow_na = TRUE)
│ └─stringr::str_detect(filename, " ")
│ └─stringr:::check_lengths(string, pattern)
│ └─vctrs::vec_size_common(...)
└─vctrs:::stop_scalar_type(<fn>(<env>), "")
1. └─vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x)
2. └─rlang::abort(message, class = c(class, "vctrs_error"), ...) `

Any idea of what might be causing this? I had to install a lot of packages manually as I'm currently using R version 4.0.4. However all required packages seem to be installed and working.

Thanks J

jefftc commented 11 months ago

Hi Joao,

I have tried re-running the pipeline from my end and cannot replicate this error. My guess is that could be caused by:

Bad data being fed into beast2.
Incompatibility among beautier, stringr, or vctrs.

For #1, it looks like the data is reasonable from the error message.

2 could be an issue, as other people have reported this issue online and resolved it by changing the version of the stringr or vctrs packages.

https://stackoverflow.com/questions/75048803/how-to-solve-input-must-be-a-vector-not-an-environment-when-crawling-website

Maybe try this first? I am using: beautier 2.6.6 stringr 1.5.0 vctrs 0.6.3

Jeff

On Oct 30, 2023, at 6:03 AM, Joao MF Alves @.***> wrote:

Hi Jeff, Ok so I think we're pretty close but now there's an error with the beast2 section of the pipeline. This is the error I get: [1] "Start with 46 sequences and 1716 positions." [1] "Final data has 46 sequences and 1716 positions, 0.0% sparse." [1] "Fitting tree over 100,000 iterations." Error instop_vctrs()`: ! Input must be a vector, not an environment. Backtrace: ▆ • ├─global run.beast(...) • │ └─beautier::create_tracelog(trace.log, log_every = sample.interval) • │ └─beautier::check_tracelog(tracelog) • │ └─beautier::check_tracelog_values(tracelog) • │ └─beautier::check_filename(tracelog$filename, allow_na = TRUE) • │ └─stringr::str_detect(filename, " ") • │ └─stringr:::check_lengths(string, pattern) • │ └─vctrs::vec_size_common(...) • └─vctrs:::stop_scalar_type((), "") • └─vctrs:::stop_vctrs(msg, "vctrs_error_scalar_type", actual = x) • └─rlang::abort(message, class = c(class, "vctrs_error"), ...)

` Any idea of what might be causing this? I had to install a lot of packages manually as I'm currently using R version 4.0.4. However all required packages seem to be installed and working. Thanks J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jmfa commented 11 months ago

Hi Jeff, Thanks again for your help. I'm now re-running the full pipeline with the updated R packages.

Small question though:

is the mutations.fa the final callset from Phylinsic or are there any additional filtering/genotyping steps after the beast run? The reason I'm asking is because I may want to test a different tree estimation method using the genotype calls from Phylinsic but I'm unsure as to whether this fasta file represents the final genotype calls.

Thanks, J

jmfa commented 11 months ago

Hi (again) Jeff,

Sorry to keep posting here but I was comparing the genotypes and fasta files that Phylinsic outputs and I have noticed that the make_fasta_file.py script may be assigning incorrect alleles for heterozygous sites.

For instance, here's the info printed in the genotypes.txt file for Cell_02:

# genotypes.txt
Site    Cell_02
chr1_139213_A_G A
chr1_139266_T_C R
chr1_139441_A_G R
chr1_629274_G_A H
chr1_630596_C_T A
chr1_631580_T_C H
chr1_958492_T_C R
chr1_1562772_C_T    R

If I'm not mistaken, the genotype calls should be used as follows: A = Homozygous alternative, R = Homozygous reference, H = Heterozygous call.

However, this is the output info in the fasta file:

>Cell_02
GTACTATC

As you can see, the a C & A alleles were assigned to position 4 and 6, respectively, instead of R & Y.

I didn't get any errors in this step of the pipeline so I'm wondering whether this is a bug in the code.

Cheers, J

jefftc commented 11 months ago

Hi Joao,

This is as designed. We were not getting good results using ambiguous codes in the alignments, and were more successful when encoding heterozygous positions using an arbitrary base. It is possible that something better could be done with a deeper understanding of BEAST2 or how to encode site models, but we haven't revisited this issue yet.

Jeff

On Nov 24, 2023, at 4:47 AM, Joao MF Alves @.***> wrote:

Hi (again) Jeff, Sorry to keep posting here but I was comparing the genotypes and fasta files that Phylinsic outputs and I have noticed that the make_fasta_file.py script may be assigning incorrect alleles for heterozygous sites. For instance, here's the info printed in the genotypes.txt file for Cell_02:

genotypes.txt

Site Cell_02 chr1_139213_A_G A chr1_139266_T_C R chr1_139441_A_G R chr1_629274_G_A H chr1_630596_C_T A chr1_631580_T_C H chr1_958492_T_C R chr1_1562772_C_T R

If I'm not mistaken, the genotype calls should be used as follows: A = Homozygous alternative, R = Homozygous reference, H = Heterozygous call. However, this is the output info in the fasta file:

Cell_02 GTACTATC

As you can see, the a C & A alleles were assigned to position 4 and 6, respectively, instead of R & Y. I didn't get any errors in this step of the pipeline so I'm wondering whether this is a bug in the code. Cheers, J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>

jefftc commented 11 months ago

Yes, that's the file used to make the phylogeny.

Jeff

On Nov 23, 2023, at 3:48 AM, Joao MF Alves @.***> wrote:

Hi Jeff, Thanks again for your help. I'm now re-running the full pipeline with the updated R packages. Small question though: • is the mutations.fa the final callset from Phylinsic or are there any additional filtering/genotyping steps after the beast run? The reason I'm asking is because I may want to test a different tree estimation method using the genotype calls from Phylinsic but I'm unsure as to whether this fasta file represents the final genotype calls. Thanks, J — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you commented.Message ID: @.***>