Closed sjfleck closed 2 years ago
Hello,
Yes, you shoud be able to obtain a psmcfa file from a VCF file and a reference genome file (also in FASTA) whatever the plarform used to generate the reads and their mapping.
Have you seen the package Rsubread on BioConductor? It's quite simple to use and it can map short or long reads; the results are output in a BAM file.
Best, E.
I'm not familiar with Rsubread, but I can try it out
As I mentioned, I've had many errors during the mpileup step (using the ancient bamtools/bcftools and original PSMC) with BAM files created with minimap2. I'll give Rsubread a try and use the newest bcftools to create a VCF to use with your PSMCR
I'm having some trouble running the VCF2DNAbin function. My goal was to run it on my university's cluster instead of running it locally. I saw "The function VCF2DNAbin takes as input the VCF file and the reference genome and outputs a consensus sequence" but I had a tough time figuring out how to use "Rscript $PSMCR/VCF2DNAbin.R."
Since I wasn't making progress submitting it to my cluster, I decided to run it locally in R: VCF2DNAbin("file.vcf.gz", refgenome = "file.fasta.gz", individual = 1, quiet = FALSE)
But ended up with an error: Error in read.FASTA(refgenome) : could not find function "read.FASTA"
I found a package with the read.FASTA() function, but after loading library(seqinr), it still gave the same error. I also tried:
VCF2DNAbin("file.vcf.gz", refgenome = NULL, individual = 1, quiet = FALSE)
but it also gave the same error. Any help with solving this would be greatly appreciated. Thank you in advance
It's far far better that you install the package and load it: this will import all the functions you need. Rscript is mostly used to run R in batch mode (I usually prefer to use R CMD BATCH).
Thank you for your help. Your suggestion to use "R CMD BATCH" was very useful and I was able to create the plots I needed. I have 2 follow up questions:
(1.) can this version also plot the bootstraps onto the plot, such as in this published PSMC plot from Nadachowska-Brzyska et al. (2016)?
(2.) I haven't been able to find an option for it, but can plot.psmc transform the X and Y axes? For example, using a log transformation on the X-axis so I can see the recent events more clearly. I also think it would be easier to look at if the Y-axis values could be scaled by a custom power of 10 (i.e. 10^5 for my current plots).
Any help would be greatly appreciated. Thank you
(1) Yes the default is to show the bootstrap values: like this.
(2) Yes you can use the standard option log
which can be one of these: log = "x"
, log = "y"
, log = "xy"
. Be careful that by default 0 is included in the range of values on the x-axis because it is the present time: this will cause a problem if you log-transform this axis. You can then change the option xlim
which is NULL
by default to something like: xlim = c(1e4, 1e7)
.
Best, Emmanuel
For some reason, my plot doesn't show the bootstraps. First, I create my .rds files with 100 bootstraps:
library(ape) library(psmcr) ref<-psmcr::VCF2DNAbin("sample_1.vcf.gz", refgenome = "sample_1.fasta", individual = 1, quiet = FALSE) ref2<-psmcr::seqBinning(ref, bin.size=100) fil<-tempfile(pattern = "data", tmpdir = tempdir(), fileext = ".rds") saveRDS(ref2, file = 'data.rds') x<-readRDS('data.rds') o <- psmc(x, parapatt = "4+25*2+4+6", maxt = 10, niters = 20, B = 100, trunksize = 5e4, mc.cores = 4) saveRDS(o, "PSMCR.sample_1.output_run_psmc.rds")
I do this for sample 2 and 3 as well. Then I create a combined plot:
library(ape) library(psmcr) PSMC1 <- readRDS("PSMCR.sample_1.output_run_psmc.rds") PSMC2 <- readRDS("PSMCR.sample_2.output_run_psmc.rds") PSMC3 <- readRDS("PSMCR.sample_3.output_run_psmc.rds") pdf("PSMCR.Samples.pdf") plot(PSMC1, mutation.rate = 2.5e-9, g = 15, col = "blue", ylab = "Effective Population Size (Ne)", xlab = "Time (yr) (generation time = 15, mutation rate = 2.5 x 10^-9)", log = 'x', xlim = c(1e3, 5e6)) lines(PSMC2, mutation.rate = 2.5e-9, g = 15, col = "green") lines(PSMC3, mutation.rate = 2.5e-9, g = 15, col = "red")
But I'm only getting a singe line for each sample. Do you see where I might be causing this to happen? Thank you
Did you try to plot directly the object o
(i.e., before saving )?
Can you send me the file 'PSMCR.sample_1.output_run_psmc.rds'?
I didn't plot the object before saving this time around, but I can give it a shot.
Here is the file: PSMCR.sample_1.output_run_psmc.rds.zip
This ZIP contains only an empty folder...
I apologize for that. Here is a zip file that should contain the file
Thanks! It was a bug that made the bootstrap lines invisible (actually too much transparency). This should be fixed. You need to reinstall from GH.
Hello, I'm pretty familiar with PSMC and have used it a number of time to create population history plots. I'm currently working with some samples that only have Oxford Nanopore long-read data to create sorted BAM files with. From my understanding, PSMC only works with samtools-0.1.18 or samtools-0.1.19 and these ancient versions predate long-read technology. At least in my experience, these samtools versions clearly have issues calling variants from a SAM file created by mapping long-reads onto a reference genome with minimap2, for example.
I assume this still holds true for your version, but I wanted to make sure. If I can use minimap2 to create the SAM file and pipe it into a newer samtools version to created a sorted BAM file, I should be able to use this command to have my input for your version:
bcftools mpileup -C 50 -Ou --threads 16 -f ref.fa aln.sorted.bam | bcftools call -c -Ov -o aln.sorted.pdrm.original_call.vcf
From here, am I understanding correctly that running "Rscript $PSMCR/VCF2DNAbin.R" will convert the VCF to the psmcfa file, essentially replacing the need to do the VCF->FASTQ->PSMCFA conversions?
Thank you for any help you can provide and in the mean time, I'm looking for newer tools (if they even exist) that can do PSMC-like analyses for my data.