emmanuelparadis / psmcr

R Port of psmc
Other
12 stars 4 forks source link

is PSMCR only compatible with samtools-0.1.18/0.1.19 and paired-end short reads? #7

Closed sjfleck closed 2 years ago

sjfleck commented 2 years ago

Hello, I'm pretty familiar with PSMC and have used it a number of time to create population history plots. I'm currently working with some samples that only have Oxford Nanopore long-read data to create sorted BAM files with. From my understanding, PSMC only works with samtools-0.1.18 or samtools-0.1.19 and these ancient versions predate long-read technology. At least in my experience, these samtools versions clearly have issues calling variants from a SAM file created by mapping long-reads onto a reference genome with minimap2, for example.

I assume this still holds true for your version, but I wanted to make sure. If I can use minimap2 to create the SAM file and pipe it into a newer samtools version to created a sorted BAM file, I should be able to use this command to have my input for your version:

bcftools mpileup -C 50 -Ou --threads 16 -f ref.fa aln.sorted.bam | bcftools call -c -Ov -o aln.sorted.pdrm.original_call.vcf

From here, am I understanding correctly that running "Rscript $PSMCR/VCF2DNAbin.R" will convert the VCF to the psmcfa file, essentially replacing the need to do the VCF->FASTQ->PSMCFA conversions?

Thank you for any help you can provide and in the mean time, I'm looking for newer tools (if they even exist) that can do PSMC-like analyses for my data.

emmanuelparadis commented 2 years ago

Hello,

Yes, you shoud be able to obtain a psmcfa file from a VCF file and a reference genome file (also in FASTA) whatever the plarform used to generate the reads and their mapping.

Have you seen the package Rsubread on BioConductor? It's quite simple to use and it can map short or long reads; the results are output in a BAM file.

Best, E.

sjfleck commented 2 years ago

I'm not familiar with Rsubread, but I can try it out

As I mentioned, I've had many errors during the mpileup step (using the ancient bamtools/bcftools and original PSMC) with BAM files created with minimap2. I'll give Rsubread a try and use the newest bcftools to create a VCF to use with your PSMCR

sjfleck commented 2 years ago

I'm having some trouble running the VCF2DNAbin function. My goal was to run it on my university's cluster instead of running it locally. I saw "The function VCF2DNAbin takes as input the VCF file and the reference genome and outputs a consensus sequence" but I had a tough time figuring out how to use "Rscript $PSMCR/VCF2DNAbin.R."

Since I wasn't making progress submitting it to my cluster, I decided to run it locally in R: VCF2DNAbin("file.vcf.gz", refgenome = "file.fasta.gz", individual = 1, quiet = FALSE)

But ended up with an error: Error in read.FASTA(refgenome) : could not find function "read.FASTA"

I found a package with the read.FASTA() function, but after loading library(seqinr), it still gave the same error. I also tried:

VCF2DNAbin("file.vcf.gz", refgenome = NULL, individual = 1, quiet = FALSE)

but it also gave the same error. Any help with solving this would be greatly appreciated. Thank you in advance

emmanuelparadis commented 2 years ago

It's far far better that you install the package and load it: this will import all the functions you need. Rscript is mostly used to run R in batch mode (I usually prefer to use R CMD BATCH).

sjfleck commented 2 years ago

Thank you for your help. Your suggestion to use "R CMD BATCH" was very useful and I was able to create the plots I needed. I have 2 follow up questions:

(1.) can this version also plot the bootstraps onto the plot, such as in this published PSMC plot from Nadachowska-Brzyska et al. (2016)? mec13540-fig-0008-m

(2.) I haven't been able to find an option for it, but can plot.psmc transform the X and Y axes? For example, using a log transformation on the X-axis so I can see the recent events more clearly. I also think it would be easier to look at if the Y-axis values could be scaled by a custom power of 10 (i.e. 10^5 for my current plots).

Any help would be greatly appreciated. Thank you

emmanuelparadis commented 2 years ago

(1) Yes the default is to show the bootstrap values: like this.

(2) Yes you can use the standard option log which can be one of these: log = "x", log = "y", log = "xy". Be careful that by default 0 is included in the range of values on the x-axis because it is the present time: this will cause a problem if you log-transform this axis. You can then change the option xlim which is NULL by default to something like: xlim = c(1e4, 1e7).

Best, Emmanuel

sjfleck commented 2 years ago

For some reason, my plot doesn't show the bootstraps. First, I create my .rds files with 100 bootstraps:

library(ape) library(psmcr) ref<-psmcr::VCF2DNAbin("sample_1.vcf.gz", refgenome = "sample_1.fasta", individual = 1, quiet = FALSE) ref2<-psmcr::seqBinning(ref, bin.size=100) fil<-tempfile(pattern = "data", tmpdir = tempdir(), fileext = ".rds") saveRDS(ref2, file = 'data.rds') x<-readRDS('data.rds') o <- psmc(x, parapatt = "4+25*2+4+6", maxt = 10, niters = 20, B = 100, trunksize = 5e4, mc.cores = 4) saveRDS(o, "PSMCR.sample_1.output_run_psmc.rds")

I do this for sample 2 and 3 as well. Then I create a combined plot:

library(ape) library(psmcr) PSMC1 <- readRDS("PSMCR.sample_1.output_run_psmc.rds") PSMC2 <- readRDS("PSMCR.sample_2.output_run_psmc.rds") PSMC3 <- readRDS("PSMCR.sample_3.output_run_psmc.rds") pdf("PSMCR.Samples.pdf") plot(PSMC1, mutation.rate = 2.5e-9, g = 15, col = "blue", ylab = "Effective Population Size (Ne)", xlab = "Time (yr) (generation time = 15, mutation rate = 2.5 x 10^-9)", log = 'x', xlim = c(1e3, 5e6)) lines(PSMC2, mutation.rate = 2.5e-9, g = 15, col = "green") lines(PSMC3, mutation.rate = 2.5e-9, g = 15, col = "red")

PSMCR.Fraxinus.call_c.pdf

But I'm only getting a singe line for each sample. Do you see where I might be causing this to happen? Thank you

emmanuelparadis commented 2 years ago

Did you try to plot directly the object o (i.e., before saving )? Can you send me the file 'PSMCR.sample_1.output_run_psmc.rds'?

sjfleck commented 2 years ago

I didn't plot the object before saving this time around, but I can give it a shot.

Here is the file: PSMCR.sample_1.output_run_psmc.rds.zip

emmanuelparadis commented 2 years ago

This ZIP contains only an empty folder...

sjfleck commented 2 years ago

I apologize for that. Here is a zip file that should contain the file

PSMCR.sample_1.output_run_psmc.rds.zip

emmanuelparadis commented 2 years ago

Thanks! It was a bug that made the bootstrap lines invisible (actually too much transparency). This should be fixed. You need to reinstall from GH.