PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data

Installing kineticsTools #83

Closed venkata14 closed 3 years ago

venkata14 commented 3 years ago

Hey, I'm new to programming. How are you supposed to install this package? I installed pbcommands and pbcore and ran python3 setup.py build and install. However, it still doesn't seem to work. I am doing this on a Debian instance in GCP.

GDelevoye commented 3 years ago

Hi,

I'm not from PacBio, but I have worked a lot with this program and have modified it many times for my own purposes.

All PacBio programs are essentially written in Python 2, which is really annoying. Between Python 2, virtualenv and everything else, you might run into lots of problems during the installation.

If you just want to run it on your data without modifying anything, you might prefer installing smrttools instead.

I know it can be confusing at first (PacBio's pipelines are not really distributed according to the best recommended practices and ISO norms ...) so here are some key explanations:

Don't hesitate to ask if you need other explanations.

venkata14 commented 3 years ago

Hi Guillaume!

You are a life saver! Thank you so much!

Another question: I have PacBio RS data and not RS II data. Will "smrttools" version 7 work on RS data? The website only says it works for RS II data but does not mention RS data.

Also, this is new to me, so I'm a little confused. ipdSummary requires a .bam file, so I'm assuming I first have to use bowtie2 to align my .fastq files to my reference genome, which produces a .sam file. I've been reading about .sam/.bam files and nothing says that these files store IPDs. So how is ipdSummary able to give me the IPD for each base pair?

(Sorry, another question: is the output of ipdSummary a csv where there is an IPD next to each base pair of the reference genome?)

Thank you so much! Venkata

GDelevoye commented 3 years ago

The fasta file will just contain the nucleotides and nothing else, by definition.

The format of PacBio data has changed between RS I, RS II, Sequel I and Sequel II. It used to be stored in "h5 files", which were later replaced by .bam files containing all the subreads.

In other words, you don't have to align anything against anything to obtain a .bam file: the raw output files of the newer PacBio sequencers are already .bam files, unaligned and without any consensus, just pure raw data. These files are not standard .bam files and are not ISO-compliant; the format is described here: PB .bam Format. Be careful about which version of the documentation you are looking at.

Of course you don't have the .bam files, since your data is quite old. But if you can still get your hands on the original .h5 files from your sequencing runs, you'll have everything you need, including the IPDs (actually, the old .h5 format even contained more data than the files from the new sequencers). These files are much bigger than your .fasta files!

It is possible to convert the old .h5 format into the "new" .bam files using the program "baxtobam" (the executable is named bax2bam), which is also included in the SMRTAnalysis suite. I recommend that you find the original .h5 files of your data and convert everything into the .bam format: lots of tools exist for it that will be very useful to you, whereas the .h5 format is not so useful and brings more problems than solutions.

I am not 100% sure, but I think your data should not be analyzed with the latest version of smrtAnalysis. I think I have seen somewhere (sorry, I couldn't find the URL) that old RS data should only be analyzed with an old version of smrtAnalysis. Don't trust me too much on this point. If you go that way, be careful: some of the old versions of smrttools, when used with the in-silico control, can produce bad results and artifactual false positives, a bug that was fixed in the most recent versions. Unfortunately, the old GitHub repository that referenced these issues has disappeared.

The output of ipdSummary is described here: ipdSummary documentation

The IPDs are not output by ipdSummary (this would be way too big!); they are raw inputs stored in your .bam file, and they are the starting point for analyzing DNA modifications. The output will differ depending on whether you're using a "Whole Genome Amplified" (WGA) control or the "in-silico control" (do you have a whole-genome amplified control?)

You can't simply read the IPDs off the output of samtools view on your .bam file, or off any other file. IPDs are stored using a "lossy encoding", as described in the PacBio .bam format document I sent you above (first link). If you need really low-level information about IPDs, let me know: I have several homemade but efficient tools for manipulating them (you wouldn't need them for a basic analysis, but you might for more detailed work).
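To make the lossy encoding more concrete, here is a minimal sketch of how the 8-bit codes could be turned back into frame counts. It assumes the "codec V1" scheme as I understand it from the PacBio .bam format document: 256 codes split into four blocks of 64, with a spacing of 1, 2, 4 and then 8 frames. It also assumes the IPDs sit in an array tag named ip, with uint8 values (B:C) meaning lossy codes and uint16 values (B:S) meaning raw frame counts. The function names and the example tag value are hypothetical; check everything against the version of the format document that matches your data.

```python
def build_codec_v1_table():
    """Return the 256 frame counts that the 8-bit codes 0..255 stand for."""
    frames = []
    start = 0
    for level in range(4):      # four blocks of 64 codes each
        step = 2 ** level       # spacing doubles per block: 1, 2, 4, 8
        frames.extend(start + step * k for k in range(64))
        start = frames[-1] + step
    return frames


def decode_ip_tag(sam_field, table):
    """Decode a text-form tag such as 'ip:B:C,12,70,200' into frame counts."""
    tag, type_code, payload = sam_field.split(":", 2)
    if tag != "ip" or type_code != "B":
        raise ValueError("not an ip array tag: %r" % sam_field)
    subtype, *values = payload.split(",")
    codes = [int(v) for v in values]
    if subtype == "C":          # uint8 array: lossy codes, look them up
        return [table[c] for c in codes]
    if subtype == "S":          # uint16 array: assumed to be raw frame counts
        return codes
    raise ValueError("unexpected array subtype: %r" % subtype)


TABLE = build_codec_v1_table()

# Hypothetical tag value, written the way samtools view prints array tags.
example = decode_ip_tag("ip:B:C,12,70,200", TABLE)
print(example)  # [12, 76, 512]
```

Short IPDs are kept exactly and long ones are rounded to coarser and coarser steps, which is why the raw numbers shown by samtools view are codes rather than durations. The values are in frames, so converting them to seconds additionally requires the frame rate of the instrument.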

Perhaps this discussion could be marked as closed, if I have answered all your questions?

Don't hesitate to send me an email (see my GitHub profile) if you have further questions. Really, don't hesitate. Like you, I work (or have worked) with outdated data, and resources can be hard to find, so sharing is always great.

venkata14 commented 3 years ago

Yes, I absolutely will. Thank you so much!