PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data

IPD mean values #53

Open clarepacini opened 6 years ago

clarepacini commented 6 years ago

Hi,

I am trying to assign base modifications on a per-read basis. I can extract the raw IPD data for each read from the BAM file, but I do not have the corresponding in silico values that are stored in the h5 files. Would it be possible to provide these, or a script that can parse the h5 files and give me the IPD reference models for each context?

I also have a couple of questions to check my understanding. While the coverage is calculated from all reads that match at the modified base, it seems that only a subset of these is used for the IPD calculations: those reads that also match the reference on either side of the modified base (three bases in total). So while coverage may be given as, say, 80, possibly fewer than 80 reads are used in the calculation. Is this correct? Also, where you discuss using a t-test, is this actually a multivariate t-test on a vector of IPD values around the context, rather than averaging all IPD values to a single value?

Apologies for all the questions, any help would be appreciated!

Thanks,

Clare

rhallPB commented 6 years ago

The storage of the model in the h5 file isn't as simple as a list of mean IPD values per context. You would need to dive into https://github.com/PacificBiosciences/kineticsTools/blob/master/kineticsTools/ipdModel.py to see how the model data is actually used. The coverage is for a region of 12 bases around the base being tested, and there is also a mapQV cutoff. The t-test is on the IPD of the modified base, where the expected IPD of that base is predicted using the 12-base context. A GTB machine learning algorithm is used to predict whether a base is modified, using the 12-base context.

One thing I would point out, there is not enough signal in a single read to get a meaningful prediction of base modification.
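For intuition, the per-site comparison described above can be sketched as a Welch's t statistic between observed IPDs (pooled across many reads, per the note that a single read carries too little signal) and model-predicted IPDs. This is an illustrative sketch with synthetic numbers, not the exact kineticsTools implementation; the function name and all values are assumptions for the example:

```python
import math
import statistics

def welch_t(observed, predicted):
    """Welch's t statistic between two samples of IPDs.

    Illustrative only: kineticsTools compares observed IPDs at a site
    against its in-silico model's prediction; exact details differ.
    """
    m1, m2 = statistics.mean(observed), statistics.mean(predicted)
    v1, v2 = statistics.variance(observed), statistics.variance(predicted)
    n1, n2 = len(observed), len(predicted)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Synthetic IPDs (arbitrary units): a modification typically slows the
# polymerase, inflating the observed IPD relative to the prediction.
observed = [2.1, 1.9, 2.4, 2.2, 2.0, 2.3, 1.8, 2.5]    # pooled across reads
predicted = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9]  # in-silico control

t = welch_t(observed, predicted)  # large |t| suggests a kinetic signal
```

A large positive t here reflects the pooled IPD excess at the site; with only one read, the variance term would dominate and the statistic would be uninformative.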

clarepacini commented 6 years ago

Thanks! That is useful information. I was planning to look only at those bases that have been identified as modified by ipdSummary and then try fitting a mixture distribution at those locations, similar to the calculations used to estimate the methylated-fraction statistics, e.g. in the estimateSingleFraction function. I have looked a little at ipdModel.py; do you happen to have any documentation for the h5 model files explaining what LeftNodes, RelativeInfluence, Variables, etc. are? Or is it possible to get the data and scripts you used to fit the GBM in R?
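As a rough illustration of the mixture-fitting idea mentioned above (not the actual estimateSingleFraction code), a two-component Gaussian mixture fit by EM on per-read IPDs at one site could look like the sketch below; the weight of the high-IPD component then estimates the modified fraction. All data, initializations, and parameters here are synthetic assumptions:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(data, iters=100):
    """Minimal EM for a two-component Gaussian mixture on per-read IPDs.

    Illustrative only: kineticsTools' fraction estimate uses its own
    formulation; this just shows the mixture idea.
    """
    mu = [min(data), max(data)]   # crude initialization at the extremes
    sigma = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means, and standard deviations
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
    return w, mu, sigma

# Synthetic site: 60% of reads drawn from a slow (modified-like) IPD
# distribution, 40% from an unmodified-like one.
random.seed(0)
data = ([random.gauss(1.0, 0.15) for _ in range(40)]
        + [random.gauss(2.2, 0.25) for _ in range(60)])
w, mu, sigma = em_two_gaussians(data)
methylated_fraction = w[1] if mu[1] > mu[0] else w[0]
```

With well-separated components like these, the recovered fraction lands near the true 0.6; with realistic, overlapping IPD distributions the estimate is much noisier, which is one reason pooled coverage matters.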

GDelevoye commented 3 years ago

Hello @clarepacini

Sometimes in academia, research projects take a lot of time, so maybe even in 2021 you could still take a look at a small repackaging I did a while ago to extract the in silico values for any given sequence.

It's mostly just a repackaging of PacBio's code in Python 3, plus a bit of reverse engineering. I did almost nothing myself and didn't have time to test everything properly. It seems, however, to work correctly with the SP2-C2 model (maybe also the others, I don't know).

```
guillaume@A320MA:~$ conda install -c gdelevoye ipdtools
guillaume@A320MA:~$ ipdtools --help
usage: ipdtools [-h] [--model {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}] --fastafile FASTAFILE --output_csv OUTPUT_CSV [--verbosity {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--progress_bar] [--nproc NPROC]
                [--indexing {0,1}]

optional arguments:
  -h, --help            show this help message and exit
  --model {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}, -m {SP2-C2,C2,P4-C2,P5-C3,P6-C4,XL-C2,XL-XL}
                        Choose the model for IPD prediction. See the package's README for more info. DEFAULT: SP2-C2 (PacBio Sequel I)
  --fastafile FASTAFILE, -f FASTAFILE
                        Path to a fasta file.
  --output_csv OUTPUT_CSV, -o OUTPUT_CSV
                        Output CSV file of predicted IPDs. See the README for further details on the output.
  --verbosity {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -v {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Choose your verbosity. Default: INFO
  --progress_bar, -p    Displays a progress bar
  --nproc NPROC, -n NPROC
                        Max number of processors for parallelism. DEFAULT: 1, positive integers only. Note: the program will actually use n+1 >= 2 CPUs no matter what.
```

It takes quite a long time to run and can produce big tables; it's really not well optimized (but it seems to work).