PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
21 stars 21 forks source link

Addition to issues45 and issues46 #60

Open AlexWanghaoming opened 6 years ago

AlexWanghaoming commented 6 years ago

Dear developers, Just as @JohnUrban described long time ago, I met the same questions as #45. But my alignment format is bam. My command is: ipdSummary ../modificationAlignment_sort.bam --reference ../fg_polished.fasta --methylFraction -j 5 --identify m6A,m5C_TET,m4C --gff basemods.gff --csv basemods.csv The process always hanging at the same point every time I re-run through I use --refContigs parament. In your commit, it looks like you just make a change for cmp.h5 format.Could you please help me? Hope you can reply! yours, Alex

JonathanGordon80 commented 6 years ago

Hi,

I am seeing this same issue. If I run ipdSummary on a bam file with the -v (verbose) flag I can see that it keeps running, but stops outputting to the results file at the same point during each run, which appears to coincide with a switch in the analysed contig. Did you have any luck in getting this working?

Command: ipdSummary bamfile.sorted.bam --reference consensus.fasta --gff basemods.gff --pvalue 0.001 --numWorkers 40 --identify m4C,m6A,m5C_TET --minCoverage 3 --methylMinCov 10 -v

Cheers,

Jonathan

AlexWanghaoming commented 6 years ago

Dear JonathanGordon80, I am sorry to tell you that I have not got a solution. I have try to transfer bam to cmp.h5 with samtoh5, but it triggered another error. Hope you can succeed. Regards, Alex

rhallPB commented 6 years ago

Have you tried running with default parameters? The minCoverage, numWorkers and pvalue are far from what is normally tested, so it's possible you are just hitting an edge case. Also the m5C_TET identification is no longer supported due to the difficulty of generating m5C_TET data for testing (TET1 enzyme is no longer easy to source)

JonathanGordon80 commented 6 years ago

Thank you for your responses! I will try to run it again with default parameters and update you when it is done.

JohnUrban commented 6 years ago

Hi all,

I never resolved the issue in #45. The issue occurs even when processing each contig individually. There are several contigs where it hangs infinitely in the middle of writing the basemods.gff file.

It appears to happen on very large contigs, though this could just reflect the higher probability of encountering a problematic region. For example, in one assembly, ipdSummary stalls out on 19 of the 32 contigs > 2.5 Mb (and nothing smaller than that).

Can anyone point to which file would be best to start the debugging process in? Perhaps I will find some time in the next couple of days to gather more clues.

JohnUrban commented 6 years ago

p.s. The stalling is also present in the kinetics.csv file -- which I believe stalls at a position a little further ahead of where basemods.gff shows stalling.

JonathanGordon80 commented 6 years ago

Hi, I tested running ipdSummary with different parameters, and it worked when I didn't use the numWorkers parameter, so it seems to be a parallelization issue. This makes sense with what I observed while running it when it was crashing - the gff file would stop being written to disk around the time of a transition from one contig to another, and one of the threads would start to take up more and more memory until it ran out and crashed. Obviously it takes longer to run on one thread, but I guess it could be parallelized by splitting the bam and reference files into individual contigs and running them in parallel.

Thanks for the input, and I hope everyone manages to get it running.

JohnUrban commented 6 years ago

Some of the stalls I described above eventually un-stalled and the ipdSummary completed the contig. So it was just a matter of waiting a really long time.

Based on Jonathan's feedback, I guess this problem may arise from more than one source. I've only had one contig consume too much memory - and am looking into it. Other than that, the stalling issue has not been a memory issue for me. Moreover, it has generally stalled anywhere on a contig, from beginning to middle to end. Having said that, I process each contig individually and encounter the stalls at the same positions when doing so as when processing them all together. So my problem is independent of transitioning between contigs in general.

I checked a few of the spots, and they tend to be of higher than average coverage -- BUT other sites of even higher coverage are processed fine. So it may play a role, but is not the whole story. I've messed around w/ limiting --maxCoverage, but it has not seemed to change it. I have not yet limited --maxAlignments.

The only way I have found so far to plow right through all the stalls I've checked is this:

ipdSummary ${INPUT} --reference $REF  \
    --gff basemods.gff --csv kinetics.csv --refContigs ${contigName}

In other words, I barely specify any options/flags -- using only the defaults.

Based on a ton of other tests I've done -- I actually think what is letting it move on is not specifying anything for --identify. This results in ipdSummary not attempting to identify the modification (e.g. no attempts for 4mC, 5mC, 6mA) -- all are just called modified_base.

So overall -- the hanging may be during the step that tries to identify what type of modified base it is.

JohnUrban commented 6 years ago

For those following along, here are my updated thoughts.

  1. To see the exact window information of where ipdSummary is "hanging", make sure to specify "-v" a few times for verbosity. It is easier to interpret when -j 1. Without verbosity, those most accurate clues would be in the kinetics.csv file. ipdSummary by default processes 1000 bp at a time. So, the most recent bases reported at the end of the CSV file will tell you which window was processed last. The hang is in the subseqnt, yet unreported window.

NOTE: In the verbose output you will see that for a given 1000 bp, it adds 15 bp flanks to each side during processing. For example, for the window 1000-2000, it looks at 958-2015. The bases analyzed and reported on will only be 1000-2000 though. I assume the 15 bp flanks are there b/c the modification detection process is looking at 15 bp to each side of a central base.

  1. When ipdSummary "hangs" or "freezes" at a given position, the hang does not seem to be infinite, but can take a very long time. In some cases, it can consume a ton of memory in addition to taking a very long time. With enough resources and waiting, I tend to believe all the hangs would finish eventually -- but cannot say this definitively.

  2. If you're not processing contigs separately in parallel (--refContigs), you should be doing that. The majority of contigs (and genome) will finish without issue. The issues tend to arise on a subset of the longest contigs -- but this may only reflect the higher probability of running into a problematic window. The hangs can occur anywhere within the contigs: beginning, middle, end.

  3. After successfully finishing the majority of contigs, I have a work-around for finishing the final contigs. This can likely be automated, but most will need/want to do it manually. Note that it is a work-around, but not a fix to the code, nor a final solution. The work-around:

    • For the problem contig, identify the problem window
JohnUrban commented 6 years ago

Perhaps interesting for the developers (maybe @rhallPB ): I believe I know what is causing these hang-ups. I'd be happy for others to chime in if they can confirm/deny.

In the majority of instances I looked at (i.e. 99%), the window contained long poly-C or poly-G tracts. The few windows that didn't had long-ish poly-A or poly-T tracts. 100% of windows had long homopolymers, which are likely the issue in general. I am guessing that the --identify operations are having trouble assigning the modification signal to one or more of the homo-polymer bases while "masking" the nearby signals from others. I'm also guessing signal alignment is an issue here as well.

Regardless of the "mechanism" behind the hang-ups, the correlation w/ long homopolymers is clear.

Here are some examples from those windows when not using --identify.

Example 1:

contig_106      kinModCall      modified_base   4281557 4281557 33      +       .       coverage=17;context=CCGAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCG;IPDRatio=5.49
contig_106      kinModCall      modified_base   4281558 4281558 25      +       .       coverage=17;context=CGAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGA;IPDRatio=3.76
contig_106      kinModCall      modified_base   4281559 4281559 25      +       .       coverage=17;context=GAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGAA;IPDRatio=3.28
contig_106      kinModCall      modified_base   4281561 4281561 21      -       .       coverage=10;context=GATTCGGACAACGGGGGGGGGGGGGGGGGGTATTTACGAAT;IPDRatio=2.91
contig_106      kinModCall      modified_base   4281564 4281564 22      +       .       coverage=14;context=CGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGAATCTGT;IPDRatio=4.18
contig_106      kinModCall      modified_base   4281566 4281566 22      -       .       coverage=13;context=GGACAGATTCGGACAACGGGGGGGGGGGGGGGGGGTATTTA;IPDRatio=3.67

Example 2:

contig_151      kinModCall      modified_base   949161  949161  25      -       .       coverage=17;context=CATTTTTCGGACGTCCACCGGGGGGGGGGGGGGTTGTATCG;IPDRatio=2.71
contig_151      kinModCall      modified_base   949191  949191  29      -       .       coverage=17;context=ATGTGAATCATTAAAAAAGGTGGAAAATTTCATTTTTCGGA;IPDRatio=3.17
contig_151      kinModCall      modified_base   949259  949259  25      -       .       coverage=17;context=ACGCAAGTTTGCCGAAGTGCTAGTACTTAACTTTCGACTTC;IPDRatio=3.42
contig_151      kinModCall      modified_base   949808  949808  21      +       .       coverage=11;context=GAATTCGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCC;IPDRatio=4.44
contig_151      kinModCall      modified_base   949812  949812  30      +       .       coverage=12;context=TCGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;IPDRatio=8.72
contig_151      kinModCall      modified_base   949813  949813  26      +       .       coverage=12;context=CGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCG;IPDRatio=5.34
contig_151      kinModCall      modified_base   949814  949814  27      +       .       coverage=12;context=GTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGT;IPDRatio=8.30
contig_151      kinModCall      modified_base   949815  949815  21      +       .       coverage=12;context=TAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGTT;IPDRatio=9.73
contig_151      kinModCall      modified_base   949821  949821  21      +       .       coverage=11;context=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGTTGTCCGA;IPDRatio=4.80
c

Example 3:

contig_176      kinModCall      modified_base   29583   29583   23      +       .       coverage=16;context=ACGTTACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAAC;IPDRatio=3.95
contig_176      kinModCall      modified_base   29584   29584   21      -       .       coverage=21;context=CGTTACAGGGGGGGGGGGGGGGGGGGGGTATGGCAGTAACG;IPDRatio=4.78
contig_176      kinModCall      modified_base   29587   29587   24      -       .       coverage=23;context=CAGCGTTACAGGGGGGGGGGGGGGGGGGGGGTATGGCAGTA;IPDRatio=2.08
contig_176      kinModCall      modified_base   29587   29587   30      +       .       coverage=17;context=TACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTG;IPDRatio=4.21
contig_176      kinModCall      modified_base   29588   29588   27      +       .       coverage=17;context=ACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGT;IPDRatio=6.57
contig_176      kinModCall      modified_base   29589   29589   30      +       .       coverage=17;context=CTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTG;IPDRatio=5.55
contig_176      kinModCall      modified_base   29590   29590   21      +       .       coverage=17;context=TGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTGT;IPDRatio=3.85
contig_176      kinModCall      modified_base   29596   29596   23      +       .       coverage=12;context=ACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTGTAACGCT;IPDRatio=9.28

Example 4:

contig_635      kinModCall      modified_base   679041  679041  22      -       .       coverage=17;context=TTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGCTGT;IPDRatio=4.93
contig_635      kinModCall      modified_base   679041  679041  21      +       .       coverage=23;context=ACAGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAA;IPDRatio=2.37
contig_635      kinModCall      modified_base   679042  679042  25      -       .       coverage=17;context=GTTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGCTG;IPDRatio=5.09
contig_635      kinModCall      modified_base   679042  679042  24      +       .       coverage=22;context=CAGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAAC;IPDRatio=2.04
contig_635      kinModCall      modified_base   679043  679043  21      +       .       coverage=22;context=AGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAACG;IPDRatio=2.78
contig_635      kinModCall      modified_base   679044  679044  21      -       .       coverage=16;context=ACGTTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGC;IPDRatio=3.59
contig_635      kinModCall      modified_base   679046  679046  24      +       .       coverage=20;context=GTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAACGTTA;IPDRatio=5.92

And on and on and on.......