Open AlexWanghaoming opened 6 years ago
Hi,
I am seeing this same issue. If I run ipdSummary on a bam file with the -v (verbose) flag I can see that it keeps running, but stops outputting to the results file at the same point during each run, which appears to coincide with a switch in the analysed contig. Did you have any luck in getting this working?
Command: ipdSummary bamfile.sorted.bam --reference consensus.fasta --gff basemods.gff --pvalue 0.001 --numWorkers 40 --identify m4C,m6A,m5C_TET --minCoverage 3 --methylMinCov 10 -v
Cheers,
Jonathan
Dear JonathanGordon80, I am sorry to say that I have not found a solution. I tried converting the bam to cmp.h5 with samtoh5, but that triggered another error. I hope you succeed. Regards, Alex
Have you tried running with default parameters? The minCoverage, numWorkers and pvalue are far from what is normally tested, so it's possible you are just hitting an edge case. Also the m5C_TET identification is no longer supported due to the difficulty of generating m5C_TET data for testing (TET1 enzyme is no longer easy to source)
Thank you for your responses! I will try to run it again with default parameters and update you when it is done.
Hi all,
I never resolved the issue in #45. The issue occurs even when processing each contig individually. There are several contigs where it hangs infinitely in the middle of writing the basemods.gff file.
It appears to happen on very large contigs, though this could just reflect the higher probability of encountering a problematic region. For example, in one assembly, ipdSummary stalls out on 19 of the 32 contigs > 2.5 Mb (and nothing smaller than that).
Can anyone point to which file would be best to start the debugging process in? Perhaps I will find some time in the next couple of days to gather more clues.
p.s. The stalling is also present in the kinetics.csv file -- which I believe stalls at a position a little further ahead of where basemods.gff shows stalling.
Hi, I tested running ipdSummary with different parameters, and it worked when I didn't use the numWorkers parameter, so it seems to be a parallelization issue. This makes sense with what I observed while running it when it was crashing - the gff file would stop being written to disk around the time of a transition from one contig to another, and one of the threads would start to take up more and more memory until it ran out and crashed. Obviously it takes longer to run on one thread, but I guess it could be parallelized by splitting the bam and reference files into individual contigs and running them in parallel.
Thanks for the input, and I hope everyone manages to get it running.
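A minimal sketch of the per-contig idea suggested above: it reads contig names from FASTA header lines and builds one single-threaded ipdSummary command per contig, which could then be handed to whatever job runner you use. It only uses flags that appear in this thread (--reference, --gff, --csv, --refContigs). The output file naming and the assumption that the contig name is the first word of the FASTA header are mine, not something confirmed here.

```python
import shlex


def contig_names(fasta_lines):
    """Return the first word of every '>' header line in a FASTA file."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.append(parts[0])
    return names


def per_contig_commands(bam, reference, names):
    """Build one ipdSummary argument list per contig (no --numWorkers)."""
    commands = []
    for name in names:
        commands.append([
            "ipdSummary", bam,
            "--reference", reference,
            # Hypothetical per-contig output names, chosen for illustration.
            "--gff", f"{name}.basemods.gff",
            "--csv", f"{name}.kinetics.csv",
            "--refContigs", name,
        ])
    return commands


if __name__ == "__main__":
    demo_fasta = [">contig_106 some description", "ACGT", ">contig_151", "GGCC"]
    names = contig_names(demo_fasta)
    for cmd in per_contig_commands("bamfile.sorted.bam", "consensus.fasta", names):
        print(shlex.join(cmd))
```

The commands are only printed, not executed, so you can inspect them before submitting them to a scheduler or GNU parallel.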
Some of the stalls I described above eventually un-stalled and ipdSummary completed the contig. So it was just a matter of waiting a really long time.
Based on Jonathan's feedback, I guess this problem may arise from more than one source. I've only had one contig consume too much memory - and am looking into it. Other than that, the stalling issue has not been a memory issue for me. Moreover, it has generally stalled anywhere on a contig, from beginning to middle to end. Having said that, I process each contig individually and encounter the stalls at the same positions when doing so as when processing them all together. So my problem is independent of transitioning between contigs in general.
I checked a few of the spots, and they tend to be of higher than average coverage -- BUT other sites of even higher coverage are processed fine. So it may play a role, but is not the whole story. I've messed around w/ limiting --maxCoverage, but it has not seemed to change it. I have not yet limited --maxAlignments.
The only way I have found so far to plow right through all the stalls I've checked is this:
ipdSummary ${INPUT} --reference $REF \
--gff basemods.gff --csv kinetics.csv --refContigs ${contigName}
In other words, I barely specify any options/flags -- using only the defaults.
Based on a ton of other tests I've done -- I actually think what is letting it move on is not specifying anything for --identify. This results in ipdSummary not attempting to identify the modification (e.g. no attempts for 4mC, 5mC, 6mA) -- all are just called modified_base.
So overall -- the hanging may be during the step that tries to identify what type of modified base it is.
For those following along, here are my updated thoughts.
-j 1. Without verbosity, the most accurate clues would be in the kinetics.csv file. ipdSummary by default processes 1000 bp at a time. So, the most recent bases reported at the end of the CSV file will tell you which window was processed last. The hang is in the subsequent, yet unreported window.
NOTE: In the verbose output you will see that for a given 1000 bp, it adds 15 bp flanks to each side during processing. For example, for the window 1000-2000, it looks at 985-2015. The bases analyzed and reported on will only be 1000-2000 though. I assume the 15 bp flanks are there b/c the modification detection process is looking at 15 bp to each side of a central base.
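A rough sketch of how one might pull the last reported position per contig out of kinetics.csv, so that the window right after it can be inspected. It assumes the CSV has a header row with `refName` and `tpl` columns for the contig and the position; check your own file and adjust the column names if they differ. The `with_flanks` helper just applies the 15 bp padding described in the note above.

```python
import csv
import io


def last_reported_positions(csv_file, contig_col="refName", pos_col="tpl"):
    """Return {contig: highest position seen} for an open kinetics CSV.

    The column names are assumptions about the kinetics.csv header.
    """
    last = {}
    for row in csv.DictReader(csv_file):
        contig = row[contig_col]
        pos = int(row[pos_col])
        if pos > last.get(contig, -1):
            last[contig] = pos
    return last


def with_flanks(start, end, flank=15):
    """Pad a window on both sides, clamping the start at 0."""
    return max(0, start - flank), end + flank


if __name__ == "__main__":
    demo = io.StringIO(
        "refName,tpl,strand\n"
        '"contig_1",1998,0\n'
        '"contig_1",1999,1\n'
        '"contig_2",500,0\n'
    )
    for contig, pos in last_reported_positions(demo).items():
        print(contig, "last reported position:", pos)
```

On a real run you would open the file with `open("kinetics.csv", newline="")` and pass the handle in; the window that hangs should be the one following the reported position.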
When ipdSummary "hangs" or "freezes" at a given position, the hang does not seem to be infinite, but can take a very long time. In some cases, it can consume a ton of memory in addition to taking a very long time. With enough resources and waiting, I tend to believe all the hangs would finish eventually -- but cannot say this definitively.
If you're not processing contigs separately in parallel (--refContigs), you should be doing that. The majority of contigs (and genome) will finish without issue. The issues tend to arise on a subset of the longest contigs -- but this may only reflect the higher probability of running into a problematic window. The hangs can occur anywhere within the contigs: beginning, middle, end.
After successfully finishing the majority of contigs, I have a work-around for finishing the final contigs. This can likely be automated, but most will need/want to do it manually. Note that it is a work-around, but not a fix to the code, nor a final solution. The work-around:
Instead of using --refContigs contigName, use --referenceWindows contigName:0-end1,contigName:start2-contigEnd. In other words, provide 2 windows that leave the single problem window out. You might encounter additional problematic windows this way -- and the work-around would simply be to provide N windows that exclude the N-1 problem windows.
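For anyone who wants to automate this, here is a small sketch that builds the --referenceWindows value from a contig length and a list of known problem windows. The `contig:start-end` format is the one shown above; whether ipdSummary treats the end coordinate as inclusive is something I have not verified, so the helper simply reuses the problem window's own boundaries. The function name and inputs are made up for illustration.

```python
def reference_windows(contig, contig_len, problem_windows):
    """Return a --referenceWindows string that skips the problem windows.

    problem_windows is a list of (start, end) pairs on this contig.
    """
    keep = []
    cursor = 0
    for start, end in sorted(problem_windows):
        if start > cursor:
            keep.append((cursor, start))
        # Move past this problem window (handles overlapping entries too).
        cursor = max(cursor, end)
    if cursor < contig_len:
        keep.append((cursor, contig_len))
    return ",".join(f"{contig}:{s}-{e}" for s, e in keep)


if __name__ == "__main__":
    print(reference_windows("contig_106", 5000000, [(4281000, 4282000)]))
```

With one problem window this yields two windows, and with N-1 problem windows it yields up to N, matching the description above.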
Optional - salvaging some information for problem windows: When the non-problem windows are done, and all problem windows are known -- you can salvage some base modification information from them by not using the --identify option. As a result, for that window you will get a list of positions with a modified_base, but no identity (i.e. m5C, m4C, m6A). As an alternative, it is possible that only one of those is causing the problem (e.g. m5C) and using --identify with the non-problematic subset would work as well.
After all windows are processed, the basemods.gff and kinetics.csv files can be merged through various linux operations (cat, grep, sort, awk).
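If you would rather not chain cat/grep/sort/awk, the GFF merge can be sketched in a few lines of Python. This assumes standard tab-separated GFF records with the start coordinate in column 4 and the strand in column 7, keeps the `#` header lines from the first piece only, and sorts records by contig name, start and strand. Any per-contig header lines in the later pieces are dropped, so add them back by hand if you need them.

```python
def merge_gff(texts):
    """Merge the contents of several GFF files into one sorted GFF string."""
    headers = []
    records = []
    for index, text in enumerate(texts):
        for line in text.splitlines():
            if not line.strip():
                continue
            if line.startswith("#"):
                # Keep header/comment lines from the first file only.
                if index == 0:
                    headers.append(line)
                continue
            fields = line.split("\t")
            # Sort key: contig name, start coordinate, strand.
            records.append((fields[0], int(fields[3]), fields[6], line))
    records.sort(key=lambda record: record[:3])
    return "\n".join(headers + [record[3] for record in records]) + "\n"


if __name__ == "__main__":
    part_a = "##gff-version 3\ncontig_1\tkinModCall\tmodified_base\t20\t20\t25\t+\t.\tcoverage=17\n"
    part_b = "##gff-version 3\ncontig_1\tkinModCall\tmodified_base\t5\t5\t30\t-\t.\tcoverage=12\n"
    print(merge_gff([part_a, part_b]), end="")
```

The kinetics.csv pieces can be handled the same way: keep the header row of the first file and concatenate the remaining rows.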
Perhaps interesting for the developers (maybe @rhallPB ): I believe I know what is causing these hang-ups. I'd be happy for others to chime in if they can confirm/deny.
In the majority of instances I looked at (i.e. 99%), the window contained long poly-C or poly-G tracts. The few windows that didn't had long-ish poly-A or poly-T tracts. 100% of windows had long homopolymers, which are likely the issue in general. I am guessing that the --identify operations are having trouble assigning the modification signal to one or more of the homopolymer bases while "masking" the nearby signals from others. I'm also guessing signal alignment is an issue here as well.
Regardless of the "mechanism" behind the hang-ups, the correlation w/ long homopolymers is clear.
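If the homopolymer correlation holds for your data too, you could scan the reference ahead of time and flag windows that are likely to hang. A minimal sketch: it reports every run of a single base at or above a length threshold. The default threshold of 12 is an arbitrary guess on my part, not a value taken from ipdSummary, and the coordinates are 0-based with an exclusive end.

```python
import re


def find_homopolymers(seq, min_len=12):
    """Return (start, end, base) for each homopolymer run of >= min_len bases."""
    # One base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [
        (match.start(), match.end(), match.group(1))
        for match in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    demo = "CGTAAATA" + "C" * 18 + "GTTGTCCGAATCTGT"
    for start, end, base in find_homopolymers(demo):
        print(f"poly-{base} run of {end - start} bp at {start}-{end}")
```

Mapping each hit to its enclosing 1000 bp window would give a candidate list of windows to leave out or to run without --identify.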
Here are some examples from those windows when not using --identify.
Example 1:
contig_106 kinModCall modified_base 4281557 4281557 33 + . coverage=17;context=CCGAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCG;IPDRatio=5.49
contig_106 kinModCall modified_base 4281558 4281558 25 + . coverage=17;context=CGAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGA;IPDRatio=3.76
contig_106 kinModCall modified_base 4281559 4281559 25 + . coverage=17;context=GAATTCGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGAA;IPDRatio=3.28
contig_106 kinModCall modified_base 4281561 4281561 21 - . coverage=10;context=GATTCGGACAACGGGGGGGGGGGGGGGGGGTATTTACGAAT;IPDRatio=2.91
contig_106 kinModCall modified_base 4281564 4281564 22 + . coverage=14;context=CGTAAATACCCCCCCCCCCCCCCCCCGTTGTCCGAATCTGT;IPDRatio=4.18
contig_106 kinModCall modified_base 4281566 4281566 22 - . coverage=13;context=GGACAGATTCGGACAACGGGGGGGGGGGGGGGGGGTATTTA;IPDRatio=3.67
Example 2:
contig_151 kinModCall modified_base 949161 949161 25 - . coverage=17;context=CATTTTTCGGACGTCCACCGGGGGGGGGGGGGGTTGTATCG;IPDRatio=2.71
contig_151 kinModCall modified_base 949191 949191 29 - . coverage=17;context=ATGTGAATCATTAAAAAAGGTGGAAAATTTCATTTTTCGGA;IPDRatio=3.17
contig_151 kinModCall modified_base 949259 949259 25 - . coverage=17;context=ACGCAAGTTTGCCGAAGTGCTAGTACTTAACTTTCGACTTC;IPDRatio=3.42
contig_151 kinModCall modified_base 949808 949808 21 + . coverage=11;context=GAATTCGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCC;IPDRatio=4.44
contig_151 kinModCall modified_base 949812 949812 30 + . coverage=12;context=TCGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;IPDRatio=8.72
contig_151 kinModCall modified_base 949813 949813 26 + . coverage=12;context=CGTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCG;IPDRatio=5.34
contig_151 kinModCall modified_base 949814 949814 27 + . coverage=12;context=GTAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGT;IPDRatio=8.30
contig_151 kinModCall modified_base 949815 949815 21 + . coverage=12;context=TAAATACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGTT;IPDRatio=9.73
contig_151 kinModCall modified_base 949821 949821 21 + . coverage=11;context=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGTTGTCCGA;IPDRatio=4.80
Example 3:
contig_176 kinModCall modified_base 29583 29583 23 + . coverage=16;context=ACGTTACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAAC;IPDRatio=3.95
contig_176 kinModCall modified_base 29584 29584 21 - . coverage=21;context=CGTTACAGGGGGGGGGGGGGGGGGGGGGTATGGCAGTAACG;IPDRatio=4.78
contig_176 kinModCall modified_base 29587 29587 24 - . coverage=23;context=CAGCGTTACAGGGGGGGGGGGGGGGGGGGGGTATGGCAGTA;IPDRatio=2.08
contig_176 kinModCall modified_base 29587 29587 30 + . coverage=17;context=TACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTG;IPDRatio=4.21
contig_176 kinModCall modified_base 29588 29588 27 + . coverage=17;context=ACTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGT;IPDRatio=6.57
contig_176 kinModCall modified_base 29589 29589 30 + . coverage=17;context=CTGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTG;IPDRatio=5.55
contig_176 kinModCall modified_base 29590 29590 21 + . coverage=17;context=TGCCATACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTGT;IPDRatio=3.85
contig_176 kinModCall modified_base 29596 29596 23 + . coverage=12;context=ACCCCCCCCCCCCCCCCCCCCCTGTAACGCTGTGTAACGCT;IPDRatio=9.28
Example 4:
contig_635 kinModCall modified_base 679041 679041 22 - . coverage=17;context=TTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGCTGT;IPDRatio=4.93
contig_635 kinModCall modified_base 679041 679041 21 + . coverage=23;context=ACAGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAA;IPDRatio=2.37
contig_635 kinModCall modified_base 679042 679042 25 - . coverage=17;context=GTTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGCTG;IPDRatio=5.09
contig_635 kinModCall modified_base 679042 679042 24 + . coverage=22;context=CAGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAAC;IPDRatio=2.04
contig_635 kinModCall modified_base 679043 679043 21 + . coverage=22;context=AGCGTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAACG;IPDRatio=2.78
contig_635 kinModCall modified_base 679044 679044 21 - . coverage=16;context=ACGTTACTGCCATACCCCCCCCCCCCCCCCCCCTGTAACGC;IPDRatio=3.59
contig_635 kinModCall modified_base 679046 679046 24 + . coverage=20;context=GTTACAGGGGGGGGGGGGGGGGGGGTATGGCAGTAACGTTA;IPDRatio=5.92
And on and on and on.......
Dear developers, As @JohnUrban described a long time ago, I am hitting the same problem as #45, but my alignment format is bam. My command is: ipdSummary ../modificationAlignment_sort.bam --reference ../fg_polished.fasta --methylFraction -j 5 --identify m6A,m5C_TET,m4C --gff basemods.gff --csv basemods.csv The process hangs at the same point every time I re-run it, even though I use the --refContigs parameter. Your commit looks like it only changes the cmp.h5 code path. Could you please help me? I hope you can reply! Yours, Alex