a-slide / pycoQC

pycoQC computes metrics and generates interactive QC plots from the sequencing summary report generated by Oxford Nanopore Technologies basecallers (Albacore/Guppy)
https://a-slide.github.io/pycoQC/
GNU General Public License v3.0

Many contigs in reference break pycoQC report html (bam analysis) #106

Closed Marc-Ruebsam closed 4 years ago

Marc-Ruebsam commented 4 years ago

Describe the bug

I have a reference with many (695171) individual sequences (contigs) in it (16S references). I used minimap2 to align my reads and samtools to create an indexed BAM file. I ran

$ pycoQC --sample 100000 --summary_file path/to/sequencing_summary_barcode03.txt --bam_file path/to/barcode03_alignment_sorted.bam --html_outfile pycoQC_report.html --json_outfile pycoQC_report.json

without an error.

When I try to open pycoQC_report.html in Chrome, it shows the initial plots, but then freezes and eventually crashes. Firefox does not even get that far and crashes without showing any content.

I was already expecting something like this, as I have the same issue with NanoPlot for this analysis, but not for an alignment to a single reference. As expected, the report contains very long lines (one with 27M characters containing all the contig names as labels):

$ awk '{print NR,length($0)}' pycoQC_report.html | sort -k2 -n | tail -n 3
224 497780
392 1285181
464 27237450
$ awk 'NR==464{ for(i=1; i<=1000; i++) {print $i} }' pycoQC_report.html
{"height":
500,
"hovermode":
"closest",
"legend":
{"x":
-0.2,
"xanchor":
"left",
"y":
1,
"yanchor":
"top"},
"plot_bgcolor":
"whitesmoke",
"shapes":

...

"xaxis":
{"showgrid":
false,
"showline":
true,
"tickangle":
-45,
"ticktext":
["GY203941.1.1493",
"GY324971.1.1500",
"JQ765433.1.1505",
"JQ765578.1.1444",
"JQ766308.1.1248",

...
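
For anyone without awk at hand, the line-length check above can be approximated in Python. This is only a sketch: the function names are made up for illustration, and it counts decoded characters per line, which may differ slightly from what awk reports depending on locale.

```python
def longest_lines(lines, top=3):
    """Return (line_number, length) pairs for the `top` longest lines.

    Line numbers are 1-based and the result is sorted by length, shortest
    first, mirroring `sort -k2 -n | tail -n 3`.
    """
    lengths = [
        (number, len(line.rstrip("\n")))
        for number, line in enumerate(lines, start=1)
    ]
    return sorted(lengths, key=lambda pair: pair[1])[-top:]


def longest_lines_in_file(path, top=3):
    """Apply longest_lines to a text file such as an HTML report."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return longest_lines(handle, top)
```

Calling `longest_lines_in_file("pycoQC_report.html")` on the report should point at the same few oversized lines.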

Is there a way to render the images outside the browser?

To Reproduce

Steps to reproduce the behavior:

  1. get a reference with around 700000 contigs
  2. align some reads to it using minimap2
  3. create a pycoQC report
  4. open the report in browser

Versions

Kind regards

a-slide commented 4 years ago

I am not surprised. This is an awful lot of contigs. Can you run pycoQC in verbose mode and copy the log?

My guess is that it breaks in the coverage plot. So I would also suggest running pycoQC with a modified configuration file that leaves out the coverage plot. See how to modify the config file in the "Advanced configuration with custom json file" section of the command line usage docs: https://a-slide.github.io/pycoQC/pycoQC/CLI_usage/

Marc-Ruebsam commented 4 years ago

Here is the log: pycoQC_report.log

But let me state again, the job itself is not crashing. The html report is unable to load in the browser. Thanks for the hint, I'll give the config adaptation a try.

Feel free to close.

a-slide commented 4 years ago

Pretty sure it's because of the coverage plot.

Marc-Ruebsam commented 4 years ago

Yes, me too. I just wanted to share this experience. Maybe it makes sense not to include a coverage plot by default for more than, say, 50 contigs. It wouldn't be readable anyway. Thanks for the support. Keep up the good work.
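
A minimal sketch of what such a default could look like, assuming the number of contigs is read from the `@SQ` lines of the SAM/BAM header (for example the text printed by `samtools view -H`). The function names and the threshold of 50 are hypothetical and only illustrate the suggestion; this is not how pycoQC currently behaves.

```python
def count_contigs(sam_header_text):
    """Count reference sequences: a SAM header has one @SQ line per contig."""
    return sum(
        1 for line in sam_header_text.splitlines() if line.startswith("@SQ")
    )


def keep_coverage_plot(n_contigs, max_contigs=50):
    """Hypothetical rule: only draw the coverage plot for few contigs."""
    return n_contigs <= max_contigs
```

With a reference like the one in this issue, `keep_coverage_plot(695171)` would return `False` and the plot would be skipped.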

a-slide commented 4 years ago

That's true. Good suggestion. I was also thinking of aggregating together references shorter than X% of the total genome length.
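
The aggregation idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name, the `min_fraction` parameter standing in for "X%", and the `"other"` label for the pooled bucket are not taken from pycoQC.

```python
def aggregate_short_references(lengths, min_fraction=0.01, other_label="other"):
    """Pool references shorter than `min_fraction` of the total length.

    `lengths` maps reference name -> length in bases. References at or above
    the cutoff are kept as they are; the rest are summed under `other_label`.
    """
    total = sum(lengths.values())
    kept = {}
    pooled = 0
    for name, length in lengths.items():
        if total and length / total < min_fraction:
            pooled += length
        else:
            kept[name] = length
    if pooled:
        kept[other_label] = pooled
    return kept
```

Note that in a reference where every sequence is short, as with the 16S set in this issue, each contig is a tiny fraction of the total, so all of them would end up in the pooled bucket.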

Marc-Ruebsam commented 4 years ago

FYI, for anyone who might encounter the problem: as we expected, providing a config file without

},
  "alignment_coverage": {
    "plot_title": "Coverage overview",
    "nbins": 500,
    "color": "rgba(102,168,255,0.75)",
    "smooth_sigma": 1

solved the issue.
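
For reference, removing that section from a saved config file can be scripted. This is a sketch under assumptions: it treats `alignment_coverage` as a top-level key of the JSON config (which is how the fragment above reads), and the function names and file paths are placeholders.

```python
import json


def drop_sections(config, names=("alignment_coverage",)):
    """Return a copy of the config dict without the listed top-level sections."""
    return {key: value for key, value in config.items() if key not in names}


def write_trimmed_config(src_path, dst_path, names=("alignment_coverage",)):
    """Read a JSON config, drop the listed sections, and write the result."""
    with open(src_path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(dst_path, "w", encoding="utf-8") as handle:
        json.dump(drop_sections(config, names), handle, indent=2)
```

The trimmed file can then be passed to pycoQC as described in the CLI usage documentation linked earlier in the thread.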

P.S.: solution was tested under version pycoQC v2.5.0.17 ... I had to upgrade because of #92

a-slide commented 4 years ago

Great, thanks for your feedback.