GenomiqueENS / toulligQC

A post sequencing QC tool for Oxford Nanopore sequencers

barcoded P2 data, dorado-basecalled: TypeError: Invalid type for the value of the key n50: <class 'NoneType'> #28

Closed sklages closed 1 month ago

sklages commented 1 month ago

I have a few PromethION P2 flow cells, barcoded (SQK-NBD114, barcodes 1-4) and basecalled with the current dorado 0.7.2. As I understand from #27, the barcode issue has been fixed in version 2.7.

The original *_summary.tsv has some 23M records. I run ToulligQC like this:


SMPL=MySample
POD5=/path/to/pod5
BSUM=${SMPL}.sup.5mCG_5hmCG.ubam_summary.mod.tsv

toulligqc \
  --thread 20 \
  --sequencing-summary-source ${BSUM} \
  --pod5-source ${POD5} \
  --report-name ${SMPL} \
  --html-report-path ${SMPL}.TQC-Report.html \
  --data-report-path ${SMPL}.TQC-Report.data \
  --barcoding \
  --barcodes barcode01:barcode04

... which results in an error:

ToulligQC version 2.7
* Initialize extractors
* Start Toulligqc info extractor
* End of Toulligqc info extractor (done in 0m0.00s)
* Start Pod5 extractor
* End of Pod5 extractor (done in 0m0.01s)
* Start Basecaller sequencing summary extractor
  - Load sequencing summary file (54.93 MB used) in 0m3.43s
Traceback (most recent call last):
  File "/path/to/common/bin/toulligqc", line 33, in <module>
    sys.exit(load_entry_point('toulligqc==2.7', 'console_scripts', 'toulligqc')())
  File "/path/to/common/lib/python3.10/site-packages/toulligqc-2.7-py3.10.egg/toulligqc/toulligqc.py", line 423, in main
    extractor.extract(result_dict)
  File "/path/to/common/lib/python3.10/site-packages/toulligqc-2.7-py3.10.egg/toulligqc/sequencing_summary_extractor.py", line 229, in extract
    set_result_value(self, result_dict, "n50", compute_NXX(self.dataframe_dict, 50))
  File "/path/to/common/lib/python3.10/site-packages/toulligqc-2.7-py3.10.egg/toulligqc/extractor_common.py", line 39, in set_result_value
    _check_result_key_value(key, value)
  File "/path/to/common/lib/python3.10/site-packages/toulligqc-2.7-py3.10.egg/toulligqc/extractor_common.py", line 66, in _check_result_key_value
    raise TypeError("Invalid type for the value of the key {}: {} ".format(key, type(value)))
TypeError: Invalid type for the value of the key n50: <class 'NoneType'>

A subset of this file with 2.3M records works fine, but the run fails with 2.4M or more records.
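
The record-count bisection described here can be scripted. The following is a minimal sketch, assuming the summary is a plain-text TSV with exactly one header line. The helper name `write_subset` and the file names in the usage note are hypothetical and not part of ToulligQC.

```python
from itertools import islice


def write_subset(src, dst, n_records):
    """Copy the header line plus the first n_records data lines from src to dst.

    Returns the number of data lines actually written, which is smaller
    than n_records when the source file is shorter.
    """
    with open(src) as fin, open(dst, "w") as fout:
        # Assumption: the first line of the file is the column header.
        fout.write(fin.readline())
        written = 0
        for line in islice(fin, n_records):
            fout.write(line)
            written += 1
    return written
```

For example, `write_subset("full_summary.tsv", "subset_2400000.tsv", 2_400_000)` would produce a 2.4M-record file that can be passed to `--sequencing-summary-source`.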

The file itself looks "sane", and I am not really sure where to look.
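
One way to sanity-check the data behind the failing `n50` key is to compute an N50 independently from the read lengths. This is a sketch of the common N50 definition using only the Python standard library. It is not ToulligQC's `compute_NXX`, whose implementation is not shown in this thread.

```python
def n50(lengths):
    """Return the N50 of an iterable of read lengths.

    N50 is the length L such that the reads of length >= L together
    contain at least half of all sequenced bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
```

The lengths would come from the read-length column of the summary file. The column name `sequence_length_template` is an assumption here and should be checked against the file's header line.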

Any idea that could point me in the right direction? I am probably missing something very basic here, but I do not see what.

sklages commented 1 month ago

The same error occurs when using BAM as input:

ToulligQC version 2.7
* Initialize extractors
* Start Toulligqc info extractor
* End of Toulligqc info extractor (done in 0m0.00s)
* Start Pod5 extractor
* End of Pod5 extractor (done in 0m0.02s)
* Start uBAM extractor
Processed: 23175500read [26:28, 14586.23read/s]
  - Load BAM file (530.44 MB used) in 27m5.82s
Traceback (most recent call last):
<...>
    raise TypeError("Invalid type for the value of the key {}: {} ".format(key, type(value)))
TypeError: Invalid type for the value of the key n50: <class 'NoneType'>

alihamraoui commented 1 month ago

Hi @sklages,

Thank you for reporting this issue.

To reproduce the error, could you please provide a summary or a sample of the 2.4 million records?

I suspect the problem might be related to dependencies. Can you try the Docker command below and see if the issue persists?

SMPL=MySample
POD5=/path/to/pod5
BSUM=/absolute/path/${SMPL}.sup.5mCG_5hmCG.ubam_summary.mod.tsv

docker run -ti \
             -u $(id -u):$(id -g) \
             --rm \
             -v ${POD5}:${POD5} \
             -v ${BSUM}:${BSUM} \
             genomicpariscentre/toulligqc:2.7  toulligqc \
                                                  --thread 20 \
                                                  --sequencing-summary-source ${BSUM} \
                                                  --pod5-source ${POD5} \
                                                  --report-name ${SMPL} \
                                                  --html-report-path ${SMPL}.TQC-Report.html \
                                                  --data-report-path ${SMPL}.TQC-Report.data \
                                                  --barcoding \
                                                  --barcodes barcode01:barcode04

Looking forward to your response.

Best regards,

sklages commented 1 month ago

Thanks for your fast response.

This is a pip install, and I have no Docker available here, so maybe there is another way to check the dependencies?

alihamraoui commented 1 month ago

Could you provide the output of the pip list command? It lists all installed packages and their versions.

Best, Ali

sklages commented 1 month ago

This is the pip list output:

Package           Version
----------------- ---------
biopython         1.81
bokeh             3.3.1
contourpy         1.0.7
cycler            0.11.0
Cython            0.29.36
dominate          2.9.1
ezcharts          0.7.6
fonttools         4.39.4
h5py              3.11.0
iso8601           1.1.0
Jinja2            3.1.2
joblib            1.2.0
kaleido           0.2.1
kiwisolver        1.4.4
lib_pod5          0.3.11
libsass           0.23.0
mappy             2.26
MarkupSafe        2.1.3
matplotlib        3.7.1
more-itertools    9.1.0
NanoComp          1.23.1
NanoFilt          2.8.0
nanoget           1.19.1
NanoLyse          1.2.1
nanomath          1.3.0
NanoPlot          1.42.0
nanoQC            0.9.4
NanoStat          1.6.0
natsort           8.4.0
numpy             2.0.0
packaging         23.1
pandas            2.2.2
Pillow            9.5.0
pip               24.1.2
plotly            5.22.0
pod5              0.3.11
polars            0.19.12
psutil            5.9.6
pyarrow           16.1.0
pycoQC            2.5.2
pydantic          1.10.17
pyparsing         3.0.9
pysam             0.22.1
python-dateutil   2.8.2
Python-Deprecated 1.1.0
pytz              2023.3
PyYAML            6.0.1
retrying          1.3.4
scikit-learn      1.5.1
scipy             1.14.0
seaborn           0.12.2
setuptools        65.5.0
sigfig            1.3.3
six               1.16.0
sortedcontainers  2.4.0
tenacity          8.2.2
threadpoolctl     3.1.0
tornado           6.3.3
toulligqc         2.7
tqdm              4.66.4
typing_extensions 4.12.2
tzdata            2023.3
vbz-h5py-plugin   1.0.1
xyzservices       2023.10.1

sklages commented 1 month ago

Putting ToulligQC in a fresh python-3.11 venv results in the same error.

sklages commented 1 month ago

I tried on a small private Linux box [1] with Docker, just as described, using the same 2.4M-record file.

ToulligQC version 2.7
* Initialize extractors
* Start Toulligqc info extractor
* End of Toulligqc info extractor (done in 0m0.00s)
* Start Pod5 extractor
* End of Pod5 extractor (done in 0m0.00s)
* Start Basecaller sequencing summary extractor
  - Load sequencing summary file (54.93 MB used) in 0m2.20s
  - Extract info from sequencing summary file in 0m7.98s
  - Creation of image "Read count histogram" in 0m0.13s
  - Creation of image "Distribution of read lengths" in 0m1.29s
  - Creation of image "Yield plot through time" in 0m0.73s
  - Creation of image "PHRED score distribution" in 0m1.72s
  - Creation of image "PHRED score density distribution" in 0m0.37s
  - Creation of image "Channel occupancy of the flowcell" in 0m0.30s
  - Creation of image "Correlation between read length and PHRED score" in 0m0.67s
  - Creation of image "Read length over time" in 0m1.46s
  - Creation of image "PHRED score over time" in 0m1.69s
  - Creation of image "Translocation speed" in 0m1.73s
  - Creation of image "Pass barcoded reads distribution" in 0m0.03s
  - Creation of image "Fail barcoded reads distribution" in 0m0.02s
  - Creation of image "Read size distribution for barcodes" in 0m1.27s
  - Creation of image "PHRED score distribution for barcodes" in 0m1.20s
* End of Basecaller sequencing summary extractor (done in 0m22.79s)
* Write HTML report
Traceback (most recent call last):
  File "/usr/local/bin/toulligqc", line 33, in <module>
    sys.exit(load_entry_point('toulligqc==2.7', 'console_scripts', 'toulligqc')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/toulligqc-2.7-py3.12.egg/toulligqc/toulligqc.py", line 436, in main
    html_report_generator.html_report(config_dictionary, result_dict, graphs)
  File "/usr/local/lib/python3.12/dist-packages/toulligqc-2.7-py3.12.egg/toulligqc/html_report_generator.py", line 69, in html_report
    f = open(config_dictionary['html_report_path'], 'w')
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: 'MySAMPLE.TQC-Report.html'

I have no idea where docker run wants to write its data, but in general the parsing and the creation of stats/plots seem to work with Docker.

On the same Linux box I pip-installed toulligqc (venv, Python 3.12.4), which results in the same error as described initially.

[1] Manjaro Linux, 16G (7G free), Intel i7-10700T, root access

alihamraoui commented 1 month ago

Thanks, @sklages, for all the details.

Make sure you pass -u $(id -u):$(id -g) when you run Docker so that the container runs with your own user permissions. Using absolute paths for the report outputs also makes it clear where the report will be saved.
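
A relative report path such as `MySample.TQC-Report.html` is resolved against the current working directory of the process, which may not be a directory the mapped user can write to. As a rough illustration, the sketch below shows where a given path would end up and whether its parent directory is writable. The helper name `describe_report_path` is hypothetical and not part of ToulligQC.

```python
import os
from pathlib import Path


def describe_report_path(report_path):
    """Return (absolute_path, parent_is_writable) for a report path.

    A relative path is resolved against the current working directory,
    which is what open() would do as well.
    """
    target = Path(report_path).expanduser().resolve()
    parent = target.parent
    writable = parent.is_dir() and os.access(parent, os.W_OK)
    return target, writable


if __name__ == "__main__":
    path, ok = describe_report_path("MySample.TQC-Report.html")
    print(f"{path} (parent writable: {ok})")
```

If the second value is `False`, opening the report file for writing in that location is expected to fail with a `PermissionError` like the one in the traceback.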

I think this issue is related to the newer versions of NumPy and pandas.

I'm trying to reproduce the error with my data.

In the meantime, try installing pandas==2.1.4 and numpy==1.26.4 with pip.
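
Before re-running, it can help to confirm which versions the active environment actually resolves. Here is a small sketch using the standard library's `importlib.metadata`. The pins are the ones suggested here, and the helper name `pin_mismatches` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions suggested in this thread as a workaround, not an official requirement.
SUGGESTED_PINS = {"pandas": "2.1.4", "numpy": "1.26.4"}


def pin_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


if __name__ == "__main__":
    for package, installed in pin_mismatches(SUGGESTED_PINS).items():
        print(f"{package}: installed {installed}, suggested {SUGGESTED_PINS[package]}")
```

An empty result means both pins are in place. Otherwise, `pip install "pandas==2.1.4" "numpy==1.26.4"` inside the same environment applies them.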

I hope that will solve the problem.

Best regards, Ali

sklages commented 1 month ago

Indeed, it works like a charm with:

<..>
numpy             1.26.4
pandas            2.1.4
<..>
ToulligQC version 2.7
* Initialize extractors
* Start Toulligqc info extractor
* End of Toulligqc info extractor (done in 0m0.00s)
* Start Pod5 extractor
* End of Pod5 extractor (done in 0m0.00s)
* Start Basecaller sequencing summary extractor
  - Load sequencing summary file (530.44 MB used) in 0m54.03s
  - Extract info from sequencing summary file in 1m11.55s
  - Creation of image "Read count histogram" in 0m0.55s
  - Creation of image "Distribution of read lengths" in 0m13.23s
  - Creation of image "Yield plot through time" in 0m6.01s
  - Creation of image "PHRED score distribution" in 0m18.64s
  - Creation of image "PHRED score density distribution" in 0m3.69s
  - Creation of image "Channel occupancy of the flowcell" in 0m1.41s
  - Creation of image "Correlation between read length and PHRED score" in 0m5.16s
  - Creation of image "Read length over time" in 0m10.83s
  - Creation of image "PHRED score over time" in 0m13.58s
  - Creation of image "Translocation speed" in 0m13.74s
  - Creation of image "Pass barcoded reads distribution" in 0m0.05s
  - Creation of image "Fail barcoded reads distribution" in 0m0.03s
  - Creation of image "Read size distribution for barcodes" in 0m14.96s
  - Creation of image "PHRED score distribution for barcodes" in 0m13.92s
* End of Basecaller sequencing summary extractor (done in 4m1.39s)
* Write HTML report
* Write statistics files
* End of the QC extractor (done in 4m1.77s)

So for now the problem is somewhat solved by simply downgrading numpy/pandas.

Thank you for the hint (and a very nice piece of software) :-)