GenomiqueENS / toulligQC

A post sequencing QC tool for Oxford Nanopore sequencers

Error during fastq parsing #30

Closed jsabban closed 2 weeks ago

jsabban commented 3 weeks ago

Hi!

I get an error when I use a FASTQ file as input, and I do not understand why. I am running the Docker image through Singularity. This is the command I ran:

singularity run \
src.sif toulligqc \
-a sequencing_summary.txt \
--output-directory output \
-p pass_barcode01.pod5 \
-q pass_barcode01.fastq \
-l 'barcode01,barcode02,barcode03'

And the output is:

ToulligQC version 2.7
* Initialize extractors
* Start Toulligqc info extractor
* End of Toulligqc info extractor (done in 0m0.00s)
* Start Pod5 extractor
* End of Pod5 extractor (done in 0m0.07s)
* Start fastq extractor
Processed: 4000read [00:02, 1843.16read/s]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pandas/core/internals/construction.py", line 939, in _finalize_columns_and_data
    columns = _validate_or_indexify_columns(contents, columns)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/internals/construction.py", line 986, in _validate_or_indexify_columns
    raise AssertionError(
AssertionError: 4 columns passed, passed data had 3 columns

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/toulligqc", line 33, in <module>
    sys.exit(load_entry_point('toulligqc==2.7', 'console_scripts', 'toulligqc')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/toulligqc-2.7-py3.12.egg/toulligqc/toulligqc.py", line 422, in main
    extractor.init()
  File "/usr/local/lib/python3.12/dist-packages/toulligqc-2.7-py3.12.egg/toulligqc/fastq_extractor.py", line 60, in init
    self.dataframe_1d = self._load_fastq_data()
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/toulligqc-2.7-py3.12.egg/toulligqc/fastq_extractor.py", line 264, in _load_fastq_data
    fq_df = pd.DataFrame(fq_df, columns=columns)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 806, in __init__
    arrays, columns, index = nested_data_to_arrays(
                             ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/internals/construction.py", line 520, in nested_data_to_arrays
    arrays, columns = to_arrays(data, columns, dtype=dtype)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/internals/construction.py", line 845, in to_arrays
    content, columns = _finalize_columns_and_data(arr, columns, dtype)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/pandas/core/internals/construction.py", line 942, in _finalize_columns_and_data
    raise ValueError(err) from err
ValueError: 4 columns passed, passed data had 3 columns

Can someone help me, please?

alihamraoui commented 3 weeks ago

Hi @jsabban,

It seems your FASTQ file may be triggering an exception in the parser. To help find the problem, could you please provide the first 4 lines of your FASTQ file?
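
If it helps, here is a minimal Python sketch for pulling out the first record without opening the whole file in an editor. The function name `first_fastq_record` is only illustrative and is not part of ToulligQC; it assumes one 4-line record per read and handles both plain and gzip-compressed files.

```python
import gzip
from itertools import islice


def first_fastq_record(path, n_lines=4):
    """Return the first n_lines lines of a FASTQ file (plain or .gz)."""
    # Pick the opener from the file extension; "rt" gives text mode in both cases.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # islice stops after n_lines, so large files are not read in full.
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Example call (file name taken from the command above):
# print("\n".join(first_fastq_record("pass_barcode01.fastq")))
```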

Ali

jsabban commented 3 weeks ago

Hi @alihamraoui, here are the first 4 lines of my FASTQ file:

@1a790b27-c54a-44b2-b6bb-0066afaf968e runid=6b2d611263610f8f4f521413ba294547e26a6308 read=42 ch=391 start_time=2024-04-16T16:37:38.785785+02:00 flow_cell_id=FAX94246 protocol_group_id=EMPHY sample_id=EMPHY-MLTPXreq-T16-L20-NBD114-24 barcode=barcode01 barcode_alias=barcode01 parent_read_id=1a790b27-c54a-44b2-b6bb-0066afaf968e basecall_model_version_id=dna_r10.4.1_e8.2_400bps_sup@v4.2.0
CCAATTACGTCGTTGTAGTCCAGCAAATACGTTTGTCACACAAACTTCATATTCTGGGCAACTCGGAGAGCGACTTAATGAAACATTTAAAAGATATAAACTACAATGGAAAATACACTTGCACTCATTGAAAACCCTACTCAAGGGCTTAAAACATTATCCGGAACAATTAGTTGGGAAGCTTTTAAACAAAACTGCTTTAAAGGATTTACAACAACGTACTGCACTCTTTACCAGTTGGTATAAAGTCGAATGTTCTGCCAACTTAAACCAGACCAACGTAGTTTAACGAATCGTTACGGGTATAATCCAAACTATCAGCAAACTGGGTCAAGTTCCATGAACCGATTTAAAGATCGTATTAAAGCCCGATTACGCGAATCTACTCACCAAGCACACCCGCTTTATGCAATGTCTGGTTCAATCCGTGGAACCTTTGGAAGCCGTGGGTATGAAACACGACTTCAACGTCCGCTGACTGGTCCTGTAACACAAATGAGTCAGGAATTTTTAAATCTAGTTTAAGTAAATTTCTAAGGGTATCCCATTTTATTGGTACCCCTAGAAATTTTTATATTTAAACATCATATTGTTTTGCGATATACTGGTAGAAAGCTATAATATAAACGAGCATCGTGGATAATAATGGAACCGTTATTAGACGAGAGCTTTGAAAGAAATGTTGAGGATTTCATATGTCAGGATCTACAGGAGAACGTCCTTTTAGTGACATCGTTACTAGTATTCGTTACTGGGTAATTCACAGTGTTACAATTCCGTCACTCTTTATTGCAGGATGGCTTTTTGTAAGTACTGGTTTAGCTTACGATGTATTTGGAACACCGCGTCCAAACGAATACTTTACGGATA
+
EGSSIHGJEDFGSKIIISSRIHILKLKSJOCEFKJMFHLGHEGFEFGEFS<<;;;;FHQHGJG66200588----/--/(((((7520))78:988899:CDGIENGILSKRKLSHLFBC@CLJFKSKMISNSSDABABJMISSSKSSMIGMJIIHJHIJGA56EJE>=77+***+11*54000/*.34@BJSOIINGSLMHJKKKOGMSLFCCCCGSIGSSISGOHFKINLFGGMMLIHSKSS11111=44445C64444DDNNSIISNIHLKJSSLKMMISHFJSLJMOIOKHCJHEG.--.HMBBA/-)))'''''13:;HSIJKLSNJKRKIKPLEEISSHMKISOSFIKSOKMSSKKHISSKSHOLJE>,,,,,FKOEGIHFHFNLMSOSSLNIJKELQMSISOKOSJSHMSGGEL==>=>FIIKSSSJRQMLIHSKJFSSKJMOSLJKKSOS=<<<<CBFSNGOSGSHJGLILSIGSRSSKKOSSKKMSNJJNSSKEJKHLIOLSHKKSP000000FGLSEKHJJJKRJSIRSKKPSHKSINLQSSSMOJJG<G?)<)))DC@EGIILNPFKSSSSKGSGSILSSENSSKSSGFHGNJSHNGGI:::::>QSFIHKSKSMLKKSHJKIJKSHSLSLHSSHGSJJILGSHKJSJJNSIFABBFFECD<??>1:;>@EGEPOJLKMLSNSFSIJS55655644333AAILKPKKISJHPSSLSSMSNHSMPSMGDGDCDKGF22222<<<<@JMSMHJSSIJMSKSNSJJSSSMHSMSOKRSMKD@@@@@SKPGSNRSSSKSIKGGJSSSSDDCDFJK@?@?@CFGFSLGM<77DDDGHJFENSHGEEJSEBDG@CB@>70.,)'&

alihamraoui commented 3 weeks ago

Hi @jsabban,

Thank you for reporting this issue.

I noticed that newer FASTQ headers, like the one in your file, use the sample_id tag instead of sampleid, which seems to be part of the problem.

I’ve addressed this issue in commit ab764669c0b21ba09d8aa0fc65207675d25b4a4c.
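
To illustrate the failure mode: if a parser looks up a fixed set of keys in the header and skips one that is missing, each row ends up with 3 values while 4 column names are expected, which matches the ValueError above. The sketch below accepts both spellings and always returns the same number of fields. It is not the code from the commit: `KEY_ALIASES`, `parse_fastq_header` and the choice of columns are hypothetical and only show the idea.

```python
# Hypothetical alias table: canonical column name -> spellings seen in headers.
KEY_ALIASES = {
    "runid": ("runid",),
    "sample_id": ("sample_id", "sampleid"),
    "barcode": ("barcode",),
}


def parse_fastq_header(header_line):
    """Split a FASTQ header into the read id plus selected key=value tags."""
    # The first whitespace-separated token is the read id, the rest are tags.
    read_id, *tags = header_line.lstrip("@").split()
    # Split each tag on the first "=" only, since values may contain "=".
    fields = dict(tag.split("=", 1) for tag in tags if "=" in tag)
    row = {"read_id": read_id}
    for canonical, spellings in KEY_ALIASES.items():
        # Use the first spelling that is present; None keeps the row width fixed.
        row[canonical] = next((fields[s] for s in spellings if s in fields), None)
    return row
```

With this approach, a header carrying `sample_id=...` and one carrying `sampleid=...` both produce a row with the same four keys, so building a table from the rows no longer depends on which spelling the basecaller wrote.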

A new version (2.7.1) will be available tomorrow!

Thank you, Ali

jsabban commented 3 weeks ago

Whoa, so quick! Thank you for the fix 😃