pipeline_readqc.py fails for `ValueError: table is empty` with FastQC tables

kevinrue commented 5 years ago

Hi,

The following statement fails:

    zcat fastqc_overrepresented_sequences.tsv.gz | python -m cgatcore.csv2db  --retry  --database-url=sqlite:///./csvdb --add-index=track --table=fastqc_overrepresented_sequences > fastqc_overrepresented_sequences.load
    -----------------------------------------)' raised in ...
       Task = def loadFastQC(...):
       Job  = [fastqc_overrepresented_sequences.tsv.gz -> fastqc_overrepresented_sequences.load]

    Traceback (most recent call last):
      File "/gfs/devel/kralbrecht/cgatflow/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 748, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/gfs/devel/kralbrecht/cgatflow/conda-install/envs/cgat-f/lib/python3.6/site-packages/ruffus/task.py", line 566, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/gfs/devel/kralbrecht/cgatflow/cgat-flow/cgatpipelines/tools/pipeline_readqc.py", line 400, in loadFastQC
        P.load(infile, outfile, options="--add-index=track")
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/pipeline/database.py", line 190, in load
        run(statement)
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/pipeline/execution.py", line 1335, in run
        benchmark_data = r.run(statement_list)
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/pipeline/execution.py", line 1124, in run
        (-process.returncode, stderr, statement))
    OSError: ---------------------------------------
    Child was terminated by signal -1:
    The stderr was:
    /etc/bashrc: line 12: PS1: unbound variable
    Traceback (most recent call last):
      File "/gfs/devel/kralbrecht/cgatflow/conda-install/envs/cgat-f/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/gfs/devel/kralbrecht/cgatflow/conda-install/envs/cgat-f/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/csv2db.py", line 341, in <module>
        sys.exit(main(sys.argv))
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/csv2db.py", line 335, in main
        run(infile, options)
      File "/gfs/devel/kralbrecht/cgatflow/cgat-core/cgatcore/csv2db.py", line 115, in run
        raise ValueError("table is empty")
    ValueError: table is empty

    zcat fastqc_overrepresented_sequences.tsv.gz | python -m cgatcore.csv2db  --retry  --database-url=sqlite:///./csvdb --add-index=track --table=fastqc_overrepresented_sequences > fastqc_overrepresented_sequences.load
    -----------------------------------------

In an attempt to debug, I’ve just gone through the whole routine:

- Fresh terminal
- `git pull` the master branch of cgat-core/apps/flow within the cgatflow installation (they were all up to date)
- `python setup.py develop` in each folder
- `source "../cgatflow/conda-install/etc/profile.d/conda.sh"`
- `conda activate base`
- `conda activate cgat-f`
- `python /gfs/devel/kralbrecht/cgatflow/cgat-flow/cgatpipelines/tools/pipeline_readqc.py make full`

Same error.

For the record:

    $ zcat fastqc_overrepresented_sequences.tsv.gz
    track   Duplication Level       Percentage of deduplicated      Percentage of total
    (end of file)

So one should be able to replicate the issue by creating a dummy file that contains only the header above, and running (in the cgat-f conda environment):

    $ zcat fastqc_overrepresented_sequences.tsv.gz | python -m cgatcore.csv2db  --retry  --database-url=sqlite:///./csvdb --add-index=track --table=fastqc_overrepresented_sequences
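For anyone reproducing this, the header-only file can be generated with a few lines of Python. This is only a sketch: the helper name `write_header_only` is made up for illustration, and it assumes the columns are tab-separated as in the `zcat` output above.

```python
import gzip

# Header copied from the zcat output above (assumed to be tab-separated).
HEADER = ["track", "Duplication Level",
          "Percentage of deduplicated", "Percentage of total"]


def write_header_only(path):
    """Write a gzipped TSV that contains a header line and no records."""
    with gzip.open(path, "wt") as handle:
        handle.write("\t".join(HEADER) + "\n")


# Hypothetical helper, used here to create the file named in the issue.
write_header_only("fastqc_overrepresented_sequences.tsv.gz")
```

Piping the resulting file through the command above should then hit the same `ValueError: table is empty`.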

Best, kevin

Acribbs commented 5 years ago

I can recreate the issue and it does indeed relate to an empty file, so I have added a check for empty files. Here is the branch: https://github.com/cgat-developers/cgat-flow/tree/AC-fix-table-load. Can you check whether you still have any issues? I think the touch may have to be replaced with a sqlite command to create an empty table, but I will do this once you have confirmed that the issue is fixed.

kevinrue commented 5 years ago

Thanks @Acribbs. In short: it seems to work. In long: loadFastQC completes, although as you pointed out, no table is generated, which may cause an issue downstream. I'll run the rest of the pipeline and report back here.

Again, as you said, it probably just needs an sqlite statement that makes a table with the right header but no record.

Cheers!

Acribbs commented 5 years ago

No worries @kevinrue, I will add the sqlite command if the rest of the pipeline passes, and then push.

Acribbs commented 5 years ago

@kevinrue creating empty tables in sqlite isn't supported, do you have a suggestion as to what I can add as a filler?

kevinrue commented 5 years ago

Thinking out loud, here is a sample record from a data set that has a non-empty table:

    sqlite> SELECT * FROM fastqc_overrepresented_sequences limit 1;
    track|Count|Duplication Level|Percentage|Percentage of deduplicated|Percentage of total|Possible Source|Sequence
    Benoist-GSE92597-Aire-ChIP-IgG-IlluminaMiSeq-GSM2433236-SRR5122567|27945||0.267658800112753|||No Hit|ACTTCCAGGGATTTATAAGCCGATGACGTCATAACATCCCTGACCCTTTA

Can we make up a dummy record that has NAs everywhere? If variable typing matters at all, we can put "na" and 0 in the right places, e.g.

    track|Count|Duplication Level|Percentage|Percentage of deduplicated|Percentage of total|Possible Source|Sequence
    na|0||0|||No Hit|na

?

Not useful: from a brief look at the FastQC HTML report, when there are no overrepresented sequences, the table is replaced by the statement "No overrepresented sequences".

EDIT: what I suggest is packaging a dummy table in the CGAT pipeline docs that mimics the FastQC overrepresented sequences output but has only a single row, with NAs and 0s everywhere. When fastqc_overrepresented_sequences.tsv.gz is empty, that dummy table would be piped instead into `... | python -m cgatcore.csv2db --retry --database-url=sqlite:///./csvdb --add-index=track --table=fastqc_overrepresented_sequences`.
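To make the idea concrete, here is a rough sketch of the fallback. It is not the code in the branch: `has_records` and `dummy_table` are hypothetical helpers, the column names are copied from the sqlite output above, and the placeholder values follow the dummy record suggested above.

```python
import gzip

# Column names taken from the sqlite output shown above.
COLUMNS = ["track", "Count", "Duplication Level", "Percentage",
           "Percentage of deduplicated", "Percentage of total",
           "Possible Source", "Sequence"]
# Placeholder values following the dummy record suggested above.
DUMMY_ROW = ["na", "0", "", "0", "", "", "No Hit", "na"]


def has_records(path):
    """Return True if the gzipped TSV has a non-blank line after the header."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line, if there is one
        return any(line.strip() for line in handle)


def dummy_table():
    """Return the text of a one-row placeholder table (header plus dummy row)."""
    return "\t".join(COLUMNS) + "\n" + "\t".join(DUMMY_ROW) + "\n"
```

The load step could then check `has_records(infile)` and, when it returns False, send the text from `dummy_table()` to csv2db in place of the real file.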

Acribbs commented 5 years ago

Hi @kevinrue, can you have a look at the change I made? I'm a little worried that we might get database locks because ruffus may send two statements at the same time. If you see this issue, we may have to think about how to overcome it.

kevinrue commented 5 years ago

Yep. Sorry for the delay: I just tried it and it works. The table appears in csvdb and the dummy row is there too. No database lock issues as far as I can see.

Thanks!

Acribbs commented 5 years ago

OK cool, thanks. If you experience a database lock going forward, let me know and I will add a random pause to the statement.
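For reference, the random pause could look something like the sketch below. It is an illustration only, not what the pipeline does: `run_with_retry` is a hypothetical helper, and it assumes the lock shows up as an `sqlite3.OperationalError` whose message contains "locked".

```python
import random
import sqlite3
import time


def run_with_retry(operation, attempts=5, max_pause=2.0):
    """Call operation(); on a lock error, sleep a random time and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as error:
            # Retry only on lock errors, and give up after the last attempt.
            if "locked" not in str(error) or attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, max_pause))
```

The random sleep staggers jobs that ruffus starts at the same moment, so they are less likely to collide again on the next attempt.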

jscaber commented 5 years ago

Unfortunately, the empty table command implemented in #70 does not seem to overwrite the table, which causes a "table already exists" error if the pipeline is rerun.
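One way to make the step safe to rerun is to drop any earlier copy of the table before recreating it. The sketch below uses Python's standard `sqlite3` module directly; `load_dummy_table` is a hypothetical helper for illustration and not necessarily how the pipeline handles it.

```python
import sqlite3


def load_dummy_table(database, table, columns, row):
    """Recreate `table` with one placeholder row, replacing any earlier copy."""
    quoted = ", ".join('"{}"'.format(name) for name in columns)
    marks = ", ".join("?" for _ in columns)
    connection = sqlite3.connect(database)
    try:
        with connection:  # commits on success, rolls back on error
            # Dropping first avoids "table already exists" on a rerun.
            connection.execute('DROP TABLE IF EXISTS "{}"'.format(table))
            connection.execute('CREATE TABLE "{}" ({})'.format(table, quoted))
            connection.execute(
                'INSERT INTO "{}" VALUES ({})'.format(table, marks), row)
    finally:
        connection.close()
```

Calling it twice against the same database leaves a single table with a single row, rather than failing on the second call.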

jscaber commented 5 years ago

Fixed.

cgat-developers / cgat-flow

pipeline_readqc.py fails for `ValueError: table is empty` with FastQC tables #68