Closed kevinrue closed 6 years ago
Hi Kevin,
I can look into this tomorrow. Just for clarity, did you run `python [...]/pipeline_readqc.py config make full` or `python [...]/pipeline_readqc.py make full`?
It was only a typo in my message above, due to a hasty copy-paste when I was opening the issue.
To run the pipeline, I did use the correct `python [...]/pipeline_readqc.py make full`, without `config`.
@kevinrue are you able to post the contents of the `trimmed-WTCHG_45714_249_2_sequence.short_fastqc.load` file please?
Sorry, just out of lab meeting now.
It's an empty file. See the `P.touch(outfile)` call at https://github.com/CGATOxford/CGATPipelines/blob/master/CGATPipelines/pipeline_readqc.py#L387
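For reference, `P.touch` in the CGAT pipelines just creates an empty sentinel file so the workflow engine considers the task complete, which is why the `.load` file has no content. A minimal stand-alone sketch of that behaviour (not the CGAT implementation itself):

```python
import os
import tempfile

def touch(path):
    """Create `path` if missing, else update its timestamp (like shell `touch`)."""
    with open(path, "a"):
        os.utime(path, None)

# The sentinel .load file ends up present but empty.
sentinel = os.path.join(tempfile.mkdtemp(), "trimmed-example.short_fastqc.load")
touch(sentinel)
```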
I'm going to hold off looking into this; once I have a working environment for the new cgat-developers repos I will test this over there.
Hi @kevinrue
Would you be able to test the fix in https://github.com/CGATOxford/CGATPipelines/pull/426 please?
Best regards, Sebastian
I'll pull up a data set and test ASAP. Thanks for looking into this!
Thank you!
I think I've updated everything correctly, and I still don't see the tables in the database. But I rushed it a little bit, so let me do it again calmly another time and I'll tell you for sure. Cheers
The fix is on branch `issue-424`.
Running the reproducible example mentioned above works on our end, but it would be great to confirm with other input data as well.
Many thanks!
Thanks.
I've made sure to check out the branch `issue-424`.
I've even run `python setup.py develop`, in case that could have made a difference.
Actually, just looking at the code right now, I realise that your updates only affect `buildExperimentReadQuality`.
That said, I do see `experiment.dir/trimmed-curdlan_bm_lthsc_R1_per_sequence_quality.tsv` appear.
However, that wasn't the original point of this issue; in fact, I cannot even remember looking at the content of `experiment.dir`.
What I was hoping to see are `trimmed-...fastqc_Adapter_Content` tables in the `csvdb`, considering that `runFastqc` and `loadFastqc` are supposed to process the outputs of `processReads` as well as the raw FASTQ files.
See https://github.com/CGATOxford/CGATPipelines/blob/issue-424/CGATPipelines/pipeline_readqc.py#L342
Let me know if you need more/clearer details about the issue I'm after.
Thanks, Kevin.
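As a quick way to check which (if any) `trimmed-*` tables actually made it into the database, one can query `sqlite_master` directly. A small sketch, assuming the SQLite database is the `csvdb` file in the working directory:

```python
import sqlite3

def list_tables(db_path, pattern="trimmed%"):
    """Return the names of tables in `db_path` whose name matches `pattern` (SQL LIKE)."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name LIKE ?",
            (pattern,),
        ).fetchall()
    return sorted(name for (name,) in rows)

# e.g. list_tables("csvdb") should list the trimmed-* FastQC tables
# once loadFastqc has processed the trimmomatic outputs.
```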
Could you please run the code with the reproducible example above?
Here are the tables I get after the first pass of `pipeline_readqc`:
sqlite> .tables
WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content
WTCHG_45714_249_1_sequence_short_fastqc_Kmer_Content
WTCHG_45714_249_1_sequence_short_fastqc_Overrepresented_sequences
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_N_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_quality
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_GC_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_quality_scores
WTCHG_45714_249_1_sequence_short_fastqc_Per_tile_sequence_quality
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Duplication_Levels
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_1_sequence_short_fastqc_status
WTCHG_45714_249_2_sequence_short_fastqc_Adapter_Content
WTCHG_45714_249_2_sequence_short_fastqc_Kmer_Content
WTCHG_45714_249_2_sequence_short_fastqc_Overrepresented_sequences
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_N_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_quality
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_GC_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_quality_scores
WTCHG_45714_249_2_sequence_short_fastqc_Per_tile_sequence_quality
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Duplication_Levels
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_2_sequence_short_fastqc_status
basic_statistics_summary
experiment_per_sequence_quality
Can you also post the output of `grep Task pipeline.log`?
Best regards, Sebastian
Hi @sebastian-luna-valero
I have run `python ... make full -v 5` and I get the same tables as you:
sqlite> .tables
WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content
WTCHG_45714_249_1_sequence_short_fastqc_Kmer_Content
WTCHG_45714_249_1_sequence_short_fastqc_Overrepresented_sequences
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_N_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_quality
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_GC_content
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_quality_scores
WTCHG_45714_249_1_sequence_short_fastqc_Per_tile_sequence_quality
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Duplication_Levels
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_1_sequence_short_fastqc_status
WTCHG_45714_249_2_sequence_short_fastqc_Adapter_Content
WTCHG_45714_249_2_sequence_short_fastqc_Kmer_Content
WTCHG_45714_249_2_sequence_short_fastqc_Overrepresented_sequences
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_N_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_quality
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_GC_content
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_quality_scores
WTCHG_45714_249_2_sequence_short_fastqc_Per_tile_sequence_quality
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Duplication_Levels
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_2_sequence_short_fastqc_status
basic_statistics_summary
experiment_per_sequence_quality
status_summary
The output of `grep Task pipeline.log` is:
$ grep Task pipeline.log
2018-06-08 09:56:33,596 INFO task.pipeline_run.5661 Tasks which are up-to-date:
2018-06-08 09:56:33,597 INFO task.pipeline_run.5661 Task = 'unprocessReads'
2018-06-08 09:56:33,599 INFO task.pipeline_run.5666 Tasks which will be run:
2018-06-08 09:56:33,630 INFO task.log_at_level.256 Task enters queue = "mkdir('export') before runFastqc "
2018-06-08 09:56:34,642 INFO task.log_at_level.256 Task enters queue = "mkdir('export/fastqc') #2 before runFastqc "
2018-06-08 09:56:35,654 INFO task.log_at_level.256 Task enters queue = "mkdir('reconciled.dir') before reconcileReads "
2018-06-08 09:56:36,666 INFO task.log_at_level.256 Task enters queue = "mkdir('processed.dir') before processReads "
2018-06-08 09:56:37,678 INFO task.log_at_level.256 Task enters queue = "mkdir('fasta.dir') before aggregateAdaptors "
2018-06-08 09:56:38,690 INFO task.log_at_level.256 Task enters queue = "mkdir('experiment.dir') before buildExperimentLevelReadQuality "
2018-06-08 09:56:39,703 INFO task.log_at_level.256 Task enters queue = "mkdir('export') before runFastqScreen "
2018-06-08 09:56:40,715 INFO task.log_at_level.256 Task enters queue = "mkdir('export/fastq_screen') #2 before runFastqScreen "
2018-06-08 09:56:41,732 INFO task.log_at_level.256 Completed Task = "mkdir('export') before runFastqc "
2018-06-08 09:56:41,733 INFO task.log_at_level.256 Completed Task = "mkdir('export/fastqc') #2 before runFastqc "
2018-06-08 09:56:41,735 INFO task.log_at_level.256 Completed Task = "mkdir('reconciled.dir') before reconcileReads "
2018-06-08 09:56:41,736 INFO task.log_at_level.256 Completed Task = "mkdir('processed.dir') before processReads "
2018-06-08 09:56:41,737 INFO task.log_at_level.256 Completed Task = "mkdir('fasta.dir') before aggregateAdaptors "
2018-06-08 09:56:41,737 INFO task.log_at_level.256 Task enters queue = 'aggregateAdaptors'
2018-06-08 09:56:42,752 INFO task.log_at_level.256 Completed Task = "mkdir('experiment.dir') before buildExperimentLevelReadQuality "
2018-06-08 09:56:42,753 INFO task.log_at_level.256 Completed Task = "mkdir('export') before runFastqScreen "
2018-06-08 09:56:42,754 INFO task.log_at_level.256 Completed Task = "mkdir('export/fastq_screen') #2 before runFastqScreen "
2018-06-08 09:56:42,854 INFO task.log_at_level.256 Completed Task = 'aggregateAdaptors'
2018-06-08 09:56:42,854 INFO task.log_at_level.256 Task enters queue = 'processReads'
2018-06-08 09:56:54,000 INFO task.log_at_level.256 Completed Task = 'processReads'
2018-06-08 09:56:54,000 INFO task.log_at_level.256 Inactive Task = 'reconcileReads'
2018-06-08 09:56:54,001 INFO task.log_at_level.256 Task enters queue = 'runFastqc'
2018-06-08 09:56:55,022 INFO task.log_at_level.256 Inactive Task = 'runFastqScreen'
2018-06-08 09:57:05,187 INFO task.log_at_level.256 Completed Task = 'runFastqc'
2018-06-08 09:57:05,187 INFO task.log_at_level.256 Task enters queue = 'loadFastqc'
2018-06-08 09:57:06,207 INFO task.log_at_level.256 Task enters queue = 'buildFastQCSummaryStatus'
2018-06-08 09:57:07,232 INFO task.log_at_level.256 Completed Task = 'loadFastqc'
2018-06-08 09:57:07,232 INFO task.log_at_level.256 Task enters queue = 'buildFastQCSummaryBasicStatistics'
2018-06-08 09:57:08,244 INFO task.log_at_level.256 Task enters queue = 'buildExperimentLevelReadQuality'
2018-06-08 09:57:09,272 INFO task.log_at_level.256 Completed Task = 'buildFastQCSummaryStatus'
2018-06-08 09:57:09,274 INFO task.log_at_level.256 Completed Task = 'buildFastQCSummaryBasicStatistics'
2018-06-08 09:57:09,274 INFO task.log_at_level.256 Task enters queue = 'loadFastqcSummary'
2018-06-08 09:57:10,306 INFO task.log_at_level.256 Completed Task = 'buildExperimentLevelReadQuality'
2018-06-08 09:57:10,307 INFO task.log_at_level.256 Task enters queue = 'combineExperimentLevelReadQualities'
2018-06-08 09:57:14,611 INFO task.log_at_level.256 Completed Task = 'loadFastqcSummary'
2018-06-08 09:57:21,357 INFO task.log_at_level.256 Completed Task = 'combineExperimentLevelReadQualities'
2018-06-08 09:57:21,358 INFO task.log_at_level.256 Task enters queue = 'loadExperimentLevelReadQualities'
2018-06-08 09:57:23,720 INFO task.log_at_level.256 Completed Task = 'loadExperimentLevelReadQualities'
2018-06-08 09:57:23,720 INFO task.log_at_level.256 Task enters queue = 'full'
2018-06-08 09:57:24,733 INFO task.log_at_level.256 Completed Task = 'full'
Note that my point is that after enabling 'trimmomatic' as a preprocessor, I expect twice as many tables in csvdb, with half of them named 'trimmed...'.
However, whatever I do, the additional 'trimmed...' tables never show up. (I even tried calling `python ... make full` a second time, without effect.)
Is that normal?
Thanks, Kevin.
I can reproduce the issue and I have tried to fix it in https://github.com/CGATOxford/CGATPipelines/pull/426/commits/b3f044c502720360cdc1c9365719cdeb3c071de7. Please pull and let me know how it goes.
Best regards, Sebastian
Hi all, I am just about to refactor this for single cell work in the cgatflow repos such that there will only be a single table for each metric, such as fastqc_Adapter_Content, containing all the data tracks. The table-per-track design in this pipeline (and in others) does not work very well for 100s of tracks. Bad design decision from me, apologies.
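The refactor described above can be illustrated with a small sketch (hypothetical schema, not the actual cgatflow code): instead of one `<track>_fastqc_Adapter_Content` table per track, all tracks share one `fastqc_Adapter_Content` table with a `track` column, which scales much better to hundreds of tracks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One shared table per metric, with the track as a column,
# instead of one table per (track, metric) pair.
conn.execute(
    "CREATE TABLE fastqc_Adapter_Content (track TEXT, position INTEGER, value REAL)"
)
rows = [
    ("WTCHG_45714_249_1", 1, 0.01),
    ("trimmed-WTCHG_45714_249_1", 1, 0.0),
]
conn.executemany("INSERT INTO fastqc_Adapter_Content VALUES (?, ?, ?)", rows)

# Selecting one track becomes a WHERE clause rather than a table lookup.
trimmed = conn.execute(
    "SELECT value FROM fastqc_Adapter_Content WHERE track LIKE 'trimmed-%'"
).fetchall()
```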
I think you might be being a bit hard on yourself there Andreas. Your design pre-dated single cell workflow by some way! 🔮
Hi @sebastian-luna-valero ,
Sorry for the delay. I just tried now on a 'real' project (i.e., not the cgat test files) and the `trimmed-*` tables show up in the csvdb.
As far as I'm concerned, the issue is resolved. I'll leave you the pleasure of closing the issue when you merge into `master`.
Cheers!
Thanks for confirming @kevinrue !
@AndreasHeger that's great news!
Hi,

Having enabled Trimmomatic as a preprocessor in `pipeline_readqc`, I found that while `trimmed-WTCHG_45714_249_2_sequence.short_fastqc.load` appears in the `processed.dir` folder, the csvdb tables associated with the FastQC metrics of those trimmed samples don't appear in the database. Note that the individual FastQC reports show up in the `export` directory, and that the trimmed samples do show up in the 'general' csvdb tables `basic_statistics`, `experiment_per_sequence_quality`, and `status_summary`. It is really only the tables of individual FastQC metrics for the trimmed samples that don't show up in csvdb.

For a reproducible example, I have symlinked a couple of FASTQ files from the `cgat/tests` folder into a new empty working directory:

Following `python [...]/pipeline_readqc.py config`, I have edited the configuration with:

And then I ran `python [...]/pipeline_readqc.py config make full`.

Note also that it doesn't make a difference if I first run the pipeline without any preprocessor, and then run it again enabling the `trimmomatic` preprocessor. Tables are still missing from csvdb while the pipeline target files show up.

Not sure if I stumbled onto a bug or if I am missing something obvious.

Thanks in advance! Kevin
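For completeness, the symlink setup from the reproducible example above can be sketched in Python; the source paths and file names below are placeholders, not the actual locations of the cgat test files:

```python
import os

# Placeholder paths: point these at the real FASTQ files under cgat/tests.
sources = [
    "/path/to/cgat/tests/sample_R1.fastq.gz",
    "/path/to/cgat/tests/sample_R2.fastq.gz",
]
workdir = "readqc_test"
os.makedirs(workdir, exist_ok=True)
for src in sources:
    dst = os.path.join(workdir, os.path.basename(src))
    # lexists also catches broken symlinks, so reruns don't fail
    if not os.path.lexists(dst):
        os.symlink(src, dst)
```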