CGATOxford / CGATPipelines

Collection of CGAT NGS Pipelines
MIT License

pipeline readqc fails to load FASTQC metrics of trimmed files? #424

Closed: kevinrue closed this issue 6 years ago

kevinrue commented 6 years ago

Hi,

Having enabled Trimmomatic as a preprocessor in pipeline_readqc, I found that while trimmed-WTCHG_45714_249_2_sequence.short_fastqc.load appears in the processed.dir folder, the csvdb tables associated with the FastQC metrics of those trimmed samples don't appear in the database. Note that the individual FastQC reports do show up in the export directory, and that the trimmed samples do appear in the 'general' csvdb tables basic_statistics, experiment_per_sequence_quality, and status_summary. It is really only the tables of individual FastQC metrics for the trimmed samples that are missing from csvdb.

For a reproducible example, I have symlinked a couple of FASTQ files from the cgat/tests folder into a new empty working directory:

WTCHG_45714_249_1_sequence.short.fastq.gz -> [...]/cgat/tests/fastq2fastq.py/WTCHG_45714_249_1_sequence.short.fastq.gz
WTCHG_45714_249_2_sequence.short.fastq.gz -> [...]/cgat/tests/fastq2fastq.py/WTCHG_45714_249_2_sequence.short.fastq.gz

Following python [...]/pipeline_readqc.py config, I have edited the configuration with:

preprocessors=trimmomatic
contaminants_path=[...]/fastqc-0.11.4/Configuration/contaminant_list.txt
[fastq_screen]
run=0

And then I ran python [...]/pipeline_readqc.py config make full.

Note also that it doesn't make a difference if I first run the pipeline without any preprocessor and then run it again with the trimmomatic preprocessor enabled: the tables are still missing from csvdb even though the pipeline target files show up.

Not sure if I stumbled onto a bug or if I am missing something obvious.

Thanks in advance! Kevin

Acribbs commented 6 years ago

Hi Kevin,

I can look into this tomorrow. Just for clarity, did you run python [...]/pipeline_readqc.py config make full or python [...]/pipeline_readqc.py make full?

kevinrue commented 6 years ago

It was only a typo in my message above, due to a hasty copy-paste when I was opening the issue.

To run the pipeline, I did the correct python [...]/pipeline_readqc.py make full, without config.

Acribbs commented 6 years ago

@kevinrue are you able to post the contents of the trimmed-WTCHG_45714_249_2_sequence.short_fastqc.load file, please?

kevinrue commented 6 years ago

Sorry, just out of lab meeting now.

It's an empty file. See P.touch(outfile) https://github.com/CGATOxford/CGATPipelines/blob/master/CGATPipelines/pipeline_readqc.py#L387
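For context, that is the pattern at work: the *.load target is an empty sentinel file, so its presence only means the load task ran, not that any rows reached csvdb. A minimal sketch of the idea (the function and helper details are illustrative, not the actual pipeline code; only the touch-the-outfile step comes from the linked line):

from pathlib import Path

def load_fastqc_metrics(infile, outfile):
    # Illustrative only: load each FastQC metric table into csvdb here ...
    # The ruffus target itself is just an empty flag file, analogous to
    # P.touch(outfile) in pipeline_readqc.py, so an existing *.load file
    # does not guarantee the tables were actually created.
    Path(outfile).touch()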

Acribbs commented 6 years ago

I'm going to hold off looking into this; once I have a working environment for the new cgat-developers repositories, I will test this over there.

sebastian-luna-valero commented 6 years ago

Hi @kevinrue

Would you be able to test the fix in https://github.com/CGATOxford/CGATPipelines/pull/426 please?

Best regards, Sebastian

kevinrue commented 6 years ago

I'll pull up a data set and test ASAP. Thanks for looking into this!

sebastian-luna-valero commented 6 years ago

Thank you!

kevinrue commented 6 years ago

I think I've updated everything correctly and still don't see the tables in the database, but I rushed it a little, so let me do it again calmly another time and I'll tell you for sure. Cheers

sebastian-luna-valero commented 6 years ago

The fix is on branch issue-424.

Running the reproducible example mentioned above works on our end, but it would be great to confirm with other input data as well.

Many thanks!

kevinrue commented 6 years ago

Thanks. I've made sure to check out the branch issue-424. I've even run python setup.py develop, in case that could make a difference.

Actually, just looking at the code right now, I realise that your updates only affect buildExperimentLevelReadQuality. That said, I do see experiment.dir/trimmed-curdlan_bm_lthsc_R1_per_sequence_quality.tsv appear. However, that wasn't the original point of this issue; in fact, I cannot even remember looking at the content of experiment.dir.

What I was hoping to see are trimmed-...fastqc_Adapter_Content tables in the csvdb, considering that runFastqc and loadFastqc are supposed to process the outputs of processReads as well as the raw FASTQ files. See https://github.com/CGATOxford/CGATPipelines/blob/issue-424/CGATPipelines/pipeline_readqc.py#L342
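The wiring I would expect looks roughly like this ruffus sketch (purely illustrative; the glob patterns, regex and decorators are my guesses, not the actual pipeline_readqc.py source):

from ruffus import follows, regex, transform

@follows("processReads")
@transform(["*.fastq.gz", "processed.dir/trimmed-*.fastq.gz"],
           regex(r"(?:processed\.dir/)?(.*)\.fastq\.gz"),
           r"\1.fastqc")
def runFastqc(infile, outfile):
    # Run FastQC on both the raw and the trimmed reads (body omitted),
    # so that loadFastqc later creates a set of trimmed-*_fastqc_* tables too.
    pass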

Let me know if you need more or clearer details about the issue I'm describing.

sebastian-luna-valero commented 6 years ago

Thanks, Kevin.

Could you please run the code with the reproducible example above?

Here are the tables I get after the first pass of pipeline_readqc:

sqlite> .tables
WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content             
WTCHG_45714_249_1_sequence_short_fastqc_Kmer_Content                
WTCHG_45714_249_1_sequence_short_fastqc_Overrepresented_sequences   
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_N_content          
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_content   
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_quality   
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_GC_content     
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_quality_scores 
WTCHG_45714_249_1_sequence_short_fastqc_Per_tile_sequence_quality   
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Duplication_Levels 
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_1_sequence_short_fastqc_status                      
WTCHG_45714_249_2_sequence_short_fastqc_Adapter_Content             
WTCHG_45714_249_2_sequence_short_fastqc_Kmer_Content                
WTCHG_45714_249_2_sequence_short_fastqc_Overrepresented_sequences   
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_N_content          
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_content   
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_quality   
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_GC_content     
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_quality_scores 
WTCHG_45714_249_2_sequence_short_fastqc_Per_tile_sequence_quality   
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Duplication_Levels 
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_2_sequence_short_fastqc_status                      
basic_statistics_summary                                            
experiment_per_sequence_quality                                     

Could you also post the output of grep Task pipeline.log?

Best regards, Sebastian

kevinrue commented 6 years ago

Hi @sebastian-luna-valero

I have run python ... make full -v 5

I get the same tables as you:

sqlite> .tables
WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content             
WTCHG_45714_249_1_sequence_short_fastqc_Kmer_Content                
WTCHG_45714_249_1_sequence_short_fastqc_Overrepresented_sequences   
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_N_content          
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_content   
WTCHG_45714_249_1_sequence_short_fastqc_Per_base_sequence_quality   
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_GC_content     
WTCHG_45714_249_1_sequence_short_fastqc_Per_sequence_quality_scores 
WTCHG_45714_249_1_sequence_short_fastqc_Per_tile_sequence_quality   
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Duplication_Levels 
WTCHG_45714_249_1_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_1_sequence_short_fastqc_status                      
WTCHG_45714_249_2_sequence_short_fastqc_Adapter_Content             
WTCHG_45714_249_2_sequence_short_fastqc_Kmer_Content                
WTCHG_45714_249_2_sequence_short_fastqc_Overrepresented_sequences   
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_N_content          
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_content   
WTCHG_45714_249_2_sequence_short_fastqc_Per_base_sequence_quality   
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_GC_content     
WTCHG_45714_249_2_sequence_short_fastqc_Per_sequence_quality_scores 
WTCHG_45714_249_2_sequence_short_fastqc_Per_tile_sequence_quality   
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Duplication_Levels 
WTCHG_45714_249_2_sequence_short_fastqc_Sequence_Length_Distribution
WTCHG_45714_249_2_sequence_short_fastqc_status                      
basic_statistics_summary                                            
experiment_per_sequence_quality                                     
status_summary                  

The output of grep Task pipeline.log is

$ grep Task pipeline.log
2018-06-08 09:56:33,596 INFO task.pipeline_run.5661 Tasks which are up-to-date:
2018-06-08 09:56:33,597 INFO task.pipeline_run.5661 Task = 'unprocessReads'
2018-06-08 09:56:33,599 INFO task.pipeline_run.5666 Tasks which will be run:
2018-06-08 09:56:33,630 INFO task.log_at_level.256 Task enters queue = "mkdir('export')   before runFastqc "
2018-06-08 09:56:34,642 INFO task.log_at_level.256 Task enters queue = "mkdir('export/fastqc') #2   before runFastqc "
2018-06-08 09:56:35,654 INFO task.log_at_level.256 Task enters queue = "mkdir('reconciled.dir')   before reconcileReads "
2018-06-08 09:56:36,666 INFO task.log_at_level.256 Task enters queue = "mkdir('processed.dir')   before processReads "
2018-06-08 09:56:37,678 INFO task.log_at_level.256 Task enters queue = "mkdir('fasta.dir')   before aggregateAdaptors "
2018-06-08 09:56:38,690 INFO task.log_at_level.256 Task enters queue = "mkdir('experiment.dir')   before buildExperimentLevelReadQuality "
2018-06-08 09:56:39,703 INFO task.log_at_level.256 Task enters queue = "mkdir('export')   before runFastqScreen "
2018-06-08 09:56:40,715 INFO task.log_at_level.256 Task enters queue = "mkdir('export/fastq_screen') #2   before runFastqScreen "
2018-06-08 09:56:41,732 INFO task.log_at_level.256 Completed Task = "mkdir('export')   before runFastqc "
2018-06-08 09:56:41,733 INFO task.log_at_level.256 Completed Task = "mkdir('export/fastqc') #2   before runFastqc "
2018-06-08 09:56:41,735 INFO task.log_at_level.256 Completed Task = "mkdir('reconciled.dir')   before reconcileReads "
2018-06-08 09:56:41,736 INFO task.log_at_level.256 Completed Task = "mkdir('processed.dir')   before processReads "
2018-06-08 09:56:41,737 INFO task.log_at_level.256 Completed Task = "mkdir('fasta.dir')   before aggregateAdaptors "
2018-06-08 09:56:41,737 INFO task.log_at_level.256 Task enters queue = 'aggregateAdaptors'
2018-06-08 09:56:42,752 INFO task.log_at_level.256 Completed Task = "mkdir('experiment.dir')   before buildExperimentLevelReadQuality "
2018-06-08 09:56:42,753 INFO task.log_at_level.256 Completed Task = "mkdir('export')   before runFastqScreen "
2018-06-08 09:56:42,754 INFO task.log_at_level.256 Completed Task = "mkdir('export/fastq_screen') #2   before runFastqScreen "
2018-06-08 09:56:42,854 INFO task.log_at_level.256 Completed Task = 'aggregateAdaptors'
2018-06-08 09:56:42,854 INFO task.log_at_level.256 Task enters queue = 'processReads'
2018-06-08 09:56:54,000 INFO task.log_at_level.256 Completed Task = 'processReads'
2018-06-08 09:56:54,000 INFO task.log_at_level.256 Inactive Task = 'reconcileReads'
2018-06-08 09:56:54,001 INFO task.log_at_level.256 Task enters queue = 'runFastqc'
2018-06-08 09:56:55,022 INFO task.log_at_level.256 Inactive Task = 'runFastqScreen'
2018-06-08 09:57:05,187 INFO task.log_at_level.256 Completed Task = 'runFastqc'
2018-06-08 09:57:05,187 INFO task.log_at_level.256 Task enters queue = 'loadFastqc'
2018-06-08 09:57:06,207 INFO task.log_at_level.256 Task enters queue = 'buildFastQCSummaryStatus'
2018-06-08 09:57:07,232 INFO task.log_at_level.256 Completed Task = 'loadFastqc'
2018-06-08 09:57:07,232 INFO task.log_at_level.256 Task enters queue = 'buildFastQCSummaryBasicStatistics'
2018-06-08 09:57:08,244 INFO task.log_at_level.256 Task enters queue = 'buildExperimentLevelReadQuality'
2018-06-08 09:57:09,272 INFO task.log_at_level.256 Completed Task = 'buildFastQCSummaryStatus'
2018-06-08 09:57:09,274 INFO task.log_at_level.256 Completed Task = 'buildFastQCSummaryBasicStatistics'
2018-06-08 09:57:09,274 INFO task.log_at_level.256 Task enters queue = 'loadFastqcSummary'
2018-06-08 09:57:10,306 INFO task.log_at_level.256 Completed Task = 'buildExperimentLevelReadQuality'
2018-06-08 09:57:10,307 INFO task.log_at_level.256 Task enters queue = 'combineExperimentLevelReadQualities'
2018-06-08 09:57:14,611 INFO task.log_at_level.256 Completed Task = 'loadFastqcSummary'
2018-06-08 09:57:21,357 INFO task.log_at_level.256 Completed Task = 'combineExperimentLevelReadQualities'
2018-06-08 09:57:21,358 INFO task.log_at_level.256 Task enters queue = 'loadExperimentLevelReadQualities'
2018-06-08 09:57:23,720 INFO task.log_at_level.256 Completed Task = 'loadExperimentLevelReadQualities'
2018-06-08 09:57:23,720 INFO task.log_at_level.256 Task enters queue = 'full'
2018-06-08 09:57:24,733 INFO task.log_at_level.256 Completed Task = 'full'

Note that my point is that after enabling trimmomatic as a preprocessor, I expect twice as many tables in csvdb, with half of them named 'trimmed...'. However, whatever I do, the additional 'trimmed...' tables never show up (I even tried calling python ... make full a second time, with no effect).

Is that normal?
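For concreteness, a quick way to check for them (a minimal standard-library sketch; the assumption that the loaded table names start with a trimmed prefix after sanitisation is mine):

import sqlite3

con = sqlite3.connect("csvdb")
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
fastqc = [t for t in tables if "_fastqc_" in t]
trimmed = [t for t in fastqc if t.startswith("trimmed")]
# With a preprocessor enabled I would expect roughly half of the per-metric
# fastqc tables to carry the trimmed prefix, but this list comes back empty.
print(len(fastqc), "per-metric fastqc tables,", len(trimmed), "with a trimmed prefix")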

sebastian-luna-valero commented 6 years ago

Thanks, Kevin.

I can reproduce the issue and I have tried to fix it in https://github.com/CGATOxford/CGATPipelines/pull/426/commits/b3f044c502720360cdc1c9365719cdeb3c071de7. Please pull and let me know how it goes.

Best regards, Sebastian

AndreasHeger commented 6 years ago

Hi all, I am just about to refactor this for the single-cell work in the cgatflow repos, so that there will be only a single table for each metric, such as fastqc_Adapter_Content, containing all the data tracks. The table-per-track design in this pipeline (and in others) does not work very well for hundreds of tracks. Bad design decision on my part, apologies.
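For illustration, the proposed layout would look something like the sketch below: one table per metric keyed by a track column, instead of one table per track (the column names are illustrative, this is not the cgatflow code):

import sqlite3

con = sqlite3.connect(":memory:")
# Current pipeline_readqc layout: one table per track, e.g.
#   WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content
#   trimmed_WTCHG_45714_249_1_sequence_short_fastqc_Adapter_Content
# Proposed layout: a single fastqc_Adapter_Content table holding every track.
con.execute("""
    CREATE TABLE fastqc_Adapter_Content (
        track    TEXT,   -- e.g. 'trimmed-WTCHG_45714_249_1_sequence.short'
        position TEXT,   -- FastQC reports position ranges such as '1-2'
        value    REAL    -- illustrative; the real report has one column per adapter
    )
""")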

TomSmithCGAT commented 6 years ago

I think you might be being a bit hard on yourself there, Andreas. Your design pre-dated single-cell workflows by some way! 🔮

kevinrue commented 6 years ago

Hi @sebastian-luna-valero, sorry for the delay. I just tried it now on a 'real' project (i.e., not the cgat test files) and the 'trimmed-*' tables show up in csvdb. As far as I'm concerned, the issue is resolved. I'll leave you the pleasure of closing it when you merge into master.

Cheers!

sebastian-luna-valero commented 6 years ago

Thanks for confirming @kevinrue !

@AndreasHeger that's great news!