collect_tfs.nf bug fixes

daisybio / TF-Prioritizer

Bioinformatics pipeline to identify differentially active transcription factors between conditions using expression and epigenetic data

GNU General Public License v3.0


collect_tfs.nf bug fixes #70

Closed LeonHafner closed 6 months ago

LeonHafner commented 6 months ago

Current bugs:

[x] collect_tfs.nf produces no output. With the current MM10 test data, only a single file arrives in the rankings channel: L1:L10_enhancers.ranking.tsv. Because Nextflow does not store a single file as a list, the rankings.name.join() operation does not work and produces an empty string.
[x] The final transcription factor list contains the elements AvgPeakDistance and AvgPeakSize. These seem to come from the output of STARE (e.g. enhancers_P6_TF_Gene_Affinities.txt), which appends three columns (NumPeaks, AvgPeakDistance and AvgPeakSize) to the matrix of gene-TF affinities.
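
For illustration, both bugs can be sketched in Python. This is only an analogy under stated assumptions, not the actual change made to collect_tfs.nf: the helper names (as_list, join_names, drop_non_tf) and the TF names in the usage note are hypothetical, and only the three STARE column names come from the description above.

```python
from pathlib import Path

# Columns that STARE appends after the TF affinity columns (names from the issue text).
NON_TF_COLUMNS = {"NumPeaks", "AvgPeakDistance", "AvgPeakSize"}


def as_list(value):
    """Wrap a single item in a list; turn lists and tuples into plain lists."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


def join_names(rankings, sep=" "):
    """Join file names, whether `rankings` is a single path or several."""
    return sep.join(Path(p).name for p in as_list(rankings))


def drop_non_tf(names):
    """Remove the STARE summary columns from a list of candidate TF names."""
    return [name for name in names if name not in NON_TF_COLUMNS]
```

With these hypothetical helpers, join_names("work/a.ranking.tsv") and join_names(["work/a.ranking.tsv"]) give the same result, and drop_non_tf(["STAT3", "NumPeaks", "GATA1"]) keeps only the two TF names.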

LeonHafner commented 6 months ago

I'm still wondering why we suddenly only produce a single output file as input for collect_tfs.nf and not one for every pair of L1, L10, P6 and P13. I'm curious if you have any insights on that @nictru? If not, I’ll start looking into it for debugging.

nictru commented 6 months ago

The purpose of collect_tfs.nf is to create a single, non-pairing-specific list of all transcription factors which need to be included in the final report. Based on this list, data about the transcription factors can then be collected and prepared for the report. We might not even need this any more since fetching of additional data will mainly be handled dynamically by the report.

In earlier versions, we would fetch chip-atlas bed files, binding motif logos etc. based on this list.

nictru commented 6 months ago

Changes look good so far

LeonHafner commented 6 months ago

The first line of the input files for collect_tfs.nf looks like this: sum mean q95 q99 median p-value rank dcg. This line was treated like a regular TF line. By skipping it, we prevent "sum" from being included in the TF list.
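
A minimal Python sketch of the header skip, assuming the TF name is the first whitespace-separated field of every data line. The function names and the sample TF rows are hypothetical; the real script in collect_tfs.nf may look different.

```python
import io


def read_tf_names(handle):
    """Return the first field of each data line, skipping the header line."""
    next(handle, None)  # discard the header ("sum mean q95 ...")
    names = []
    for line in handle:
        fields = line.split()
        if fields:  # ignore blank lines
            names.append(fields[0])
    return names


def collect_tfs(handles):
    """Merge the TF names from several ranking files into one sorted list."""
    tfs = set()
    for handle in handles:
        tfs.update(read_tf_names(handle))
    return sorted(tfs)


# Hypothetical ranking file content; only the header line is taken from the issue.
sample = (
    "sum\tmean\tq95\tq99\tmedian\tp-value\trank\tdcg\n"
    "STAT3\t4.0\t1.0\t2.0\t3.0\t1.0\t0.01\t1\t1.0\n"
    "GATA1\t2.0\t0.5\t1.0\t1.5\t0.5\t0.02\t2\t0.6\n"
)
tf_list = collect_tfs([io.StringIO(sample)])
```

Without the next(handle, None) call, "sum" would be read as the first TF name, which is the behaviour described above.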

LeonHafner commented 6 months ago

> I'm still wondering why we suddenly only produce a single output file as input for collect_tfs.nf and not one for every pair of L1, L10, P6 and P13. I'm curious if you have any insights on that @nictru? If not, I’ll start looking into it for debugging.

This was due to a mismatch of condition labels ("P6" vs. "p6"), which caused the subsequent matching to fail. Solved by adjusting the pipeline input file "bam_design2.tsv" for ChromHMM. No further changes necessary.
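
A check like the following could catch this kind of mismatch early. It is a hedged sketch: the function name case_mismatches is hypothetical, and it only assumes that the condition labels from two input files are available as lists of strings.

```python
def case_mismatches(expected, observed):
    """Return (observed, expected) label pairs that match only when case is ignored."""
    exact = set(expected)
    by_lower = {label.lower(): label for label in expected}
    pairs = []
    for label in observed:
        # An exact match is fine; a match only after lowercasing is suspicious.
        if label not in exact and label.lower() in by_lower:
            pairs.append((label, by_lower[label.lower()]))
    return pairs


# Illustrative label lists built from the condition names mentioned in this thread.
suspicious = case_mismatches(["L1", "L10", "P6", "P13"], ["L1", "L10", "p6", "P13"])
```

Here suspicious contains the single pair ("p6", "P6"), so the mismatch shows up before any pairing silently drops a condition.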