MalteThodberg / CAGEfightR

Analysis of Cap Analysis of Gene Expression (CAGE) data using Bioconductor
GNU General Public License v3.0
8 stars 2 forks source link

BC calculation stopped because some sites had negative pooled signal! #1

Open bkaczkowski opened 5 years ago

bkaczkowski commented 5 years ago

Hi Malte, I'm getting an error (see below) when using quickEnhancers function. The quickTSSs function worked without an issue. Best Bogumil

enhancers <- quickEnhancers(CTSSs)

  • Running calcTPM and calcPooled: Calculating library sizes... Calculating TPM...
MalteThodberg commented 5 years ago

Hello Bogumil, glad to see you are trying out CAGEfightR!

It looks like CAGEfightR caught an error trying to process the pooled signal. It's trying to avoid dividing by a negative number when calculating the balance for score for enhancer prediction.

It's hard for me to know what's going wrong without the actual files. I suspect it could be something with the input CTSSs leading to a pooled signal with negative values.

You could try running calcTPM and calcPooled separately and then check how the score column looks - hopefully all CTSSs should have a non-negative pooled TPM!

bkaczkowski commented 5 years ago

Yes looks like I have a problem with my bigwig files [we normally don't use this format around here much :) ]. I actually used CAGEr's function to do it [ exportCTSStoBedGraph(ce, values = "raw", format = "BigWig") ] but that didn't go well. Could you point me to a tested method to generate bigwigs that work well with your package (it would be nice to add to the tutorial too). Did you use bamCoverage from deeptools?

MalteThodberg commented 5 years ago

Yes, many people use BED-files for CTSSs, however BigWigs have the advantage that they store chromosome lengths + allow for random access and thereby parallel processing.

I'm working on a major update to CAGEfightR that will allow quantifyCTSSs to read bedGraph-files + checkCTSSs function that will check the input files for any inconsistencies.

If you have BED-files with CTSSs (e.g. from CAGEr), you can read them into a GRanges using rtracklayer::import, split them by strand, and then write them to BigWig-files using rtracklayer::export.

We had BAM-to-BigWig converter based on RSamTools/GenomicAlignments in the works, but it's hard to make a consistent function that handles the many variations of CAGE and other types of 5'-end data. We've had a couple of request for it already, we might add in a future update.

bkaczkowski commented 5 years ago

Ok, thanks I will try the option with rtracklayer::import / rtracklayer::export and will let you know how it went

bkaczkowski commented 5 years ago

Hello again, rtracklayer::import / rtracklayer::export did the trick.

christyvj commented 4 years ago

Hi Malte and bkaczkowski,

I too am having trouble generating bigwig files that work with quickEnhancers(). I have started with bam files (processed by a collaborator so I cannot get the raw fastq files) and have tried multiple ways of generating bigwig files to no avail. The method that has got me furthest is uploading the bam files into CAGEr and then exporting them as bedGraphs and using bedGraphToBigWig to generate the bigwig files. These files work with no errors for quickTSSs(), but I get the following error when running quickEnhancers():

Error in (function (PD, MD, PU, MU) : BC calculation stopped because some sites had negative pooled signal!

Some googling led me to this discussion and prompted me to run checkPooled() which gave me:

Error in checkPooled(CTSSs) : The supplied pooled signal is not valid!

I am new to CAGE data and relatively inexperienced with R. Can you suggest a work around for me? I looked into the suggestions of rtracklayer::import / rtracklayer::export, but can't seem to get that working either (but that is probably my poor R skills).

Any help would be greatly appreciated.

Thank you, Christy.

MalteThodberg commented 4 years ago

Hello Christy!

Exciting to see you are considering CAGEfightR for your analysis!

Problems with getting CTSSs correctly are unfortunately common. Last Bioconductor version I added some conversion functions: Did you try to convert BED-files (e.g. from CAGEr or whatever pipeline you are using for processing) with convertBED2BigWig?

christyvj commented 4 years ago

I did see documentation for this, but couldn't get it going on our cluster (function not recognised). I thought maybe it had been removed because it was experimental. Maybe the Bioconductor version we have is old. Will ask my cluster people and give it a go. Will let you know how it goes. Thanks for your quick reply!!

christyvj commented 4 years ago

Unfortunately, still not working :( . Here is the code I used in case you can see anything I have obviously done wrong. BedGraph files were createdfrom bam files imported into CAGEr and exported out as BedGraph (for each sample, one bedgraph for plus strand and one for minus strand).

converting bedgraph files to bigwig

Design<-read.table(file="designMatrix_bedGraph.txt",header=TRUE)

bg_plus <- system.file("extdata",Design$BedGraphPlus, package = "CAGEfightR") bg_minus <- system.file("extdata",Design$BedGraphMinus, package = "CAGEfightR") bw_plus <- gsub("bedGraph","bw",bg_plus) bw_minus <- gsub("bedGraph","bw",bg_minus)

genomeInfo <- SeqinfoForUCSCGenome("bosTau9")

convertBedGraph2BigWig(input=bg_plus, output=bw_plus,genome=genomeInfo) _Attempting to convert file: File1_plus.bedGraph Attempting to convert file: File2_plus.bedGraph Attempting to convert file: File3_plus.bedGraph … Attempting to convert file: File25_plus.bedGraph Warning message: In convertBedGraph2BigWig(input = bg_plus, output = bwplus, genome = genomeInfo) : Input bedGraph files do not all have the proper .bedGraph extension! Attempting import anyway...

convertBedGraph2BigWig(input=bg_minus, output=bw_minus,genome=genomeInfo) _Attempting to convert file: File1_minus.bedGraph Attempting to convert file: File2_minus.bedGraph Attempting to convert file: File3_minus.bedGraph … Attempting to convert file: File25_minus.bedGraph Warning message: In convertBedGraph2BigWig(input = bg_minus, output = bwminus, genome = genomeInfo) : Input bedGraph files do not all have the proper .bedGraph extension! Attempting import anyway...

using converted bigwig files

DesignBW<-read.table(file="designMatrix_bw.txt",header=TRUE) row.names(DesignBW)<-DesignBW$Name bw_plus <- system.file("extdata",DesignBW$BigWigPlus, package = "CAGEfightR") bw_minus <- system.file("extdata",DesignBW$BigWigMinus,package = "CAGEfightR") bw_plus <- BigWigFileList(bw_plus) bw_minus <- BigWigFileList(bw_minus) names(bw_plus) <- DesignBW$Name names(bw_minus) <- DesignBW$Name

CTSSs <- quantifyCTSSs(plusStrand=bw_plus,minusStrand=bw_minus,design=DesignBW,genome=genomeInfo) Checking design... Checking supplied genome compatibility... Iterating over 1 genomic tiles in 25 samples using 46 worker(s)... Importing CTSSs from plus strand... Importing CTSSs from minus strand... Merging strands... Formatting output... CTSS summary Number of samples: 25 Number of CTSSs: 92.95 millions Sparsity: 94.5 % Type of rowRanges: StitchedGPos Final object size: 2.19 GB

CTSSs <- CTSSs %>% calcTPM() %>% calcPooled() Calculating library sizes... Calculating TPM...

TCs <- quickTSSs(CTSSs) Using existing score column! Running clusterUnidirectionally: Splitting by strand... Slice-reduce to find clusters... Calculating statistics... Preparing output... Tag clustering summary: Width Count Percent Total 26300981 1e+02 % =1 21412423 8e+01 % =10 4830871 2e+01 % =100 57662 2e-01 % =1000 25 1e-04 % Running quantifyClusters: Finding overlaps... Aggregating within clusters...

BCs <- quickEnhancers(CTSSs) Using existing score column! Running clusterBidirectionally: Pre-filtering bidirectional candidate regions... Retaining for analysis: 98% Splitting by strand... Calculating windowed coverage on plus strand... Calculating windowed coverage on minus strand... Calculating balance score... Error in (function (PD, MD, PU, MU) : BC calculation stopped because some sites had negative pooled signal!

And that’s as far as I get. Still the same message about negative pooled signal. Is it safe to try the convertBAM2BigWig function?

MalteThodberg commented 4 years ago

This does indeed look a bit strange!

You could try using convertBAM2BigWig, however I haven't tested it very much yet + it's not super fast. So use at your own risk...

A few things:

For specifying file paths, you don't need the bw_plus <- system.file("extdata", ... , package = "CAGEfightR"), you can simply type out the path on your hard drive, e.g. "/path/to/my/bigwigs/sample1_plus.bw"

I don't remember exactly how CAGEr does it, but the file conversions seem to produce some warnings: Are you sure the files are properly formatted? You can try and run the checkCTSSs on a few of the files to see if they contain what you expect. If you just need to load the data into R without running the full CAGEfightR pipeline, you can simple use rtracklayer::import.

Hope this helps!

mirkocelii commented 4 years ago

Hello Malte I'me having a similar error for the people above when running:

BCs <- quickEnhancers(CTSSs) Error in (function (PD, MD, PU, MU) : BC calculation stopped because some sites had negative pooled signal!

my data were exported from cageR with these options:

exportCTSStoBedGraph(ce.rep, values = "raw", format = "BigWig")

I also tried to export them with, but still the same issue

exportCTSStoBedGraph(ce.rep, values = "normalized", format = "BigWig")

And I also tried to export the in bed format and covert them with rtracklayer as indicated above, but i stll have the same error. Did i do something wrong?

exportToBed(object = ce.rep, what = "tagClusters", qLow = 0.1, qUp = 0.9, oneFile = FALSE)

library("BSgenome.Hsapiens.UCSC.hg19")
bsg <- BSgenome.Hsapiens.UCSC.hg19 

gr = rtracklayer::import(  "myfile.tagClusters.qLow0.1_qUp0.9.bed" )

#add seqlengths because it was missing
seq_names = names( seqlengths(gr) )
seq_length = seqlengths(bsg) [ seq_names ]
seqlengths(gr)   =  seq_length

split(gr,  strand(gr))
gr.plus  <- gr[strand( gr) == "+"]
gr.minus <- gr[strand( gr) == "-"]

rtracklayer::export(gr.plus   , myfile.CTSS.RAW.plus.bw   , format = "BigWig")
rtracklayer::export(gr.minus, myfile.CTSS.RAW.minus.bw, format =   "BigWig")
MalteThodberg commented 4 years ago

The plot thickens! Would be nice to get to the bottom of this: Ideally CAGEr and CAGEfightR should be highly compatible.

There's a few things you could try to nail down the issue:

mirkocelii commented 4 years ago

Hi Malte, thanks for your suggestions. Do you mean exporting CAGEr data in BEDGRAPH file? there are 2 fuctions for exporting data

Initially I exported CTSS data on bigwig format, and later in bedgraph(see below)

BigWig format after importing them with quantifyCTSSs(), there are negative counts displayed by tail() function, is that expected?

exportCTSStoBedGraph(ce.rep, values = "raw" , format = "BigWig")
CTSSs = quantifyCTSSs(plusStrand = bw_plus, minusStrand = bw_minus, genome = seqinfo(bsg), design = df) 
assay(CTSSs, "counts") %>%  head

6 x 57 sparse Matrix of class "dgCMatrix"
   [[ suppressing 57 column names 'CN1_MB', 'CN1_MT01', 'CN1_MT02' ... ]]

[1,] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 . . . . . . . . . . . . . . . . . . . . .
[2,] . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[3,] . . . . . . . . . . . . . . . . 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
[4,] . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 .
[5,] . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . 9 .
[6,] . . . . . . . . 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . 3 . 1 . . . . .
> assay(CTSSs, "counts") %>%  tail
6 x 57 sparse Matrix of class "dgCMatrix"
   [[ suppressing 57 column names 'CN1_MB', 'CN1_MT01', 'CN1_MT02' ... ]]

[19030497,] . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -1 . . . . . . . . .  . . . . . . . . . .  .
[19030498,] . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . . -1 . . . . . . . . .  .
[19030499,] . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . .  . . . . . . . . . . -1
[19030500,] . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . . -1 . . . . . . . . .  .
[19030501,] . . . . -7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  . . . . . . . . . .  . . . . . . . . . .  .
[19030502,] . . . .  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -1 . . . . . . . . .  . . . . . . . . . .  .
>

BedGraph format 1) So I tried to export CAGEr data in bedgraph files :**

exportCTSStoBedGraph(ce.rep, values = "raw", format = "bedGraph", oneFile = FALSE)

To convert bedgraph files to bigwig i tried 2 ways:

2a ) I converted each bedgraph file with convertBedGraph2BigWig() in bigwig and I have this warning message

for (i in bedGraph_files) 
{   j=gsub(".bedGraph",".bw",i) ;
    j=gsub("raw","Raw_converted",j) ;
    convertBedGraph2BigWig(i, j, genome=SeqinfoForUCSCGenome(genome))
    }

Attempting to convert file: my_file.raw.minus.bedGraph
Warning message:
In convertBedGraph2BigWig( Bedgraph_file , BigWig_file , genome = SeqinfoForUCSCGenome("hg38")) :
Input bedGraph files do not all have the proper .bedGraph extension! Attempting import anyway

2b ) I converted each bedgraph file in bigwig by importing/exporting the object into R with rtracklayer and I checked that the scores were all positive

for (i in bedGraph_files) 
{
    j=gsub(".bedGraph",".bw",i) ;
    j=gsub("raw","Raw_conv2",j) ;
    gr = rtracklayer::import( i, genome=SeqinfoForUCSCGenome("hg38") )
    sum(gr$score <0)
[1] 0
    rtracklayer::export( gr , j , format = "BigWig")
    }

The files produced with these 2 steps are identical to the original bigwig file exported from CAGEr

ll sampleA.CTSS*aw*plus*.bw
-rw-r--r-- 1 celiim  3748056 Sep  7 14:48 sampleA.CTSS.Raw_conv2.plus.bw
-rw-r--r-- 1 celiim  3748056 Sep  7 13:05 sampleA.CTSS.Raw_converted.plus.bw
-rw-r--r-- 1 celiim  3748056 Sep  7 11:29  sampleA.CTSS.raw.plus.bw

3 ) I run checkCTSS() on all the 3 bigwig dataset and I get a warning again, but since I have 2 separate files for each strand, I think i can ignore the warning, right?

checkCTSSs(as.character(my_file.minus.bw)

Attempting import of supplied filename...
[1] TRUE
Warning message:
In checkCTSSs(object) :
CTSSs are unstranded: CTSSs must be either on the + or - strand. This warning can be ignored if a BigWig/bedGraph file was supplied, in which case each strand is stored in a separate file

4 ) I ran CAGEfightR commands untill I got the same error

CTSSs = quantifyCTSSs(plusStrand = bw_plus ,minusStrand = bw_minus,  genome = seqinfo(bsg), design = df) 
CTSSs =  CTSSs %>%  calcTPM()
CTSSs =  CTSSs %>%  calcPooled()
TCs <- quickTSSs(CTSSs)       
TSSs =  TCs %>%   calcTPM() %>%  subsetBySupport(inputAssay="TPM",   unexpressed=1,  minSamples=3 )   
BCs <- quickEnhancers(CTSSs)  

Using existing score column!
Running clusterBidirectionally:
Pre-filtering bidirectional candidate regions...
Retaining for analysis: 75%
Splitting by strand...
Calculating windowed coverage on plus strand...
Calculating windowed coverage on minus strand...
Calculating balance score...
Error in (function (PD, MD, PU, MU)  :
  BC calculation stopped because some sites had negative pooled signal!
MalteThodberg commented 4 years ago

Aha - for libraries where you do see negative counts, did you try looking directly at the bedGraph files? Try and use rtracklayer::import() on the specific library, then use score() to extract the counts and see if any are negative.

mirkocelii commented 4 years ago

Dear Malte, I had checked already the score of a random CTSS.raw.plus.bedGraph file and it looked ok. After your comment I checked all of them and I found out that the CTSS.raw.minus.bedGraph files have negative all score values. So it must be a feature of the minus stranded values? Do I need to remove the minus of the gr object before exporting to a BigWig files?

MalteThodberg commented 4 years ago

Having negative values in the bedGraph will definitely disrupt CAGEfightR. I will see if I can add a new internal check to make sure the user is properly warned of this.

I assume CAGEr is flipping the values to make pretty genome browser files, with plus-strand tags counting upwards and minus-strand tags counting downwards (kinda like the Zenbu genome browser).

A quick workaround that might work: Build the RangedSummarizedExperiment of CTSSs, and then make a new assay of only absolute counts: e.g. assay(CTSSs, "counts") <- abs(assay(CTSSs, "counts")).

You could also modify the bedGraph files themselves, e.g. by using rtracklayer::import and then setting score(x) <- abs(score(x))

(Or you could ask Charles Plessy to add the option of not flipping scores in the bedGraph files in CAGEr 😉)

christyvj commented 4 years ago

Hi Malte,

I've been AWOL while attending to other work priorities. Just got back to this this morning and tried your latest solution, assay(CTSSs, "counts") <- abs(assay(CTSSs, "counts")) after quantifyCTSSs()

It worked!!!

Just wanted to say thank you for your time in helping solve my problem (and thank you to mirkocelii too for keeping the ball rolling in my absence!).

Fingers crossed it's smooth sailing from here.

Cheers, Christy

BCs <- quickEnhancers(CTSSs) Using existing score column!

- Running clusterBidirectionally: Pre-filtering bidirectional candidate regions... Retaining for analysis: 98% Splitting by strand... Calculating windowed coverage on plus strand... Calculating windowed coverage on minus strand... Calculating balance score... Slice-reduce to find bidirectional clusters... Calculating statistics... Preparing output... # Bidirectional clustering summary: Number of bidirectional clusters: 903907 Maximum balance score: 1 Minimum balance score: 0.950000058268218 Maximum width: 2370 Minimum width: 401

- Running subsetByBidirectionality: Calculating bidirectionality... Subsetting... Removed 489760 out of 903907 regions (54.2%)

- Running quantifyClusters: Finding overlaps... Aggregating within clusters...

MalteThodberg commented 4 years ago

Glad it worked! I would do a few sanity check that the abs(counts) workaround is working as intended - CAGEfightR isn't technically built to handle negative counts, but I reckon it shouldn't make a difference. I would probably advice on fixing the bedGraph files themselves in the long term.

I will leave this issue open until I have added a check for negative counts in checkCTSSs!

mirkocelii commented 4 years ago

Dear Malte, thanks for your help. By removing the "-" to minus strand files, and it works !

I've already stressed enough Charles Plessy for CAGEr too, maybe you can add this instruction it in the vignette?

MalteThodberg commented 4 years ago

I will look into it, perhaps I could add some extra code to convertBedGraph2BigWig to handle this case