FredHutch / EChO

Enhanced Chromatin Occupancy (EChO)
GNU General Public License v2.0
12 stars 2 forks source link

R script warning messages #2

Closed rbronste closed 4 years ago

rbronste commented 5 years ago

In running EChO_1.0.sh I received a number of warnings which by the look of it seem to emanate from the R script also being run, I did get a results file with foci at the end of the run however the foci seem to be considerably smaller than we anticipated in terms of avg fragment distributions in the output file. Wondering if you have any thoughts on this? Thanks!

Here is just a small example of the warnings and messages I saw:

Warning messages:
1: In predLoess(object$y, object$x, newx, object$s, object$weights,  :
  pseudoinverse used at 100.92

2: In predLoess(object$y, object$x, newx, object$s, object$weights,  :
  neighborhood radius 31.915

3: In predLoess(object$y, object$x, newx, object$s, object$weights,  :
  reciprocal condition number  0

4: In predLoess(object$y, object$x, newx, object$s, object$weights,  :
  There are other near singularities as well. 9

alls: loess -> simpleLoess

In addition: There were 50 or more warnings (use warnings() to see the first 50)

Execution halted
mpmeers commented 5 years ago

Hi,

One possible source of these warnings is the presence of repeat fragments that have the same position and size, which causes the loess model to fail locally since it is essentially deriving a binomial function using the same point repeated several times. I wrote in a check to ensure that the number of unique rather than total fragments in the analysis region is sufficient, and in my experience repeat fragments do not usually materially affect the results. However, would you mind producing a sample of your output so I can get a sense of how small "smaller than anticipated" is? Also, how many total fragments are in your dataset? If you have high fragment depth (>20 million), it might be worth filtering out all repeat fragments and re-running the analysis to see if that makes a difference, perhaps on a single chromosome first just to test it. Let me know if any of this is helpful.

Best,

Mike

rbronste commented 5 years ago

Hey Mike,

Thanks for the response, so in this particular instance there are ~15 million fragments in the bedpe, here is a sample of the output, this is cut&run for a TF and also provided a density plot below of the output fragment distribution at called foci. In this instance the duplicates are not removed from bedpe however duplication rates in the data are quite low between 5-10%:

chr7    25395591    25395592    65.7747619207535    0.106414835114753   0.21
chr7    25395636    25395637    45.2533208268816    0.11867223997502    0.21
chr7    25395685    25395686    43.9466790925667    0.0961508605796454  0.21
chr7    25395696    25395697    36.1515243086032    0.192744819289088   0.21
chr7    25395711    25395712    64.2233041988669    0.0811129723185085  0.21
chr7    25395780    25395781    49.6520822221844    0.140317066198767   0.21
chr8    121935428   121935429   66.4573053696521    0.127061874990014   0.22
chr8    121935482   121935483   77.4792308208646    0.12010462939569    0.22
chr8    121935502   121935503   49.1767494212278    0.135643112914727   0.22
chr8    121935522   121935523   48.0653267904713    0.165575013658304   0.22

Also just to clarify, was basing the small fragment comment based entirely on the avg fragment sizes I saw for several TFs in your Fig.1

Rplot14

mpmeers commented 5 years ago

Hi,

That is a profile that I haven't really seen in my data. Do you have a sense of the input size of the original library, or the fragment size distribution of the mapped fragments? If it happens to be the case that nearly all your fragments are less than 150 bp, this profile might be expected, since the local minima can't be larger than the largest fragments. For most of the TFs I mapped, there were enough 200+ bp fragments present that larger average fragment sizes were still well represented in the EChO foci. For reference, in my test bed file of CTCF fragments mapping to chr1, the mapped fragments are distributed across a range of sub-, mono-, and polynucleosomal sizes, and the EChO foci partition mostly into sub-nucleosomal and mononucleosomal, as seen here.

CTCF.chr1.fragments-foci.pdf

In summary, it seems like EChO might be working fine for you, but that the input is somehow calling almost exclusively small fragment sites. However, if your fragment size distribution is as varied as what I've shown here, there might be something else going on.

Mike

rbronste commented 5 years ago

Thanks for the plot Mike, exploring what may be going on here as we also expect a broader domain and maybe multiple peaks. One other question, your figure 1D plots are just 4th column of the EChO foci output right (mean fragment size) or is it something else? Thanks again!

mpmeers commented 5 years ago

I think you're referring to Figure 3D in the Mol Cell paper (https://www.sciencedirect.com/science/article/pii/S1097276519303971), correct? If so, then yes, I just used the output of column 4. If you have any CUT&RUN data from one of those factors (e.g. H3K27me3 or H3K4me2) to compare with my EChO output, that might be a good place to start in determining whether the low fragment size you're seeing is due to some sort of artifact of library generation, or is actually biologically relevant for your factor. Let me know how it goes, and if I can help with anything else.

Mike

rbronste commented 4 years ago

Still working through this, wondering if longer digestion times would affect the effectiveness of EChO, for some TFs we are doing 90 min. Also one other figure question, about how you plotted Fig 3C, was not immediately obvious to me. Wanted to look at the data in that way. Thanks again!

mpmeers commented 4 years ago

Hi,

That's certainly a possibility. Though in the original eLife CUT&RUN paper it was established that digestion is relatively consistent out to 60 minutes, I'm not sure whether we ever tried longer than that, and we typically don't digest any longer than 30 minutes for any targets we use in the lab. It might be worth trying a digestion time course with some shorter digestion times to see if that makes a difference. Also, if you're getting overdigestion, you might see peaks at accessible regions in an IgG control experiment. Normally this isn't an issue with digestion up to 30 minutes, but it might be worth checking for your experiments.

For Figure 3C, I took a sample region from CTCF data and generated a fragment size vs. position data frame in R, then used ggplot with geom_point() to generate the scatterplot and geom_smooth(method="loess", span=n) to generate the loess curve, where "n" corresponded to the span indicated in the figure. However, when using EChO, it automatically selects the span based on an optimal balance between error and roughness of the curve when testing each individual locus (in this case 0.22 was "optimal"). Hope that's clear enough.

Let me know if you have any other questions, or whether you figure out your issue with small fragment sizes.

Mike

rbronste commented 4 years ago

Very helpful thanks! So for 3C, given the scale bar, you must have had hundreds of thousands of foci (black dots in scatter plot)in the EChO output bed file?

mpmeers commented 4 years ago

So each of those points in Figure 3C actually represents an individual fragment, plotted on a position (x) vs. fragment size (y) plot (the position value is essentially arbitrary here). The foci are the local minima that appear in the loess curves plotted over the points. For example, in the case of the optimal span (middle panel, span=0.22), there are only three foci: the two highlighted in red boxes, and a third in between them that isn't highlighted (there is another dip on the left side, but I don't include edge cases since the fragment depth is typically too sparse there). In a typical experiment, I can get anywhere from zero (not enough fragment depth to calculate) to 10+ foci per CUT&RUN peak tested, depending on the length of the peak and the fragment depth, meaning an EChO output file could contain tens of thousands of loci. Apologies for the confusion, and hopefully that clarifies it.

I actually thought of one other thing regarding use of EChO--in my paper, I was calculating EChO foci entirely from SEACR peak calls, which tend to be broad and naturally incorporate larger fragments on the flanks. So if you're using a different peak caller such as MACS2, for instance, it could be the case that the peaks you're calling EChO from are rather narrow and already contain majority small fragments, meaning the calculation of the curve will bias you in the small fragment direction. Not sure exactly how you processed your data, but this might be something to consider.

Mike

rbronste commented 4 years ago

Thats very clear thanks! One final question, the following script is not included in the package of scripts here but seems to be necessary in the code:

perl scripts/fragsize.matrix.pl <intersected .bed file> <output name>

I also get this error after running matrix:

Fatal error: cannot open file 'scripts/fragsize.avg.052918.R': No such file or directory

Also get this error, maybe something about how I am setting up my bed file for matrix run:

Error in data.frame(argsL$line, span, lo$s, out3$fit[index == 0], out3$se[index ==  : 
  arguments imply differing number of rows: 1, 0
Execution halted

Thanks again!

mpmeers commented 4 years ago

So I think what happened is that you're using scripts that I deposited as the supplementary files in my paper, and that particular script was inadvertently left out. I'll need to remedy that; however, if you download all the scripts here from this GitHub page and use them as a unit, you should have everything you need. I think that should solve the first two issues, and subsequent updates I've made to the scripts on GitHub may solve the third. Let me know how it goes.

Mike

rbronste commented 4 years ago

So unfortunately I did initially download everything from github and just re-downloaded the package and ran it again and get the same result. The foci scripts work fine and I get those results however when I try to run it as a matrix on a subset of the peaks with the addition of the single bp focus as the fourth column I get the error I referenced above.

The input looks as follows:

chr17   25919809    25920242    25920026    
chr11   43681694    43682000    43681847    
chr8    126666450   126666773   126666612   
chr9    120011133   120011350   120011242

And using the following command:

bash EChO_1.0.sh test.bed test.bedpe matrix mymatrix

These are the contents of mymatrix.EChO.matrix:

Fatal error: cannot open file 'scripts/fragsize.avg.052918.R': No such file or directory

rbronste commented 4 years ago

Im wondering if the scripts presuppose some user directory/permission structure that may not be there on my computing cluster - for where some of these intermediate files are deposited.

mpmeers commented 4 years ago

My apologies--I think I found the problem. In EChO.matrix_1.0.pl, I updated the call to the R script in the main loop, but there is another call outside the loop that I failed to update. I've now updated the master branch, so hopefully it will work for you now.

Mike

rbronste commented 4 years ago

Great thanks for fixing that! Would you be able to post an example of your input formatting that goes into the matrix step?

Asking because now when I run the above input through the updated package I get the following error:

Error in read.table(argsL$frame) : no lines available in input
Execution halted
mpmeers commented 4 years ago

Hi,

It looks like you're using the original region bed file as the first input--is that correct? In addition to the standard "chr", "start" and "end" in the first three columns, the "region bed file" input must contain a fourth column that denotes the matrix centering coordinate. This fourth column is required as a reference point around which to build the output matrix, and without it the matrix mode will fail. I'll describe how I called matrix mode in my paper, since I think I haven't made the documentation clear enough (I've updated the README to try to remedy this):

To generate the four-column input region file, I typically use bedtools intersect -wao between the region bed file and the "foci" mode output EChO.bed file that I've filtered for foci of interest (e.g. small fragment sites), then filter out any regions that don't overlap a focus, and print out a new bed file that contains the first three columns from the region bed file, and the "start" coordinate from the foci output bed file as the fourth column. This new file can then be used as the first input to EChO in "matrix" mode.

The output files include a running tally file with the suffix "spans.txt", which prints line numbers and span values in order to match with the lines from the original input file (since occasionally EChO will fail to produce a matrix line owing to lack of data), and an "EChO.matrix" output that contains 401 columns with single base-pair fragment size predictions in a 200 bp window on either side of your centering foci.

Apologies that this is a little unwieldy; I may update this the future to make it more user-friendly. Let me know if this works for you, or if you're still getting fatal errors.

Mike

rbronste commented 4 years ago

Hi Mike,

So I finally got it to work, thanks for all of your help, one final question I have is about plotting and how you set up your dataframe for that. I set it up with a base by base run through the peak in the first 3 columns corresponding to the foci/fragments that fall within that peak as the next columns. However when I plot it I get something a bit different from your paper, something like the following:


     chrom   start    stop frag1 frag2 frag3    frag4    frag5    frag6     frag7
1 chr1   5022835 5023172   500   500   500 196.2992 98.06866 68.83916  96.98746
2 chr1   5022836      NA   500   500   500 191.5615 97.03804 69.53623  97.94464
3 chr1   5022837      NA   500   500   500 186.9297 96.12657 70.76395  98.81860
4 chr1   5022838      NA   500   500   500 182.4043 95.43896 72.35936  99.59252
5 chr1   5022839      NA   500   500   500 177.9855 95.05744 74.15213 100.24955
6 chr1   5022840      NA   500   500   500 173.6739 94.94865 75.97190 100.77284
my_data %>% ggplot() + 
  geom_point(aes(x = start, y = frag1)) + 
  geom_point(aes(x = start, y = frag2)) + 
  geom_point(aes(x = start, y = frag3)) +
  scale_y_continuous(limits=c(0,200)) +
  xlim(5023050, 5023100) + xlab("Genomic Position") +
  geom_smooth(aes(x = start, y = frag1), method="loess", span=.5, color = "red", fill = "black") + 
  geom_smooth(aes(x = start, y = frag2), method="loess", span=.5, color = "red", fill = "black") + 
  geom_smooth(aes(x = start, y = frag3), method="loess", span=.5, color = "red", fill = "black") +
  theme_bw()

EChO_plot

Not sure why I have this linear progression Thanks!

mpmeers commented 4 years ago

Hi,

I'm a little confused as to what you're plotting here--are the black points the entries from the matrix file? The row entries from the matrix represent predicted average fragment size in a 200 bp window on either side of the centering coordinate (column four in the input file that I described in my previous note). Therefore, going base-by-base through the peak won't actually line up with those values, since the boundaries of the peak are likely to be far away from the boundaries of the window that was constructed by the matrix. If you want to generate a plot like what appears in my Figure 3C, I'd advise building a fragment centers vs size data frame for all the fragments that overlap the specific peak you're interested in and running the following code in R:

geom_point(aes(x=position, y=size)) + geom_smooth(method="loess", span=x)

Where "x" corresponds to the calculated span value from the foci output file (column 6) for that specific peak. For instance, the first line of the EChO foci output for one of my test files is as follows:

chr1 805051 805052 36.5854831559428 1.37886370870591 0.16

The corresponding peak from the original SEACR input file is as follows:

chr1 804242 807405 1727.68 2.93081 chr1:805088-805096

So I can use bedtools intersect -wao with the SEACR file and a fragment bed file to obtain all fragments that overlap the SEACR peak boundaries, then calculate the center and size of each fragment and use them for my data frame, and call geom_smooth with a span of 0.16. Does that clarify or am I misunderstanding what you're trying to do?

Mike