Tarela / SELMA

Intrinsic bias estimation for improved analysis of bulk and single-cell chromatin accessibility profiles using SELMA
GNU Affero General Public License v3.0
3 stars 3 forks source link

Bias matrices for other experiments #6

Open cahn20 opened 10 months ago

cahn20 commented 10 months ago

Hello,

It looks like SELMA has built-in bias score matrices for DNaseI and Tn5, each called by the '-t DATATYPE' flag. Is it possible for the user to generate/use custom bias matrices for other enzymes as well? (e.g. MNase). Thanks in advance.

Best, Chris

Tarela commented 10 months ago

Hi Chris, Sorry for replying late. Are you still suffering from the cutSum issue? I double-checked the code with the testing data (provided in this github repo), and other real data before, and everything looks good. I just reprocessed the data with the cmd you pasted, and it looks good in my environment. If you still suffer from that, could you please rerun it with the --keep-tmp parameter and let me know what the folder structure looks like (e.g., file sizes of all the files in main/tmp folders).

For the DATATYPE question, 1. using MNase/benzonase/cyanse or other enzymes for bias matrix is a good idea, and I've tried them before. Some of them have a significant/enzyme-specific bias pattern. SELMA used DNase and Tn5 only because these two are most popular in transcriptional regulation studies (DNase-seq/ATAC-seq). Does your MNase-seq data have enough chrM reads? If so, you can just set the "--bias chrM" parameter to let SELMA estimate the bias from chrM reads. If not, We can further discuss an alternative solution, and I can also make an updated version specific to your issue if necessary.

I hope that helps. Feel free to provide more information or request, and I'd be very happy to help.
Best Shawn

cahn20 commented 10 months ago

I used the --keeptmp flag and I'm still getting the same error. I don't see a tmp folder, so I just have the contents of the output folder below:

[ 416] . ├── [ 11K] hg38.sizes ├── [ 0] testsc_chrM.bed ├── [1.4G] testsc_chromatin.bed ├── [1.4G] testsc_highQcellReads.bed ├── [ 46] testsc_peakFeatures.txt ├── [ 18K] testsc_peakXcellMat.txt ├── [1.1K] testsc_progress.log ├── [ 0] testsc_scOVcleavage.bed ├── [241K] testsc_summitEXT.bed ├── [ 95K] testsc_tmpSCpeaks.bed └── [ 59M] testsc_tmpSCreads.bed

No need to rush though, as I probably won't be using sc mode.

As for using other enzymes, I don't have any chrM reads available. Is it not feasible to create duplicates of my fragments and rename them as chrM as you've mentioned?

cahn20 commented 10 months ago

I actually do have chrM reads available, sorry for the misinformation. Also, is there a way to obtain "bias corrected" bw tracks?

Tarela commented 10 months ago

Hi Chris,

I am glad you found the chrM reads. You can simply run the code and generate chrMbiasMatrix from two samples (if you have one) and compare the correlation coefficient between the estimated bias from the two samples if they should have a good correlation (e.g., a correlated scatter plot with a 0.8 or higher correlation coefficient). BTW according to my experience on MNase from my MNase-seq data analysis many years ago, 10-mer (k=10) might not be the best option. I believe 6-mer or 8-mer might be better (try 6-mer, especially when your chrM reads are not too many, like < 50k).

I believe running SELMA bulk mode with --keeptmp parameter will output the biasCorrected bigwig file (with a similar filename). It would be great if you could try that as well. I'll also confirm that later and let you know.

Hope that helps. Best Shawn

On Mon, Nov 20, 2023 at 8:35 AM Chris @.***> wrote:

I actually do have chrM reads available, sorry for the misinformation. Also, is there a way to obtain "bias corrected" bw tracks?

— Reply to this email directly, view it on GitHub https://github.com/Tarela/SELMA/issues/6#issuecomment-1819077050, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA6QQAEMQNITY7YTW74ZDYLYFNMDBAVCNFSM6AAAAAA7OY7B4SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJZGA3TOMBVGA . You are receiving this because you commented.Message ID: @.***>

-- Sheng'en Shawn Hu, PhD Postdoctoral research associate University of Virginia

cahn20 commented 9 months ago

Thanks for the reply Shawn.

I have around 10k reads on chrM so I guess it might not be enough, but as you've mentioned I still tried running SELMA with the --bias chrM parameter and I'm getting an error saying that there is no naked DNA bias matrix, cannot estimate bias. The -t parameter does not matter here, right?

The command I used and the output log are as follows:

Command: SELMA -m bulk -i fragment.bed.gz -p peak.bed -g hg38 -f PE -o test_chrM -t ATAC --bias chrM -s hg38.2bit --keeptmp --kmer 6

Output Log: /usr/local/bin/SELMA:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import('pkg_resources').run_script('SELMA==1.0', 'SELMA') process log is printed into test_chrM_progress.log Start SELMA bulk Step0: check input Data and parameters detected bed file format is SE detected bed file format SE is different from the input file format PE. Please double check the parameter -f(format) bedtools installed bigWigSummary(UCSCtools) installed twoBitToFa(UCSCtools) installed twobitInfo(UCSCtools) installed bedGraphToBigWig(UCSCtools) installed Rscript installed no naked DNA bias matrix, use mtDNA reads to estimate Step0 check input Data and parameters DONE Step1: check quality and reformat data summarize reads count distribution Step1: check quality and reformat data DONE running time for Step1 : 14.279722929000854 Step2: bias estimation chrM reads number < 500k, obtain pre-processed bias matrix from naked DNA data [ERROR] no naked DNA bias matrix, cannot estimate bias error occurs, check log file~!

The files created during the SELMA run look like the following: [ 192] test_chrM ├── [ 11K] hg38.sizes ├── [678K] test_chrM_chrM.bed ├── [430M] test_chrM_chromatin.bed └── [ 885] test_chrM_progress.log

Thanks.

cahn20 commented 9 months ago

I apologize for all the questions.

I tried to see if running the test command for bulk mode provided by the SELMA readme with --keeptmp gave the bias "corrected" cleavage tracks, but I can't find them. Perhaps I'm missing something?

Command SELMA -m bulk -i testdata_reads.bed.gz -p testpeak.bed -g hg38 -f PE -o testbulk -t ATAC -s ~/Downloads/hg38.2bit --keeptmp

Output /usr/local/bin/SELMA:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import('pkg_resources').run_script('SELMA==1.0', 'SELMA') process log is printed into testbulk_progress.log Start SELMA bulk Step0: check input Data and parameters detected bed file format is PE bedtools installed bigWigSummary(UCSCtools) installed twoBitToFa(UCSCtools) installed twobitInfo(UCSCtools) installed bedGraphToBigWig(UCSCtools) installed Rscript installed Step0 check input Data and parameters DONE Step1: check quality and reformat data summarize reads count distribution Step1: check quality and reformat data DONE running time for Step1 : 67.17461681365967 Step2: bias estimation obtain pre-processed bias matrix from naked DNA data Step2: bias estimation running time for Step2: bias estimation: 1.9309027194976807 Step3: peak detection obtain 6607 peaks from (-p) inputted, extend peak to +/- 200bp from peak center running time for Step3: peak detection 0.06051301956176758 Step4: generate cleavage and bias profile on peaks split fragments to strand specific cleavage sites remove redundant position from the extended peak file readin sequence from 2bit calculate bias expected cleavages pile up cleavage sites running time for Step4: generate cleavage and bias profile on peaks 359.35685992240906 StepFinal: summary Collect results --keeptmp was set, keep intermediate results Generate summary reports pdflatex was not installed, SELMA will not generate pdf version of summary report. Please copy the testbulk_summaryReports.tex to an environment with pdflatex installed and complie the pdf file running time for StepFinal: summary 0.497211217880249 SELMA finished. Check the folder: testbulk/summary/ for results

tmpResults Directory Structure

[ 832] tmpResults/ ├── [ 95K] chr18_mergePeaks.bed ├── [ 25M] chr18_minusCuts.bed ├── [ 14M] chr18_minusCutsOnPeak.bed ├── [ 25M] chr18_plusCuts.bed ├── [ 14M] chr18_plusCutsOnPeak.bed ├── [146K] chr19_mergePeaks.bed ├── [ 72M] chr19_minusCuts.bed ├── [ 68M] chr19_minusCutsOnPeak.bed ├── [ 72M] chr19_plusCuts.bed ├── [ 68M] chr19_plusCutsOnPeak.bed ├── [ 11K] hg38.sizes ├── [ 80M] testbulk_biasExpCuts_minus.bdg ├── [ 80M] testbulk_biasExpCuts_minus_sorted.bdg ├── [ 80M] testbulk_biasExpCuts_plus.bdg ├── [ 80M] testbulk_biasExpCuts_plus_sorted.bdg ├── [ 0] testbulk_chrM.bed ├── [1.4G] testbulk_chromatin.bed ├── [ 64M] testbulk_cleavage_minus.bdg ├── [1.0G] testbulk_cleavage_minus.bed ├── [ 64M] testbulk_cleavage_minus_sorted.bdg ├── [ 64M] testbulk_cleavage_plus.bdg ├── [1.0G] testbulk_cleavage_plus.bed ├── [ 64M] testbulk_cleavage_plus_sorted.bdg └── [152K] testbulk_summitEXTmerge.bed

Tarela commented 9 months ago

Thanks for letting me know. The error is caused by limited chrM reads as you mentioned. I preset a cutoff for sufficient chrM reads to ensure the estimation is robust. I’ll customize a script with chromatin reads for bias estimation for you. I’ll try to send you ASAP. Have a wonderful thanksgiving!BestShawnOn Nov 21, 2023, at 2:07 PM, Chris @.***> wrote: I apologize for all the questions. I tried to see if running the test command for bulk mode provided by the SELMA readme with --keeptmp gave the bias "corrected" cleavage tracks, but I can't find them. Perhaps I'm missing something? Command SELMA -m bulk -i testdata_reads.bed.gz -p testpeak.bed -g hg38 -f PE -o testbulk -t ATAC -s ~/Downloads/hg38.2bit --keeptmp Output /usr/local/bin/SELMA:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import('pkg_resources').run_script('SELMA==1.0', 'SELMA') process log is printed into testbulk_progress.log Start SELMA bulk Step0: check input Data and parameters detected bed file format is PE bedtools installed bigWigSummary(UCSCtools) installed twoBitToFa(UCSCtools) installed twobitInfo(UCSCtools) installed bedGraphToBigWig(UCSCtools) installed Rscript installed Step0 check input Data and parameters DONE Step1: check quality and reformat data summarize reads count distribution Step1: check quality and reformat data DONE running time for Step1 : 67.17461681365967 Step2: bias estimation obtain pre-processed bias matrix from naked DNA data Step2: bias estimation running time for Step2: bias estimation: 1.9309027194976807 Step3: peak detection obtain 6607 peaks from (-p) inputted, extend peak to +/- 200bp from peak center running time for Step3: peak detection 0.06051301956176758 Step4: generate cleavage and bias profile on peaks split fragments to strand specific cleavage sites remove redundant position from the extended peak file readin sequence from 2bit calculate bias expected cleavages pile up cleavage sites running time for Step4: generate cleavage and bias profile on peaks 359.35685992240906 StepFinal: summary Collect results --keeptmp was set, keep intermediate results Generate summary reports pdflatex was not installed, SELMA will not generate pdf version of summary report. Please copy the testbulk_summaryReports.tex to an environment with pdflatex installed and complie the pdf file running time for StepFinal: summary 0.497211217880249 SELMA finished. Check the folder: testbulk/summary/ for results Output Directory Structure [ 832] tmpResults/ ├── [ 95K] chr18_mergePeaks.bed ├── [ 25M] chr18_minusCuts.bed ├── [ 14M] chr18_minusCutsOnPeak.bed ├── [ 25M] chr18_plusCuts.bed ├── [ 14M] chr18_plusCutsOnPeak.bed ├── [146K] chr19_mergePeaks.bed ├── [ 72M] chr19_minusCuts.bed ├── [ 68M] chr19_minusCutsOnPeak.bed ├── [ 72M] chr19_plusCuts.bed ├── [ 68M] chr19_plusCutsOnPeak.bed ├── [ 11K] hg38.sizes ├── [ 80M] testbulk_biasExpCuts_minus.bdg ├── [ 80M] testbulk_biasExpCuts_minus_sorted.bdg ├── [ 80M] testbulk_biasExpCuts_plus.bdg ├── [ 80M] testbulk_biasExpCuts_plus_sorted.bdg ├── [ 0] testbulk_chrM.bed ├── [1.4G] testbulk_chromatin.bed ├── [ 64M] testbulk_cleavage_minus.bdg ├── [1.0G] testbulk_cleavage_minus.bed ├── [ 64M] testbulk_cleavage_minus_sorted.bdg ├── [ 64M] testbulk_cleavage_plus.bdg ├── [1.0G] testbulk_cleavage_plus.bed ├── [ 64M] testbulk_cleavage_plus_sorted.bdg └── [152K] testbulk_summitEXTmerge.bed

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

Tarela commented 9 months ago

The one ends with .bdg is the one you need(biasExpCut). Usually it will automatically transform to bigwig format but it seems that you don’t have UCSCtools installed so the software cannot find the bdg2bw script for the auto transformation. Here you have two option 1. Download and install UCSCtools and transform it yourself or rerun Selma.  2. The bdg file is actually an old format called .wiggle file. You can replace the suffix too .wig and then you can load it using IGV or other browserI’m sorry that I didn’t propose a bias corrected track in SElMa (sorry I just realized that’s what you need probably). If you need a bias corrected cut track you can simply subtract the biasExpCut track (before or after normalization) from the observed cut track in basepair resolution or other resolutions. Hope that helps. Any further comments, suggestions and questions are welcome. BestShawn On Nov 21, 2023, at 2:07 PM, Chris @.***> wrote: I apologize for all the questions. I tried to see if running the test command for bulk mode provided by the SELMA readme with --keeptmp gave the bias "corrected" cleavage tracks, but I can't find them. Perhaps I'm missing something? Command SELMA -m bulk -i testdata_reads.bed.gz -p testpeak.bed -g hg38 -f PE -o testbulk -t ATAC -s ~/Downloads/hg38.2bit --keeptmp Output /usr/local/bin/SELMA:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html import('pkg_resources').run_script('SELMA==1.0', 'SELMA') process log is printed into testbulk_progress.log Start SELMA bulk Step0: check input Data and parameters detected bed file format is PE bedtools installed bigWigSummary(UCSCtools) installed twoBitToFa(UCSCtools) installed twobitInfo(UCSCtools) installed bedGraphToBigWig(UCSCtools) installed Rscript installed Step0 check input Data and parameters DONE Step1: check quality and reformat data summarize reads count distribution Step1: check quality and reformat data DONE running time for Step1 : 67.17461681365967 Step2: bias estimation obtain pre-processed bias matrix from naked DNA data Step2: bias estimation running time for Step2: bias estimation: 1.9309027194976807 Step3: peak detection obtain 6607 peaks from (-p) inputted, extend peak to +/- 200bp from peak center running time for Step3: peak detection 0.06051301956176758 Step4: generate cleavage and bias profile on peaks split fragments to strand specific cleavage sites remove redundant position from the extended peak file readin sequence from 2bit calculate bias expected cleavages pile up cleavage sites running time for Step4: generate cleavage and bias profile on peaks 359.35685992240906 StepFinal: summary Collect results --keeptmp was set, keep intermediate results Generate summary reports pdflatex was not installed, SELMA will not generate pdf version of summary report. Please copy the testbulk_summaryReports.tex to an environment with pdflatex installed and complie the pdf file running time for StepFinal: summary 0.497211217880249 SELMA finished. Check the folder: testbulk/summary/ for results Output Directory Structure [ 832] tmpResults/ ├── [ 95K] chr18_mergePeaks.bed ├── [ 25M] chr18_minusCuts.bed ├── [ 14M] chr18_minusCutsOnPeak.bed ├── [ 25M] chr18_plusCuts.bed ├── [ 14M] chr18_plusCutsOnPeak.bed ├── [146K] chr19_mergePeaks.bed ├── [ 72M] chr19_minusCuts.bed ├── [ 68M] chr19_minusCutsOnPeak.bed ├── [ 72M] chr19_plusCuts.bed ├── [ 68M] chr19_plusCutsOnPeak.bed ├── [ 11K] hg38.sizes ├── [ 80M] testbulk_biasExpCuts_minus.bdg ├── [ 80M] testbulk_biasExpCuts_minus_sorted.bdg ├── [ 80M] testbulk_biasExpCuts_plus.bdg ├── [ 80M] testbulk_biasExpCuts_plus_sorted.bdg ├── [ 0] testbulk_chrM.bed ├── [1.4G] testbulk_chromatin.bed ├── [ 64M] testbulk_cleavage_minus.bdg ├── [1.0G] testbulk_cleavage_minus.bed ├── [ 64M] testbulk_cleavage_minus_sorted.bdg ├── [ 64M] testbulk_cleavage_plus.bdg ├── [1.0G] testbulk_cleavage_plus.bed ├── [ 64M] testbulk_cleavage_plus_sorted.bdg └── [152K] testbulk_summitEXTmerge.bed

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>