Closed cxshine closed 4 years ago
Glad I can help. I am not 100% without seeing actual data, but this probably is complaining because you have bins with input bam read (or fragment) count of 0. i.e., you have RNA count but there’s no DNA. This might be caused by incorrect —min —max parameters for inserts or simply because it is expecting paired end but the insert size is unknown for single end data. I will look into this. In the meantime, if you could show me several lines of you input and output bam, it would help me to troubleshoot.
Thank you for your help, actually I'm going to work with paired end data first. There is my input bam:
There is my output bam:
Your output bam shows insert size (size between a read pair) less than 100bp. If you haven’t specified, the default minimum insert size is 200bp. You can try lowering this cutoff to include more usable fragments. However, I’d first make sure if this is expected by the design of the experiment. Your DNA fragments are ~400bp and RNA is ~70bp.
I adjusted the arguments, but it also doesn't work. the command: starrpeaker --prefix test --chromsize $hg19_chormsize --blacklist $hg19_blacklist --cov $hg19_gc $hg19_mappability $hg19_linearfold --input input.bam --output merge_output.bam --threshold 0.05 --min 10 If you have time, you can use the input and output bam to figure it out. Thank you very much !
This error was from the peak calling step, where it finds 0 regions that has RNA output greater than normalized DNA input. This usually happens when min/max is not set properly, so many fragments/reads are filtered out during bam processing. For this, what I can do is update the code to give you more informative error message. I will look into updating the code.
I looked into this dataset more closely. This is a STARR-seq dataset on HepG2 (GEO accession GSE68331) and they are composed of 1 genomic library and 7 biological replicates. It claims to target ~100 putative regulatory elements.
After mapping to a reference human genome, I noticed their input genomic library contains many fragments that appear as optical duplicates. This implies that, instead of sonicating the genome and capturing regions of interest, their genomic fragments were selectively PCR-amplified and transfection into 7 replicates. Therefore, this dataset requires different preprocessing than other captured- or whole-genome-STARR-seq dataset. One difference would be to avoid using deduplication using Picard. Moreover, in my opinion, this specific dataset may not be suitable for STARRPeaker (or any other peak calling algorithms like Macs2). Instead, a modified version of the MPRA pipeline would be more suitable. They also mention all replicates were pooled to increase statistical power. (https://genome.cshlp.org/content/25/8/1206.long)
Alternatively, because their candidate regions are highly specific, you might simply be able to calculate RNA/DNA ratio from counting the fragment depth.
I hope this helps!
First of all, thank you for the files, and I totally used your files to process my data. And I had a new issue which is described as follows: How to solve the problem? Is there something wrong with my input.bam or output.bam?