Open dylan6thomas opened 2 years ago
Hello, this error occurs when the peak lengths are smaller than 5bp. PEAS tries to identify dense cut sites by using a sliding window of 5bp and therefore expects the peak lengths to at least be this size. What this tells me is that there might be a problem with the peaks input file. It's reading peaks, but for some reason it's reading small peak lengths. What is the length distribution of the peak file?
Sorry, I'm still new to the types for file formats that are used in ATAC-seq reads. How would I be able to find the peak length distribution?
This would just be the bed file. The format should be a tab delimited file with the first three columns being: chromosome, start, end. The length is end-start and all of these should be > 5bp to avoid the error. I would expect the peak lengths to be in the hundreds for the majority.
If I run feature extraction without removing these peaks, will it cause the output to be inaccurate, or will it just omit the peaks of <5bp?
Consider recalling peaks since it seems like there are many of these <5bp length peaks. I'm curious if these are the peak summit positions. Do you mind showing an example of the first few lines of the peak file used as input?
Here is a screenshot of my peak file. They're directly from 10x Genomics, so I am not sure why I am having trouble with this dataset. I tried using a 500 cell one and that worked just fine, but the 1k cell dataset is not working. I am not sure what you mean by recalling peaks.
This looks fine to me, you shouldn't need to recall the peaks (using MACS2 on the bam file for example). I guess there are a few instances where the peak length is < 5bp? If you use excel and subtract the start position from the end position, what percent of peaks are < 5 in length? The error suggests that they exist, but if there aren't any then there may be something else going on. If it's low, then you can consider removing these and running CoRE-ATAC on the filtered set.
Will I have to change the bam file after doing that, or are they not related?
No, the bam file contains all the sequence reads. The peaks are just where a peak caller is seeing a spike in the number of reads to separate real chromatin accessibility from noise. This is why it's surprising to see this error since I wouldn't expect to see a peak < 5bp in length when each end of the paired-end reads are 50bp.
There seems to only be 111 out of 38347 peaks with less than 5bp, will that be a problem?
Shouldn't be a problem removing these peaks from the bed file and running CoRE-ATAC on the filtered peak set. Alternatively, if you want to keep them, you can pad the start and end positions so that they are long enough.
Okay, sounds good! I'll let you know when CoRE-ATAC is done running. Thank you!
I am trying to run CoRE-ATAC feature extraction from the JupyterLab command prompt. However, when I reach the "getting insert features" step I am returned with a huge number of Negative Array Size Exceptions. I tried running the PEAS alone based on the instructions from the PEAS manual, but I was returned with a null pointer exception. I tried changing datasets, but I cannot figure out what is wrong. The dataset I am using is here.
The command I ran is the following:
singularity exec ./CoRE-ATAC-FeaturePredictor-hg19/ /CoRE-ATAC/CoRE-ATACFeatureExtraction-singularity.sh /data/home/dylan.thomas/CoRE-ATAC/Data/PBMC1k/atac_pbmc_1k_v1_possorted.bam /data/home/dylan.thomas/CoRE-ATAC/Data/PBMC1k/atac_pbmc_1k_v1_peaks.bed hg19 /data/home/dylan.thomas/CoRE-ATAC/Data/unzipped/hg19.fa /data/home/dylan.thomas/CoRE-ATAC/Data/chromosomes.txt /data/home/dylan.thomas/CoRE-ATAC/Data/core-atac-out/pbmc1k/