UcarLab / CoRE-ATAC

MIT License
10 stars 1 forks source link

Negative Array Size Exception at "Getting insert features" step. #3

Open dylan6thomas opened 2 years ago

dylan6thomas commented 2 years ago
Screen Shot 2022-01-05 at 7 03 00 PM

I am trying to run CoRE-ATAC feature extraction from the JupyterLab command prompt. However, when I reach the "getting insert features" step I am returned with a huge number of Negative Array Size Exceptions. I tried running the PEAS alone based on the instructions from the PEAS manual, but I was returned with a null pointer exception. I tried changing datasets, but I cannot figure out what is wrong. The dataset I am using is here.

The command I ran is the following: singularity exec ./CoRE-ATAC-FeaturePredictor-hg19/ /CoRE-ATAC/CoRE-ATACFeatureExtraction-singularity.sh /data/home/dylan.thomas/CoRE-ATAC/Data/PBMC1k/atac_pbmc_1k_v1_possorted.bam /data/home/dylan.thomas/CoRE-ATAC/Data/PBMC1k/atac_pbmc_1k_v1_peaks.bed hg19 /data/home/dylan.thomas/CoRE-ATAC/Data/unzipped/hg19.fa /data/home/dylan.thomas/CoRE-ATAC/Data/chromosomes.txt /data/home/dylan.thomas/CoRE-ATAC/Data/core-atac-out/pbmc1k/

ajt986 commented 2 years ago

Hello, this error occurs when the peak lengths are smaller than 5bp. PEAS tries to identify dense cut sites by using a sliding window of 5bp and therefore expects the peak lengths to at least be this size. What this tells me is that there might be a problem with the peaks input file. It's reading peaks, but for some reason it's reading small peak lengths. What is the length distribution of the peak file?

dylan6thomas commented 2 years ago

Sorry, I'm still new to the types for file formats that are used in ATAC-seq reads. How would I be able to find the peak length distribution?

ajt986 commented 2 years ago

This would just be the bed file. The format should be a tab delimited file with the first three columns being: chromosome, start, end. The length is end-start and all of these should be > 5bp to avoid the error. I would expect the peak lengths to be in the hundreds for the majority.

dylan6thomas commented 2 years ago

If I run feature extraction without removing these peaks, will it cause the output to be inaccurate, or will it just omit the peaks of <5bp?

ajt986 commented 2 years ago

Consider recalling peaks since it seems like there are many of these <5bp length peaks. I'm curious if these are the peak summit positions. Do you mind showing an example of the first few lines of the peak file used as input?

dylan6thomas commented 2 years ago
Screen Shot 2022-01-06 at 1 46 34 PM

Here is a screenshot of my peak file. They're directly from 10x Genomics, so I am not sure why I am having trouble with this dataset. I tried using a 500 cell one and that worked just fine, but the 1k cell dataset is not working. I am not sure what you mean by recalling peaks.

ajt986 commented 2 years ago

This looks fine to me, you shouldn't need to recall the peaks (using MACS2 on the bam file for example). I guess there are a few instances where the peak length is < 5bp? If you use excel and subtract the start position from the end position, what percent of peaks are < 5 in length? The error suggests that they exist, but if there aren't any then there may be something else going on. If it's low, then you can consider removing these and running CoRE-ATAC on the filtered set.

dylan6thomas commented 2 years ago

Will I have to change the bam file after doing that, or are they not related?

ajt986 commented 2 years ago

No, the bam file contains all the sequence reads. The peaks are just where a peak caller is seeing a spike in the number of reads to separate real chromatin accessibility from noise. This is why it's surprising to see this error since I wouldn't expect to see a peak < 5bp in length when each end of the paired-end reads are 50bp.

dylan6thomas commented 2 years ago
Screen Shot 2022-01-06 at 2 20 53 PM

There seems to only be 111 out of 38347 peaks with less than 5bp, will that be a problem?

ajt986 commented 2 years ago

Shouldn't be a problem removing these peaks from the bed file and running CoRE-ATAC on the filtered peak set. Alternatively, if you want to keep them, you can pad the start and end positions so that they are long enough.

dylan6thomas commented 2 years ago

Okay, sounds good! I'll let you know when CoRE-ATAC is done running. Thank you!