Peak*cell data - Githubissues

wawpaopao commented 2 months ago

I would like to inquire about the possibility of obtaining the peak-by-cell matrix data (pseudo-bulk ATAC seq) derived from the peak calling and filtering process. Specifically, the 500 bp peak regions would be of great value for training neural network models.

If it is not possible to share the processed data, I would greatly appreciate if you could provide some guidance on the specific steps involved in the processing pipeline. Currently, the fragments.tsv data has been made available. Particularly, I would like to clarify the requirements for performing peak calling on the pseudo-bulk ATAC-seq fragments. From my understanding, it would be necessary to have a CSV file containing the barcode and cell_type mapping information to get {cell_type}_ATAC.bam file.

In the methods section of peak calling and filtering, ''''For comparative analysis of human ATAC–seq peaks, we first removed peaks that were mapping to a region in any of the four species that had low read mappability. To identify regions with low mappability in our ATAC–seq data, we counted all reads in 1 kb bins across each genome. We took 1 kb bins with 0 reads and, for the remaining bins, we took the 0.02 quantile for the number of reads mapped and extended by 1 kb in both directions giving us 3 kb low-mappability bins. Finally, low-mappability bins within 5 kb were stitched together, providing our final list of low ATAC–seq mappability regions. Peaks or orthologous elements falling within any of these regions in any species were excluded from the comparative analysis.'''' I wanted to ask if the code corresponding to this part is available in the GitHub repository, as I haven't been able to locate it.

Thank you very much!

ejarmand commented 2 months ago

Hi Wawpaopao,

I would suggest using SnapATAC2.

You can load a fragment file using snapatac2.pp.import_data (api reference here). After loading the fragment file snapatac2 can make 500 bp regions throughout the geneome snapatac2.pp.add_tile_matrix(api reference here). You can also use the peaks we previously identified (supplementary tables 4, 5, 6, 7) for human, macaque, marmoset and mouse respectively. To do so you would convert these intervals into a list of strings in python with the format “chr[]:start-end”, and use that as input to snapatac2.pp.make_peak_matrix (api reference here).

To make psuedobulk files for running macs2 you can add cell metadata (found here on GEO) and export cell-type fragment files which should function equivalently for MACS2. You can run this using snapatac2.ex.export_fragments (api reference here).

I will see if we have scripts for making black lists, but it should be thoroughly enough described that it's easy to replicate. This analysis was done on bulk bamfiles (available on GEO), which can be counted using bedtools or other software. Identified regions were extended using bedtools slop, and used bedtools merge to collate nearby windows of low coverage.

Let me know if you have further questions.

wawpaopao commented 2 months ago

Thank you for the detailed suggestions and guidance!!!

ejarmand / comparative_epigenomic_motor_cortex

Peak*cell data #2