eddieimada / REPAC

PA annotation details #2

Open esebesty opened 11 months ago

esebesty commented 11 months ago

Just started using REPAC and was wondering about the annotation details, found in

library("REPAC")

data("hg38_pa")
data("mm10_pa")

Both the human and the mouse annotations contain 3UTR, CDS and IN annotation types. However, I can't find any description of them or of how exactly the data was generated. For example, in the paper I see the number 67509 for human 3' UTR PAS, but the above dataset contains 68423 hg38 3UTRs. Is there a more detailed description or code somewhere that I can check? Thanks!

eddieimada commented 11 months ago

Hi Endre,

Thank you for your interest in REPAC. These annotations were derived from the polyAsite database. We took the annotations provided in the database and overlapped them with the current hg38 and mm10 annotations from the annotatr package. We also removed sites that overlapped more than one gene. The difference in the number of sites might be due to a newer annotation version than the one we used when the paper was written. If you find anything odd, please let me know and I will look into it!
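For readers trying to picture the "remove sites that overlap more than one gene" step, here is a minimal sketch on toy data. It is not the code used to build hg38_pa or mm10_pa; the coordinates, the site_id and gene columns, and the use of GenomicRanges are all assumptions made for illustration.

```r
# Toy sketch: keep only poly(A) sites that overlap exactly one gene.
# All coordinates, IDs and gene names are made up for illustration.
library(GenomicRanges)

# Hypothetical poly(A) sites (single-base positions)
sites <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(150, 450, 900), width = 1),
  strand   = "+",
  site_id  = c("pa1", "pa2", "pa3")
)

# Hypothetical gene models; geneA and geneB overlap between 400 and 500
genes <- GRanges(
  seqnames = "chr1",
  ranges   = IRanges(start = c(100, 400, 800), end = c(500, 700, 1000)),
  strand   = "+",
  gene     = c("geneA", "geneB", "geneC")
)

# Number of gene ranges each site falls into (strand-aware by default)
n_genes <- countOverlaps(sites, genes)

# Drop sites that are ambiguous (more than one gene) or intergenic (zero)
unique_sites <- sites[n_genes == 1]

# Each remaining site has exactly one hit, so the hits line up one-to-one
hits <- findOverlaps(unique_sites, genes)
unique_sites$gene <- genes$gene[subjectHits(hits)]

unique_sites
```

In this toy example pa2 sits in the region shared by geneA and geneB and is dropped, while pa1 and pa3 are kept. A real pipeline would also need to decide how to treat strand, gene versus transcript ranges, and annotation versions, none of which are specified in this thread.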

esebesty commented 11 months ago

Hi, I would be interested in replicating the annotations available in the package. Are the scripts and R package versions used to generate the hg38_pa and mm10_pa datasets available somewhere? For example, the polyAsite database reports "Number of poly(A) site clusters: 569,005". How exactly did this lead to the 68423 3' UTRs? Which annotatr package version, which annotation version, what exact filtering steps, etc.? Something similar to what is usually described for external data in Bioconductor packages.

eddieimada commented 11 months ago

I believe the closest script to obtain the current annotations would be: https://github.com/eddieimada/REPAC_paper/blob/main/code/Bcell/00createBED.R

The input bed file used in this script was obtained using QAPA build with ENSEMBL v102 and PolyaDB v2 annotations.

I'm currently working on putting the package on Bioconductor. When I do, I will update the annotations and log the versions.

esebesty commented 8 months ago

Looking forward to the Bioconductor package! I just checked the linked R script, and it seems that you are further processing the output of the qapa build command for the mouse data.

Is this also true for the human data? It looks like there is another 00createBED.R script, referencing a hg38 utr list here, that might be coming from QAPA or other places.

reck999 commented 3 months ago

Hi Eddie,

Thank you for an excellent package! I've had a lot of success using it to study my mouse and human sets with my own bams. I work in C. elegans too and would love to use REPAC to probe APA changes in C. elegans. Would you be able to create a reference file from PolyASite for C. elegans? I've taken a few stabs at it myself, and even though C. elegans is supported by PolyASite, I haven't gotten it to work. Any advice would be great too. Thank you!

Randall