ernstlab / full_stack_ChromHMM_annotations

Data of genome annotation from full-stack ChromHMM model trained with 1032 datasets from 127 reference epigenomes

State annotation inconsistencies (GitHub and the Genome Biology paper results) #2

Closed balwierz closed 8 months ago

balwierz commented 8 months ago

Hello,

Thanks for this resource! I am trying to use it, but I have two problems.

1) In state_annotations_processed.csv, some of the long descriptions look strange to me. Several of them contain newlines, dots, and stars. This one, for instance:

58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench, 
Neurosph, Mystat, IM90, Adipose. Others quiescent",polycomb repressed,polycomb repressed,#808080,22

contains a newline and mentions another state, 24_ReprPC, in its long description. There is no such state, and I do not understand what mentioning one state in the description of another state would mean.

2) I also downloaded the following files from the Genome Biology paper: https://public.hoffman2.idre.ucla.edu/ernst/2K9RS//full_stack/full_stack_annotation_public_release/hg38/state_annotations_processed.csv, https://public.hoffman2.idre.ucla.edu/ernst/2K9RS//full_stack/full_stack_annotation_public_release/hg38/hg38_genome_100_segments.bed.gz

ReprPC6 has the same line in the downloaded state_annotations_processed.csv as in github:

58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench,
Neurosph, Mystat, IM90, Adipose. Others quescient",polycomb repressed,polycomb repressed,#808080,22

But ReprPC6 is paired with 23 in the hg38_genome_100_segments.bed.gz file (chr1 79400 79600 23_ReprPC6). That is off by one from the last column in the annotation (22). And state 23 is something completely different in the state annotations:

23_PromBiv flanking in other cell types. H3K27me3 and H3K4me1","bivalent promoter; 
non-spec",bivalent promoters,#7030A0,89

So when I encounter 23_ReprPC6 in the bed file, which state does it refer to?

Best regards, Piotr

havu73 commented 8 months ago

Question 1: The file state_annotations_processed.csv outlines the characteristics of all the states. The long characterizations are meant to be long, and sometimes span multiple lines, because they are copied from the long characterizations of states in the "characterize_full_stack_state" tab of the AF3_full_stack_states_characterizations.xlsx file (in the supplementary materials of the paper). In fact, they are not strangely long but rather short compared with what a very detailed characterization of each state would need. They contain our own comments on the states, based on the various analyses outlined in the paper. I encourage you to look at our AF3-5 to see all the states' characteristics.
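Because some long characterizations span several lines, reading the CSV line by line will split a record in two. A minimal sketch of reading such a record with Python's standard csv module follows; the sample is the row quoted in this thread, the fields are addressed by position because the real header names are not shown here, and the helper names read_records and flatten are hypothetical.

```python
import csv
import io

# The record quoted in this thread; its quoted third field contains a newline.
SAMPLE = (
    '58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench,\n'
    'Neurosph, Mystat, IM90, Adipose. Others quescient",'
    'polycomb repressed,polycomb repressed,#808080,22\n'
)


def read_records(handle):
    """Return one list of fields per CSV record, keeping embedded newlines."""
    return list(csv.reader(handle))


def flatten(text):
    """Collapse newlines and repeated spaces in a description into single spaces."""
    return " ".join(text.split())


records = read_records(io.StringIO(SAMPLE))
naive_lines = SAMPLE.splitlines()  # splitting on newlines breaks the record

print(len(naive_lines), "physical lines,", len(records), "CSV record")
print(flatten(records[0][2]))
```

When reading the real file from disk, the csv documentation recommends opening it with newline="" so that newlines inside quoted fields are preserved.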

The particular comment you referred to ("58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench, Neurosph, Mystat, IM90, Adipose. Others quescient",polycomb repressed,polycomb repressed,#808080,22") means that state ReprPC6 in our paper is most enriched for the state 24_ReprPC (polycomb-repressed element marked only by H3K27me3) in multiple cell types, and is most enriched for the quiescent state in the other cell types. Here, 24_ReprPC is a state in the per-cell-type model that was used in Roadmap to annotate 127 biosamples (https://egg2.wustl.edu/roadmap/web_portal/imputed.html#chr_imp). This comment is based on our enrichment analyses comparing the full-stack states with the 25-state per-cell-type annotations, shown in Supp. Fig. 8-9 (AF1) and also in Excel form in AF5. For each full-stack state, we analyzed its enrichment for, and probability of overlapping, each of the 25 per-cell-type states in the annotations of the 127 biosamples from Roadmap. These analyses help us understand whether each full-stack state is associated with any of the chromatin states used to annotate individual cell types/biosamples.
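As a rough illustration of what "enrichment" and "probability of overlapping" can mean for a pair of states, the sketch below uses one common definition of fold enrichment: the observed overlap fraction divided by the fraction expected by chance. This definition is an assumption made for illustration and is not necessarily the exact computation used in the paper; the function names and the toy base-pair counts are hypothetical.

```python
def overlap_probability(overlap_bp, state_bp):
    """Fraction of a full-stack state's bases that fall in a per-cell-type state."""
    return overlap_bp / state_bp


def fold_enrichment(overlap_bp, state_bp, other_bp, genome_bp):
    """Observed overlap fraction divided by the fraction expected by chance.

    overlap_bp: bases covered by both states
    state_bp:   bases covered by the full-stack state
    other_bp:   bases covered by the per-cell-type state
    genome_bp:  total bases considered
    """
    expected = other_bp / genome_bp
    return overlap_probability(overlap_bp, state_bp) / expected


# Made-up toy counts, for illustration only (not taken from the paper).
toy = fold_enrichment(overlap_bp=50, state_bp=100, other_bp=1_000, genome_bp=10_000)
print(f"overlap probability: {overlap_probability(50, 100):.2f}")
print(f"fold enrichment: {toy:.2f}")
```

Under this definition, a value above 1 means the two states overlap more often than their genome coverage alone would predict.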

Question 2: We apologize for the confusion. Quick answer: We need to update the file state_annotations_processed.csv to fully clarify all the columns and to remove the columns that are confusing to end users. In the meantime, please look at the file AF3 published with our paper to see our state names and state characterizations. Please check back on Thursday 03.07.2024 for the most up-to-date state_annotations_processed.csv files and readme. Note that nothing about our results or states has changed; the files just need to be formatted in a way that avoids confusion.

Long answer: Our states, as denoted in all the bed files that we published, are 1-based (from 1_GapArtf1 to 100_TSS2). The most accurate file outlining the states is our AF3, tab "characterize_full_stack_state". In this Excel file, you will see that the first column (state_order_by_group) is 0-based (0 to 99). The second column, 'state_numbers in raw full_stack annotation', shows the state numbers that were first output by ChromHMM; these correspond to the state numbers in the column 'state' in state_annotations_processed.csv. This column was initially useful to us for translating from the raw states (output by ChromHMM) to the final state names (ordered by us, based on the states' biological characterizations). We now realize, thanks to your report, that these numbers are more confusing than helpful. We will remove them and update state_annotations_processed.csv to reflect the only state names that matter to you (the ones we provided in our paper, 1_GapArtf1 to 100_TSS2). In the meantime, use the file AF3 to see the states' characterizations.
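The numbering described above can be sketched in a few lines of Python. The sketch assumes that the last column of the CSV rows quoted in this thread is the 0-based state_order_by_group, which is consistent with both examples here (22 for 23_ReprPC6 and 89 for 90_BivProm3). The function names bed_label and split_bed_label are hypothetical.

```python
def bed_label(state_order_by_group, mnemonic):
    """Build the 1-based label used in the bed files from a 0-based order."""
    return f"{state_order_by_group + 1}_{mnemonic}"


def split_bed_label(label):
    """Split a bed-file label such as '23_ReprPC6' into (23, 'ReprPC6')."""
    number, mnemonic = label.split("_", 1)
    return int(number), mnemonic


# (0-based order, mnemonic) pairs taken from this thread.
examples = [(0, "GapArtf1"), (22, "ReprPC6"), (89, "BivProm3"), (99, "TSS2")]
for order, mnemonic in examples:
    label = bed_label(order, mnemonic)
    print(order, "->", label, "->", split_bed_label(label))
```

Under this assumption, 23_ReprPC6 in the bed file refers to the ReprPC6 row of the CSV (last column 22), not to the 23_PromBiv state of the per-cell-type model mentioned in another row's description.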

The annotation you mentioned, where a state is given the detailed comment "23_PromBiv flanking in other cell types. H3K27me3 and H3K4me1", corresponds to state BivProm3 (named 90_BivProm3 in the .bed.gz file) and to the following line in our state characterization in AF3:

[image: the corresponding row of the AF3 state characterization]

It means that, based on our analyses, we commented that this state is most enriched for two states in the 25-state per-cell-type model from Roadmap (https://egg2.wustl.edu/roadmap/web_portal/imputed.html#chr_imp). First, it is most enriched for state 24_ReprPC (polycomb-repressed state) in cell types profiled in ENCODE, Blood & T-cell, Digestive, HSC & B-cell, Sm. Muscle. Second, it is most enriched for state 23_PromBiv (bivalent promoter state) in all other cell types/biosamples among the 127 Roadmap samples that we analyzed. In other words, this state in our model (90_BivProm3) corresponds to a bivalent promoter in some cell types and to a polycomb-repressed region in others.