Closed balwierz closed 8 months ago
Question 1: The file state_annotations_processed.csv outlines the characteristics of all the states. The long characterizations are expected to be long, sometimes spanning multiple lines, because they are copied from the long characterizations in the file AF3_full_stack_states_characterizations.xlsx, tab "characterize_full_stack_state" (in the supplementary materials of the paper). In fact, they are rather short compared with what a very detailed characterization of each state would require. They contain our own comments on the states, based on the various analyses outlined in the paper. I encourage you to look at our AF3-5 to see all the states' characteristics.
The particular comment you referred to ("58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench, Neurosph, Mystat, IM90, Adipose. Others quescient",polycomb repressed,polycomb repressed,#808080,22") means that state ReprPC6 in our paper shows its strongest enrichment for state 24_ReprPC (a polycomb-repressed element marked only by H3K27me3) in multiple cell types, and is most enriched for the quiescent state in the other cell types. Here, 24_ReprPC is a state in the per-cell-type model that was used in Roadmap to annotate 127 biosamples (https://egg2.wustl.edu/roadmap/web_portal/imputed.html#chr_imp). This comment is based on our enrichment analyses between the full-stack states and the 25-state per-cell-type annotations, shown in Supp. Fig. 8-9 (AF1) and also in Excel form in AF5. For each full-stack state, we analyzed its enrichment for, and probability of overlapping, each of the 25 per-cell-type states in the annotations of the 127 Roadmap biosamples. These analyses help us understand whether each full-stack state is associated with any of the chromatin states used to annotate individual cell types or biosamples.
Question 2: We apologize for the confusion. Quick answer: we should update the file state_annotations_processed.csv to clarify all the columns and remove the columns that confuse end users. In the meantime, please look at file AF3, published with our paper, to see our state names and state characterizations. Please check back on Thursday 03.07.2024 for the most up-to-date state_annotations_processed.csv files and readme. Note that nothing about our results or states has changed; the files just need to be formatted in a way that avoids confusion.
Long answer: Our states, as denoted in all the bed files that we published, are 1-based (running from 1_GapArtf1 to 100_TSS2). The most accurate file outlining the states is our AF3, tab "characterize_full_stack_state". In this Excel file, you will see that the first column (state_order_by_group) is 0-based (0 --> 99). The second column, 'state_numbers in raw full_stack annotation', shows the state numbers that were originally output by ChromHMM; these correspond to the numbers in the column 'state' in the files state_annotations_processed.csv. This column was initially useful to us for mapping the raw states (as output by ChromHMM) to the final state names (ordered by us, based on the states' biological characterizations). We now realize, thanks to your report, that these numbers are more confusing than helpful. We will remove them and update state_annotations_processed.csv to reflect the only state names that matter to you (the ones we provided in our paper, 1_GapArtf1 --> 100_TSS2). In the meantime, use file AF3 to see the states' characterizations.
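To make the three numbering schemes concrete, here is a minimal Python sketch that looks up an annotation row by state name rather than by number. The data row is the one quoted earlier in this thread; the header names (including `name` and `state_order_by_group` as CSV columns) are assumptions made for illustration and may not match the real header of state_annotations_processed.csv. The helper functions are hypothetical.

```python
import csv
import io

# One data row copied from this thread. The header line is ASSUMED for
# illustration; check the real file before relying on these column names.
SAMPLE_CSV = '''state,name,long_description,group,short_group,color,state_order_by_group
58,ReprPC6,"24_ReprPC (H3K27me3) in Brain, Epithelial, muscles, Mesench, Neurosph, Mystat, IM90, Adipose. Others quescient",polycomb repressed,polycomb repressed,#808080,22
'''


def split_bed_label(label):
    """Split a bed-file label such as '23_ReprPC6' into (23, 'ReprPC6')."""
    number, name = label.split("_", 1)
    return int(number), name


def load_annotations(handle):
    """Index annotation rows by state name, the one stable identifier."""
    return {row["name"]: row for row in csv.DictReader(handle)}


annotations = load_annotations(io.StringIO(SAMPLE_CSV))

# Label taken from the bed line 'chr1 79400 79600 23_ReprPC6'.
number, name = split_bed_label("23_ReprPC6")
row = annotations[name]

zero_based_order = int(row["state_order_by_group"])  # 0-based ordering (0..99)
raw_state = int(row["state"])                        # raw ChromHMM state number

print(f"bed label number: {number} (1-based)")
print(f"0-based order in annotation: {zero_based_order}")
print(f"raw ChromHMM state: {raw_state}")
```

Joining on the name avoids both the off-by-one between the 0-based order and the 1-based bed labels and any accidental match against the raw ChromHMM numbers.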
The annotation you mentioned, where a state is given the detailed comment "23_PromBiv flanking in other cell types. H3K27me3 and H3K4me1", corresponds to state BivProm3 (named 90_BivProm3 in the .bed.gz file) and to the following line in our state characterization in AF3:
It means that, based on our analyses, we noted that this state is most enriched for two states in the 25-state per-cell-type model from Roadmap (https://egg2.wustl.edu/roadmap/web_portal/imputed.html#chr_imp). First, it is most enriched for state 24_ReprPC (polycomb repressed state) in cell types profiled in ENCODE, Blood & T-cell, Digestive, HSC & B-cell, and Sm. Muscle. Second, it is most enriched for state 23_PromBiv (bivalent promoter state) in all other cell types/biosamples among the 127 Roadmap samples that we analyzed. In other words, this state in our model (90_BivProm3) corresponds to a bivalent promoter in some cell types and to a polycomb-repressed region in others.
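As a rough illustration of how such a comment can be read, the Python sketch below groups cell-type groups by the per-cell-type state with the highest enrichment. The enrichment values are invented placeholders, not results from the paper, and `top_state_by_group` is a hypothetical helper, not part of our pipeline.

```python
from collections import defaultdict

# INVENTED enrichment values for a single full-stack state, for illustration
# only. Real values are in Supp. Fig. 8-9 (AF1) and AF5.
toy_enrichment = {
    "Blood & T-cell": {"23_PromBiv": 4.0, "24_ReprPC": 9.0},
    "Digestive": {"23_PromBiv": 3.0, "24_ReprPC": 7.0},
    "Brain": {"23_PromBiv": 8.0, "24_ReprPC": 5.0},
}


def top_state_by_group(enrichment):
    """Map each per-cell-type state to the cell-type groups where it ranks first."""
    grouped = defaultdict(list)
    for cell_group, scores in enrichment.items():
        best_state = max(scores, key=scores.get)
        grouped[best_state].append(cell_group)
    return dict(grouped)


summary = top_state_by_group(toy_enrichment)
for state, cell_groups in summary.items():
    print(f"{state}: {', '.join(cell_groups)}")
```

A comment such as "24_ReprPC in ... Others quiescent" is a hand-written summary of this kind of grouping.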
Hello,
Thanks for this resource! I am trying to use it, but I have two problems.
1) In state_annotations_processed.csv there are long descriptions that look strange to me. Several of them contain new lines, dots, stars. This one, for instance, contains a new line and mentions another state in the long description, 24_ReprPC, but there is no such state, and I do not understand what a mention of a state in a description of another state would mean.
2) I also downloaded the following files from the Genome Biology paper: https://public.hoffman2.idre.ucla.edu/ernst/2K9RS//full_stack/full_stack_annotation_public_release/hg38/state_annotations_processed.csv, https://public.hoffman2.idre.ucla.edu/ernst/2K9RS//full_stack/full_stack_annotation_public_release/hg38/hg38_genome_100_segments.bed.gz
ReprPC6 has the same line in the downloaded state_annotations_processed.csv as in github:
But ReprPC6 is paired with 23 in the hg38_genome_100_segments.bed.gz file:
chr1 79400 79600 23_ReprPC6
That is off by one from the last column in the annotation (22). And state 23 is something completely different in state annotations:
So when I encounter 23_ReprPC6 in the bed file, which state does it refer to?
Best regards, Piotr